[08:10:24] hello folks
[08:10:28] anybody working on netflow1002?
[08:11:07] the ens5 interface seems to have no connectivity
[08:15:54] ah it runs on ganeti1015, that was reimaged yesterday
[08:15:57] moritzm: --^
[08:20:17] checking
[08:26:15] I don't believe it's an issue with the Ganeti server it's running on; there's also kubestagetcd1005 running on ganeti1015, which is fine. Having a closer look.
[08:33:17] moritzm: ah okok, didn't check the other VMs, but the timings from Icinga are suspicious (netflow1002 has been alarming since ~20h ago)
[09:10:00] btullis: in light of the thread re: job timers, should we go back to the previous way for the analytics timers?
[09:12:42] godog: Do you mean abandoning this change for now? https://gerrit.wikimedia.org/r/c/operations/puppet/+/841924 Or do you mean reverting to `monitoring_enabled` as opposed to `send_email`?
[09:13:29] btullis: the latter
[09:13:44] As an aside, in the longer run we're hoping to move all of our existing job timers into Airflow, so they will be out of Puppet.
[09:15:05] I don't /think/ we need to revert to monitoring_enabled. We've got a couple of stray emails from the test cluster that previously weren't set to notify, but I'm sure we can track them down.
[09:15:41] btullis: ok! sounds good to me, let me know if we should and I can follow up
[09:16:02] Cool. It's just the emails for individual timers vs. the combined systemd check that were a concern, I think.
[09:16:54] makes sense yeah, I've updated the task with a longer-term vision that is meant to address that too
[09:17:23] btullis: unrelated, but I don't know if you saw there are a few alert emails held for moderation in
[09:17:26] data-engineering-alerts@
[09:17:47] the emails from alertmanager with From: sre-observability@
[09:18:42] Yes, thanks. Trying to fine-tune the message acceptance rules in T315486
[09:18:43] T315486: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486
[09:29:58] cheers!
[11:08:07] yet another DNS auth server, this time from Meta, in Go: https://github.com/facebookincubator/dns
[15:46:42] sukhe: ok to merge your cp4050 change?
[15:48:36] oh
[15:48:38] yes please
[15:48:42] ok, will merge
[15:48:43] thanks
[15:48:57] and done
[15:49:04] :D
[16:57:18] I think I know the answer to this, but... suppose I wanted pcc to test a patch on every single .wikimedia.cloud host that it knows about; is there a way to instruct the UI to do so?
[16:59:52] andrewbogott: there is, but please don't, as it would likely take up all the space on the worker doing the pcc.
[17:00:14] I was thinking I would start it at midnight on a Friday :)
[17:00:24] you should be able to use `Hosts: O:wmcs::instance` to get something that covers all variations that exist in wmcs
[17:00:33] that would also be big, so also worth cleaning up afterwards
[17:00:35] (this is for one of those 'rename a variable everywhere' patches)
[17:01:02] andrewbogott: I'd try with `Hosts: O:wmcs::instance` first and see if that gets you what you want
[17:01:16] great! Can you tell me how/what to clean up afterwards?
[17:03:23] andrewbogott: sure, you can just delete the folder under /srv/jenkins/puppet-compiler/output/$jobnumber on the worker that did the job
[17:03:41] got it, thanks
[17:04:02] no probs
[17:07:24] oops, this pcc worker is already running out of space. Is that cleaned up periodically/automatically? Or should I just go in there and wipe out all the older-than-a-few-days runs?
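For context, a minimal sketch of the workflow being discussed: running pcc against every Cloud VPS instance variant via the `utils/pcc` wrapper in an operations/puppet checkout, with the `O:wmcs::instance` query suggested above. The change number shown is the Gerrit change linked earlier in the conversation and stands in for whatever change you are testing.

```
# Sketch, assuming a local checkout of operations/puppet (which ships the
# utils/pcc wrapper) and a Gerrit change number to compile against.
CHANGE=841924                              # example value: the change linked above
./utils/pcc "$CHANGE" 'O:wmcs::instance'   # compile against all wmcs instance variants
# The report lands on the compiler worker under
# /srv/jenkins/puppet-compiler/output/<jobnumber>/ ; a run this broad produces
# a large report, so delete that directory once you have reviewed it.
```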
[17:09:10] andrewbogott: I generally delete any folder older than a week and larger than 1G. I think you can probably expect a request for me to increase the disks on those tomorrow, as I think I already cleaned up this week.
[17:09:32] ok :)
[19:08:07] jbond:
[19:08:10] https://www.irccloud.com/pastebin/W3sh0RG9/
[19:08:54] No obvious big things to delete. What do you think about my deleting everything older than two weeks?
[19:19:52] andrewbogott: I have cleaned up, but for future reference it's the whole folder you want to check, so something like `du -hs ./*` is what you need
[19:20:45] I may not understand what you're saying... -d0 . still does a recursive count, it just reports the grand total without subtotals.
[19:22:20] https://www.irccloud.com/pastebin/StJme43v/
[19:22:23] yes, and in your paste it showed that there was 56GB. By doing the command above you show the size of each folder in the current dir; each folder == a pcc report. Anything over a GB and older than a week can get deleted.
[19:22:33] for instance there was one that was 11GB
[19:22:36] ooh, for the find, I see
[19:24:59] andrewbogott: fyi, I saw the report you did before was without any hosts specification. That's not what you want, as that targets production. You need to use something like the query I posted earlier.
[19:25:20] Oh yeah, I need to check both.
[19:25:25] ./utils/pcc $changeid O:wmcs::instance
[19:25:53] I honestly don't think you need to check anything other than sretest in production for that change
[19:26:51] I don't think I know what 'sretest' means
[19:27:36] oh sorry, sretest1001.eqiad.wmnet is a host SRE use for testing simple things with the base policy; it's a good candidate for pcc if you just want $somehost
[19:29:11] ok :) you're optimistic about unintended consequences.
[19:30:02] yes I am :) it's also not a problem to run pcc as you were before, though
[19:30:22] I was more wanting to say that that query won't cover the wmcs hosts you wanted it to
[19:31:20] (or should I say the lack of a query)
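A sketch of the cleanup heuristic described above (each folder under the output directory is one pcc report; anything over a GB and older than a week can go), assuming a GNU userland on the pcc worker and report directory names without spaces:

```
# Sketch of the cleanup rule above, run on the pcc worker.
cd /srv/jenkins/puppet-compiler/output
du -hs ./* | sort -h    # one folder per pcc report; surfaces the big ones
# Delete report folders that are both older than a week and over 1G
# (GNU du -s reports 1K blocks, so 1G == 1048576 blocks):
find . -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec du -s {} + \
  | awk '$1 > 1048576 {print $2}' | xargs -r rm -rf
```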