[08:10:24] hello folks
[08:10:28] anybody working on netflow1002?
[08:11:07] the ens5 interface seems to have no connectivity
[08:15:54] ah it runs on ganeti1015, that was reimaged yesterday
[08:15:57] moritzm: --^
[08:20:17] checking
[08:26:15] I don't believe it's an issue with the Ganeti server it's running on; there's also kubestagetcd1005 running on ganeti1015, which is fine. Having a closer look.
[08:33:17] moritzm: ah okok, didn't check the other VMs, but the timings from Icinga are suspicious (netflow1002 has been alarming since ~20h ago)
[09:10:00] btullis: in light of the thread re: job timers, should we go back to the previous way for the analytics timers?
[09:12:42] godog: Do you mean abandoning this change for now? https://gerrit.wikimedia.org/r/c/operations/puppet/+/841924 Or do you mean reverting to `monitoring_enabled` as opposed to `send_email`?
[09:13:29] btullis: the latter
[09:13:44] As an aside, in the longer run we're hoping to move all of our existing job timers into Airflow, so they will be out of Puppet.
[09:15:05] I don't /think/ we need to revert to monitoring_enabled. We've got a couple of stray emails from the test cluster that previously weren't set to notify, but I'm sure we can track them down.
[09:15:41] btullis: ok! sounds good to me, let me know if we should and I can follow up
[09:16:02] Cool. It's just the emails for individual timers vs. the combined systemd check that were a concern, I think.
[09:16:54] makes sense yeah, I've updated the task with a longer-term vision that is meant to address that too
[09:17:23] btullis: unrelated, but I don't know if you saw there are a few alert emails held for moderation in
[09:17:26] data-engineering-alerts@
[09:17:47] the emails from alertmanager with From: sre-observability@
[09:18:42] Yes, thanks. Trying to fine-tune the message acceptance rules in T315486
[09:18:43] T315486: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486
[09:29:58] cheers!
[11:08:07] yet another DNS auth server, this time from Meta, in Go: https://github.com/facebookincubator/dns
[15:46:42] sukhe: ok to merge your cp4050 change?
[15:48:36] oh
[15:48:38] yes please
[15:48:42] ok, will merge
[15:48:43] thanks
[15:48:57] and done
[15:49:04] :D
[16:57:18] I think I know the answer to this, but... suppose I wanted pcc to test a patch on every single .wikimedia.cloud host that it knows about; is there a way to instruct the UI to do so?
[16:59:52] andrewbogott: there is, but please don't, as it would likely take up all the space on the worker doing the pcc.
[17:00:14] I was thinking I would start it at midnight on a Friday :)
[17:00:24] you should be able to use `Hosts: O:wmcs::instance` to get something that covers all variations that exist in wmcs
[17:00:33] that would also be big, so also worth cleaning up afterwards
[17:00:35] (this is for one of those 'rename a variable everywhere' patches)
[17:01:02] andrewbogott: I'd try with `Hosts: O:wmcs::instance` first and see if that gets you what you want
[17:01:16] great! Can you tell me how/what to clean up afterwards?
[17:03:23] andrewbogott: sure, you can just delete the folder under /srv/jenkins/puppet-compiler/output/$jobnumber on the worker that did the job
[17:03:41] got it, thanks
[17:04:02] no probs
[17:07:24] oops, this pcc worker is already running out of space. Is that cleaned up periodically/automatically? Or should I just go in there and wipe out all the older-than-a-few-days runs?
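For context, a minimal sketch of the workflow being discussed: running pcc against every Cloud VPS instance variant via the `utils/pcc` wrapper in an operations/puppet checkout, with the `O:wmcs::instance` query suggested above. The change number shown is the Gerrit change linked earlier in the conversation and stands in for whatever change you are testing.

```
# Sketch, assuming a local checkout of operations/puppet (which ships the
# utils/pcc wrapper) and a Gerrit change number to compile against.
CHANGE=841924                              # example value: the change linked above
./utils/pcc "$CHANGE" 'O:wmcs::instance'   # compile against all wmcs instance variants
# The report lands on the compiler worker under
# /srv/jenkins/puppet-compiler/output/<jobnumber>/ ; a run this broad produces
# a large report, so delete that directory once you have reviewed it.
```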
[17:09:10] andrewbogott: I generally delete any folder older than a week and larger than 1G. I think you can probably expect a request for me to increase the disks on those tomorrow, as I think I already cleaned up this week.
[17:09:32] ok :)
[19:08:07] jbond:
[19:08:10] https://www.irccloud.com/pastebin/W3sh0RG9/
[19:08:54] No obvious big things to delete. What do you think about my deleting everything older than two weeks?
[19:19:52] andrewbogott: I have cleaned up, but for future reference it's the whole folder you want to check, so something like `du -hs ./*` is what you need
[19:20:45] I may not understand what you're saying... -d0 . still does a recursive count, it just reports the grand total without subtotals.
[19:22:20] https://www.irccloud.com/pastebin/StJme43v/
[19:22:23] yes, and in your paste it showed that there was 56GB. By doing the command above you show the size of each folder in the current dir; each folder == a pcc report. Anything over a GB and older than a week can get deleted.
[19:22:33] for instance there was one that was 11GB
[19:22:36] ooh, for the find, I see
[19:24:59] andrewbogott: fyi, I saw the report you did before was without any hosts specification. That's not what you want, as that targets production. You need to use something like the query I posted earlier.
[19:25:20] Oh yeah, I need to check both.
[19:25:25] ./utils/pcc $changeid O:wmcs::instance
[19:25:53] I honestly don't think you need to check anything other than sretest in production for that change
[19:26:51] I don't think I know what 'sretest' means
[19:27:36] oh sorry, sretest1001.eqiad.wmnet is a host SRE use for testing simple things with the base policy; it's a good candidate for pcc if you just want $somehost
[19:29:11] ok :) you're optimistic about unintended consequences.
[19:30:02] yes I am :) it's also not a problem to run pcc as you were before, though
[19:30:22] I was more wanting to say that that query won't cover the wmcs hosts you wanted it to
[19:31:20] (or should I say the lack of a query)
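A sketch of the cleanup heuristic described above (each folder under the output directory is one pcc report; anything over a GB and older than a week can go), assuming a GNU userland on the pcc worker and report directory names without spaces:

```
# Sketch of the cleanup rule above, run on the pcc worker.
cd /srv/jenkins/puppet-compiler/output
du -hs ./* | sort -h    # one folder per pcc report; surfaces the big ones
# Delete report folders that are both older than a week and over 1G
# (GNU du -s reports 1K blocks, so 1G == 1048576 blocks):
find . -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec du -s {} + \
  | awk '$1 > 1048576 {print $2}' | xargs -r rm -rf
```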