[05:43:04] Reverted the fix for the `check_cirrus_settings.py` script; it was generating a lot of noise. codfw servers have real drift that they're alerting on, but I'm not sure about the eqiad ones that were alerting... looks like the hosts that were alerting thought they saw nothing for `remote.psi.seeds`, but I think those should have been present
[05:43:14] https://www.irccloud.com/pastebin/8lDWhcIe/two_critical_types
[05:43:31] Can fix the codfw drift tomorrow
[10:45:20] lunch
[11:01:16] lunch
[12:14:54] errand&lunch
[14:49:56] greetings! Overslept today, my apologies!
[14:51:25] o/
[14:54:35] inflatador: hello! I betcha you'd get more useful responses to your jumbo frames question on the ops mailing list
[14:54:44] Operations List
[14:56:29] ah, thanks ottomata!
[15:17:17] I see a lot of nagios alerts for ES in DFW from early morning today UTC time, anyone know why?
[15:23:50] inflatador: dfw you mean codfw?
[15:24:36] correct
[15:26:40] inflatador: if it's related to check_cirrus_settings.py, I think Ryan said he would fix that today; if not, do you have a name for the alert?
[15:28:17] dcausse yeah, looks like that must be it. Sorry, should have read the alerts more closely
[15:28:26] np!
[15:55:29] \o
[16:04:52] WCQS celebration meeting starting: https://meet.google.com/wsv-xrsz-gfk
[16:05:04] ryankemper, dcausse, mpham, cbogen_ ^
[16:05:06] carly and i running a little late from previous meeting
[16:05:11] oops
[16:57:23] quick workout, back in ~30
[17:37:30] started a single 2-hr run of saneitizer to see how it goes, hugh identified some issues with the k8s deployment and has deployed changes
[17:38:02] i suppose with codfw out of date it will be heavier than usual
[17:48:47] and back
[17:49:44] I think I did something that helped cirrus search jobs https://logstash.wikimedia.org/goto/9a1f0648d2b4fb48ecbb9cd10d108984
[17:50:19] Amir1: yes, it turns out some indices in codfw went read-only and we didn't notice :S
[17:50:34] aah
[17:50:36] Thanks
[17:50:45] Amir1: but i doubt that is the primary issue, we were seeing the queue configured for 100 concurrent workers but only running 10 concurrent jobs
[17:51:10] it would have increased the levels a bit with retries though
[17:52:46] I see
[17:52:50] Thanks
[18:01:58] * ebernhardson would like to get rid of all these not-really failures where it tries to delete archives from cloudelastic, but we don't put archive indices on cloudelastic so they always fail.
[18:31:49] Not sure entirely what to do with saneitizer :S Hugh looked into it and improved the perf of the existing deployment, but it isn't enough to meet our use case. I suppose potentially we could shard the jobs further, multiple queues per cluster, allowing multiple pods to process jobs. Not sure how that would work yet
[18:32:28] could revisit the decision to always queue ElasticaWrite instead of running them in-process, but that was done so that if one cluster falls behind it doesn't lag everything else with it
[18:33:57] i suppose if we could bundle more work together from saneitizer->elasticawrite, such that we are sending tens of docs per job instead of one, it would probably help, but seems like a bigger architectural change
[18:44:48] sorry, been back for awhile now ;p
[18:58:16] aand lunch
[19:06:52] hmm something changed and cindy can't get a successful `npm install` of cirrus :S
[19:13:42] oh, it's only trying to test an upgrade to webdriver 6 and failing, but then that makes the future runs fail.
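
For the read-only codfw indices mentioned above (17:50), a common cause in Elasticsearch is the `index.blocks.read_only_allow_delete` block that nodes apply when they cross the disk flood-stage watermark; whether that was the cause here is an assumption. A minimal sketch for spotting and clearing the block, assuming direct HTTP access to the cluster (host and port below are placeholders):

    # Sketch only: find and clear read_only_allow_delete blocks.
    # Host/port are placeholders; run against the affected cluster once
    # the underlying cause (often disk pressure) has been resolved.
    import requests

    ES = "http://localhost:9200"

    # Indices that currently carry the read-only-allow-delete block.
    settings = requests.get(f"{ES}/_all/_settings", timeout=10).json()
    blocked = [
        name for name, cfg in settings.items()
        if cfg["settings"]["index"].get("blocks", {}).get("read_only_allow_delete") == "true"
    ]
    print("read-only indices:", blocked)

    # Setting the block to null removes it entirely, rather than pinning it
    # to "false", so Elasticsearch can re-apply it if disk fills up again.
    resp = requests.put(
        f"{ES}/_all/_settings",
        json={"index.blocks.read_only_allow_delete": None},
        timeout=10,
    )
    resp.raise_for_status()
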
[19:22:55] * ebernhardson should have found a way to snapshot the lxc container instead of this mess...
[20:16:45] again, been back awhile...comm skills already on weekend break
[23:29:26] Look out weekend, here I come! https://www.youtube.com/watch?v=KSC-8mxfXKI
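
On the snapshot wish at 19:22: if the Cindy container runs under LXD (an assumption, as is the container name used below), taking a snapshot before a risky dependency upgrade and rolling back afterwards is a couple of commands; a sketch:

    # Sketch only: snapshot the browser-test container before an upgrade,
    # so a broken `npm install` can be rolled back. Container and snapshot
    # names are made up for illustration; assumes the LXD "lxc" CLI.
    import subprocess

    CONTAINER = "cindy"
    SNAPSHOT = "pre-webdriver-upgrade"

    # Take a snapshot of the current container state.
    subprocess.run(["lxc", "snapshot", CONTAINER, SNAPSHOT], check=True)

    # Later, if the upgrade breaks the environment, roll back:
    # subprocess.run(["lxc", "restore", CONTAINER, SNAPSHOT], check=True)
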