[06:00:35] <_joe_> !incidents
[06:00:36] 3185 (RESOLVED) [FIRING:1] ProbeDown (10.2.1.32 ip4 wdqs-ssl:443 probes/service http_wdqs-ssl_ip4 ops page codfw prometheus sre)
[06:00:36] 3183 (RESOLVED) [FIRING:1] ProbeDown (10.2.1.32 ip4 wdqs-ssl:443 probes/service http_wdqs-ssl_ip4 ops page codfw prometheus sre)
[11:18:01] vgutierrez: wrt ganeti/eqsin, indeed I have. the servers in eqsin have been refreshed (and expanded, eqsin didn't have the fourth Ganeti node yet), so the current cluster consists of 5004-5007, 500[12] are decommed, 5003 will be some time today
[11:37:58] effie: You have a patch, something about adding two hosts to memcached cluster
[11:38:17] mmmm I have that too on my puppetmaster
[11:38:25] I guess we beat the lock
[11:38:31] I will no mine
[11:38:44] slyngs: please go ahead and merge
[11:38:55] effie: Okay, I'll merge :-)
[11:38:58] thank you!
[11:39:19] Done
[11:39:28] cheers !
[12:23:05] Anyone got ideas on T325056? It is not reproducible on beta cluster, but I see it on enwiki.
[12:23:05] T325056: Can't log in or out - Invalid CSRF token - https://phabricator.wikimedia.org/T325056
[12:23:15] kostajh: there's an outage going on
[12:24:44] Ack, thank you
[12:59:41] !incidents
[12:59:42] 3186 (RESOLVED) [FIRING:1] ProbeDown (10.2.2.29 ip4 sessionstore:8081 probes/service http_sessionstore_ip4 ops page eqiad prometheus sre)
[12:59:42] 3185 (RESOLVED) [FIRING:1] ProbeDown (10.2.1.32 ip4 wdqs-ssl:443 probes/service http_wdqs-ssl_ip4 ops page codfw prometheus sre)
[12:59:42] 3183 (RESOLVED) [FIRING:1] ProbeDown (10.2.1.32 ip4 wdqs-ssl:443 probes/service http_wdqs-ssl_ip4 ops page codfw prometheus sre)
[15:32:06] !incidents
[15:32:07] 3186 (RESOLVED) [FIRING:1] ProbeDown (10.2.2.29 ip4 sessionstore:8081 probes/service http_sessionstore_ip4 ops page eqiad prometheus sre)
[15:32:07] 3185 (RESOLVED) [FIRING:1] ProbeDown (10.2.1.32 ip4 wdqs-ssl:443 probes/service http_wdqs-ssl_ip4 ops page codfw prometheus sre)
[15:32:07] 3183 (RESOLVED) [FIRING:1] ProbeDown (10.2.1.32 ip4 wdqs-ssl:443 probes/service http_wdqs-ssl_ip4 ops page codfw prometheus sre)
[15:32:17] hmm it looks like !incidents is broken via PRIVMSG
[15:33:43] I don't know much about sirenbot, but if there's a way to resolve those WDQS alerts LMK
[15:33:58] hmm those are flagged as resolved already
[15:34:12] I guess it's listing the incidents during the last X hours
[15:35:00] yeah, that happened ~20 hrs ago or so
[15:36:06] last 24 hours :)
[15:36:18] https://gitlab.wikimedia.org/repos/sre/vopsbot/-/blob/main/vo_api.go#L272-273
[17:21:53] I just attempted to restart the DHCP server on install1003 and it's not starting.
[17:25:39] OK, sorry. Panic over. I removed the entry for a missing file from `automation/proxies/ttyS1-115200.conf` and it's started.
[17:56:28] btullis: sorry I'm not really here, see also https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#DHCP_Automation
[17:56:54] it's weird you had to touch anything manually
[17:57:53] normally a restart of dhcp service is not needed either. was something wrong with it?
[17:58:39] I was deleting a stray bit of DHCP config from the end of October.
[17:58:43] https://www.irccloud.com/pastebin/DQhSGOFa/
[17:59:21] hmm. seems like a remnant from something done manually in the past?
[17:59:55] well, if it runs and puppet runs and does not revert things then I guess it's cleaned up
[18:00:25] Yes, this is the context. https://phabricator.wikimedia.org/T314156#8457192
[18:01:49] OK, I restarted the server manually because it says that this is what the cookbook does when reimaging and creating a new snippet.
[18:02:07] https://usercontent.irccloud-cdn.com/file/MEYfiReT/image.png
[18:03:05] https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#Reimage_operations
[18:04:39] btullis: oh, so. I think that's just outdated docs
[18:05:24] ah, well, it just lists the things the cookbook does now. ACK. gtocha
[18:05:26] gotcha
[18:07:27] sorry I can't follow this right now, please check with pa.paul and/or jb.ond. No manual intervention on the automation side of DHCP should be needed, nor manual mangling of its files that is handled by the cookbook
[18:07:36] also if you want to just test the dhcp side there is a dhcp cookbook for that
[18:07:50] sre.hosts.dhcp
[18:08:33] I think what happened is he just cleaned things up and they are good now
[18:08:45] while previously a manual special case was in place
[18:08:57] for this host specifically for debugging install issues some time ago
[18:09:02] yeah, sorry. All is good, I just had to clean up a bit of failed DHCP automation. I thought I should tell you as soon as I did something which caused the DHCP server not to restart, even if for a short time.
[19:02:02] during decom cookbook, once getting to netbox data sync, I have multiple unrelated changes :/
[19:02:23] never know whether cancel makes it worse though
[19:02:52] in this case, hosts that would be affected but are unrelated to my host: ganeti1011, parse1002, wdqs2005
[19:04:04] the first 2 are an "active -> failed" transition and the latter a "failed -> active"
[19:04:33] looking in phabricator..
[19:06:53] mutante I dunno what happened w/wdqs2005 but it's in production
[19:07:33] If I can do anything to help LMK
[19:08:29] inflatador: something or someone changed the status from "failed" to "active" which then sounds correct. thanks for confirming it. the part I am not sure about exactly is what triggered the status change
[19:10:27] parse1002 was repooled today per https://phabricator.wikimedia.org/T324949 and also changes to "active". ok then
[19:10:53] except the state changes from active to failed
[19:12:16] changelog tells me this was a manual change in netbox. so those always need a cookbook run which did not happen yet
[19:13:10] seems like I should accept the change, then set it back to active, then sync again
[19:18:42] mutante: ganeti1011 is benign. I noticed today that it was still marked as failed (although the disk was swapped)
[19:19:26] moritzm: ACK, thanks! I just accepted the diff. it's synced now
[19:19:39] just saw that ticket a moment ago
[19:20:02] setting parse1002 to active, syncing one more time. then it's clean
[19:22:16] Perhaps someone more fluent in Python can help me fix the CI error that codesearch is having
[19:22:17] > ImportError: cannot import name 'Config' from 'tox.config' (/src/.tox/.tox/lib/python3.7/site-packages/tox/config/__init__.py)
[19:22:57] Noticing it even when I recheck a patch that landed fine last month, so it appears tox or tox-wikimedia has changed from underneath between then and now.
[19:23:04] e.g. https://gerrit.wikimedia.org/r/c/labs/codesearch/+/861508 failure at https://integration.wikimedia.org/ci/job/tox-docker/29225/console
[19:23:39] https://gerrit.wikimedia.org/g/integration/tox-wikimedia hasn't changed in 2y, so I guess a new Python version?
[19:23:55] https://gerrit.wikimedia.org/r/plugins/gitiles/integration/tox-wikimedia/+/refs/heads/master/tox_wikimedia/__init__.py#24
[19:27:59] Krinkle: tox 4 was released last week, I suspect tox-wikimedia needs to be updated for it...
[19:28:22] https://tox.wiki/en/latest/faq.html#tox-4-new-plugin-system
[19:29:10] is there a cultural preference and/or technical requirement to not pin versions a bit for testing?
[19:29:25] mutante: ack, thx
[19:30:15] My memory is probably disproportionately biased as I feel inexperienced in Python, but it seems I'm only ever running into CI failures with Python due to unplanned/unsolicited dependency upgrades, whereas for PHP and JS we tend to do them more when we want to.
[19:30:22] until we adopt Poetry, it's just annoying in Python to pin stuff
[19:31:21] hm.. so "just" specifying "tox > 2.0 <= 3.x" or something somewhere isn't feasible?
[19:31:38] * Krinkle looks at navtiming repo
[19:31:54] I think it would go in https://gerrit.wikimedia.org/r/plugins/gitiles/integration/tox-wikimedia/+/refs/heads/master/setup.py#27
[19:32:12] oh it's pulling in tox itself
[19:32:27] and tox-wikimedia in turn isn't even specified in the repo
[19:32:34] that's just CI forcing the same for all repos
[19:33:00] I can poke at this after work, either doing <4 or the actual fix and doing tox>=4,<5
[19:33:05] no, it's at https://gerrit.wikimedia.org/r/plugins/gitiles/labs/codesearch/+/refs/heads/master/tox.ini#4
[19:35:14] it's pretty terrible AIUI, we run tox, it sees it needs tox-wikimedia, so then it sets up a virtualenv, installs tox + tox-wikimedia in that, and then uses that tox to actually run things
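Editor's note: a rough sketch of what the "<4" stopgap floated above would and would not admit. Everything here is an assumption for illustration: the `TOX_PIN` string is not taken from tox-wikimedia's actual setup.py, the helper names are made up, and the check only compares major versions, which is a simplification of real PEP 440 specifier matching.

```python
import re

# Hypothetical pin, written the way it might appear in a setup.py
# install_requires list; NOT copied from the real tox-wikimedia repo.
TOX_PIN = "tox>=3,<4"


def major_version(version: str) -> int:
    """Return the leading numeric component of a version like '3.28.0'."""
    match = re.match(r"(\d+)", version)
    if match is None:
        raise ValueError(f"not a version string: {version!r}")
    return int(match.group(1))


def allowed_by_pin(version: str, lower_major: int = 3, upper_major: int = 4) -> bool:
    """True if lower_major <= major(version) < upper_major.

    Simplified stand-in for evaluating TOX_PIN; real resolvers also
    handle pre-releases, epochs, and so on.
    """
    return lower_major <= major_version(version) < upper_major


if __name__ == "__main__":
    for candidate in ("2.9.1", "3.28.0", "4.0.0"):
        print(candidate, allowed_by_pin(candidate))
```

Once the plugin is ported to the tox 4 plugin system, the same helper with `lower_major=4, upper_major=5` corresponds to the "tox>=4,<5" option mentioned in the log.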