[06:22:58] I have restarted stashbot as it seems to be down
[10:34:23] codfw is the active datacenter, correct? The `server` field in the response header for enwiki shows `mw1455.eqiad.wmnet`, so I am confused.
[10:35:03] kostajh: these days read requests are served from both datacenters
[10:35:18] ah, right. thanks
[10:47:53] is it possible to link to SAL at a particular time [for an incident report]? I can link to one particular line, or a particular day, but e.g. SAL starting at 14:17 yesterday would be nice...
[10:52:43] Emperor: yeah you can use sal.toolforge.org, e.g. https://sal.toolforge.org/log/wn6Uj4cBhuQtenzvXRSx
[10:56:29] kostajh: yeah, that's the link to a particular line thing I alluded to; I guess I can't link to "here's SAL from the duration of the incident" or whatever, just link to key bits directly?
[11:47:16] btullis: are you around?
[11:47:29] effie: Yes, I'm here.
[11:47:49] great! in T333377, testreduce is mistakenly under serviceops
[11:47:49] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377
[11:48:11] we are near the window, if there is any action to be taken for it
[11:48:16] now is the time
[11:48:36] I'm afraid I've never heard of testreduce. Probably not our team either.
[11:51:37] Looks like Content Transform Team (formerly Parsing & Product Infrastructure) #mediawiki-parsoid or #wikimedia-infrastructure
[11:52:01] anything i can help with? (i'm vaguely aware of testreduce)
[11:52:42] ihurbain: Thanks. It's about to get disconnected from the network for ~15 to 20 minutes. Is there any action needed before that happens?
[11:55:10] let me double check we don't have anything right now and send a notification in the team, but it should be all good - it's the server on which we run rt-testing, and it shouldn't be that active at this time of the week
[11:57:02] yup, we're good, no action.
[11:58:03] ihurbain: Many thanks.
[11:58:56] We're good to go for the switch upgrade, from the perspective of the DE team.
[12:12:42] I have done all of our prep for the switch upgrade an hour early, like a total dimwit.
[12:27:12] btullis: I set the reminder one hour earlier as well, don't worry :)
[12:28:00] ihurbain btullis thank you both
[12:30:03] Reminder all, Eqiad Row D switch upgrade to take place in 30 minutes T333377
[12:30:04] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377
[12:32:35] topranks: cheers, eqiad is already depooled
[12:33:13] is eqiad already depooled?
[12:33:43] effie: doesn't seem so unless I am misreading :)
[12:36:32] I think you meant for services, not DNS :P
[12:37:03] topranks: https://gerrit.wikimedia.org/r/c/operations/dns/+/909662 whenever you want to merge
[12:37:53] sukhe: super thanks
[12:39:13] sukhe: Yeah, e.ffie meant for the services themselves
[12:41:36] yeah
[12:41:45] we were looking at things from a different POV :p
[12:42:01] :P
[12:43:17] topranks: can you ping me just before so i can disable puppet
[12:43:27] jbond: sure thing
[12:43:30] cheers
[13:10:27] jbond: if you can disable puppet ahead of the switch upgrade that'd be great
[13:11:44] ack one sec
[13:12:45] thanks
[13:15:34] topranks: done
[13:15:46] jbond: thanks
[13:19:15] All ready my end, just waiting on the switch to be ready for reboot, eqiad row D going down in ~5min
[13:19:34] $deityspeed!
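
The 10:34 question above about which backend answered can be checked straight from the response headers. A minimal sketch using only the Python standard library, assuming the `server` header keeps the `mwNNNN.<site>.wmnet` form quoted in the log (the User-Agent string is an arbitrary placeholder); per the answer at 10:35, a single request only shows which datacenter served that one read, not which one is "active".

    # Sketch: inspect which backend (and hence which site) served a request,
    # based on the `server` response header mentioned at 10:34. The header
    # format (mwNNNN.<site>.wmnet) is taken from the log and treated as an
    # assumption, not a guarantee.
    import urllib.request

    req = urllib.request.Request(
        "https://en.wikipedia.org/wiki/Main_Page",
        headers={"User-Agent": "dc-check-sketch/0.1"},  # placeholder UA
    )
    with urllib.request.urlopen(req) as resp:
        backend = resp.headers.get("server") or ""

    # e.g. "mw1455.eqiad.wmnet" -> site "eqiad"
    site = backend.split(".")[1] if backend.count(".") >= 2 else "unknown"
    print(f"served by {backend or 'unknown'} (site: {site})")
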
[13:20:48] ores nodes depooled (all we need to do from ML-team's end)
[13:21:16] klausman: thanks
[13:25:37] OK rebooting row D switches now
[13:27:30] things I didn't take into account: cumin1001 being unreachable :D
[13:29:57] the old "chop off the branch you're standing on" trick :)
[13:30:30] klausman: there's 2002 :)
[13:32:41] I'm having trouble getting into things generally in eqiad, servers not in row D, e.g. an-coord1001 or cephosd1001. Anyone else?
[13:33:15] btullis: which bast are you using?
[13:33:20] btullis: bast1003 is unavailable
[13:33:30] so if you are using that, switch to bast2002
[13:33:40] (bast1003 is one of the hosts in the switch maintenance)
[13:33:51] sukhe: yes, beat me to it
[13:33:58] (see Moritz's email)
[13:34:01] tunnels via bast1003 will fail right now
[13:34:37] Thanks all. Yes, I thought I was using bast3006, but now that I check you're all correct.
[13:35:04] switches should be coming back online shortly, reboot completed on at least 1
[13:35:24] irc.w.o failover went fine, most bots reconnected to irc2001 by now
[13:36:14] that's good to see :)
[13:38:07] Starting to learn MAC addresses on the QFX devices, some machines pingable again
[13:38:16] VC status looks healthy so far
[13:38:20] All ML machines ping again
[13:38:23] marostegui: db1133 may need a manual replication restart
[13:38:46] yes, it is on my radar
[13:39:44] All ports back up now
[13:39:45] cumin1001 and bast are back online
[13:39:46] things looking good on my side
[13:42:00] restarted etcdmirror on etcd2005 (down due to conf1009 under maintenance)
[13:42:48] kafka jumbo also ok, 3 nodes down and it worked nicely (haven't seen so many nodes down at once so far)
[13:43:14] nice job folks :)
[13:43:40] indeed, great job all!
[13:43:52] good stuff! thanks for all the help folks :)
[13:44:11] will wait for ten minutes or so before pooling back eqiad
[13:48:06] topranks: great job!
[13:48:31] dhinus: dbproxy1018 might need a reload
[13:48:53] there was some host that took longer to let me ssh in, but it is fine now
[13:50:29] ores repooled
[13:51:26] other ml services have recovered without intervention
[13:51:41] * jbond will re-enable puppet
[13:51:45] Device rebooted after 3 years 210 days 13 hours 22 minutes 30 seconds -> 43s :)
[13:52:06] marostegui: looking
[13:58:39] marostegui: haproxy restarted and the alert is gone. do you know what caused the alert?
[13:59:17] dhinus: probably the switch maintenance
[13:59:27] repooling eqiad unless objections
[14:00:31] sukhe: no, I think all looking good, thanks
[14:00:59] ok, going ahead!
[14:10:05] eqiad repooled
[14:15:24] I've done a first draft of yesterday's incident report - https://wikitech.wikimedia.org/wiki/Incidents/2023-04-17_eqiad/LVS would be grateful for expert review (hi sukhe :) )
[14:15:55] Huh, where did my graphs go?
[14:18:00] https://wikitech.wikimedia.org/wiki/File:2023-04-17-haproxy.png says no file exists, but I see from the upload log that https://upload.wikimedia.org/wikipedia/labs/3/39/2023-04-17-haproxy.png does exist. I just used the VE upload tool...
[14:19:17] Before I Make It Worse, what did I do wrong, and how do I get my images in the right place?
[14:21:43] Emperor: thanks for writing this!
[14:24:15] Emperor: Thanks for the IR <3
[14:26:01] Emperor: I know
[14:26:11] Re: image
[14:26:37] jynus: ... ?
:)
[14:26:43] Emperor: workaround is to wait a few hours https://phabricator.wikimedia.org/T334487
[14:27:06] ticket is filed, it is something related to thumbs, your file is there (in swift)
[14:28:33] it is currently in an "accidentally eventually consistent" mode :-) https://wikitech.wikimedia.org/wiki/Special:ListFiles
[14:29:31] but the file is uploaded: https://upload.wikimedia.org/wikipedia/labs/1/16/2023-04-17-errors.png
[14:29:43] OK, I am happy with "procrastinate, it'll work eventually" :)
[14:30:11] not saying that's good, but this way you don't spend hours figuring out what went wrong, like I did
[14:30:37] I suspect T334232 might be related
[14:30:38] T334232: Wikitech experiencing a spike of stale cache errors since 2023-03-15 - https://phabricator.wikimedia.org/T334232
[14:30:53] thanks, appreciated. It is strange, though - the thumbs looked to be there while I was using VE
[14:31:27] maybe those were client side? no idea
[14:59:07] jbond: did the Cumin run to re-enable complete? I just noticed that puppet is still disabled with the note pointing to T333377 on serpens.w.o
[14:59:07] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377
[14:59:33] moritzm: should have done
[15:00:12] I'll rerun it to kick any stragglers
[15:00:50] ack, thanks
[15:00:52] We should maybe repool eqiad services if the maintenance is done, effie
[15:07:13] repooling services
[15:12:45] sorry claime, I had a crisis here
[15:12:48] are you repooling?
[15:13:50] Yes
[15:14:13] no worries ;)
[15:49:49] idk what's going on with the sre.discovery.datacenter cookbook but it keeps hanging while checking DNS
[15:50:20] Not always on the same service either
[15:50:47] It's supposed to retry every 3s but it just stops trying and hangs there. Nothing remarkable in the logs
[15:50:51] any output you can share from there?
[15:50:59] [35/53] Handling A/A service swift-ro
[15:51:01] Setting pooled=True for tags: {'dnsdisc': '(swift-ro)', 'name': 'eqiad'}
[15:51:03] [1/15, retrying in 3.00s] Attempt to run 'cookbooks.sre.discovery.datacenter.DiscoveryRecord.check_records' raised: Error checking auth dns for swift-ro.discovery.wmnet in eqiad: resolved to 10.2.1.27, expected: 10.2.2.27
[15:51:06] er
[15:51:08] hmm
[15:51:14] It's been stuck on that for 7 minutes
[15:51:30] So more than the TTL, and way more than its retry time
[15:51:35] this: resolved to 10.2.1.27, expected: 10.2.2.27, seems familiar from the EDNS client subnet issue we had once
[15:51:46] That's expected though
[15:51:50] A normal log is
[15:52:01] this for 3/4 retries
[15:52:12] we depooled one dns host, not sure if it is related to that or not but it's repooled again
[15:52:15] Then it resolves ok and it goes to wipe
[15:52:19] Maybe
[15:52:20] let me check if running authdns-update works fine
[15:52:29] we did it for pooling eqiad so it was OK
[15:52:31] If it's that I'll ctrl-c and restart
[15:52:32] but let me check
[15:52:56] It's not a problem to restart it tbh, it's just strange
[15:53:10] but didn't you restart it again already?
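
For reference, this is roughly the shape of the check the cookbook output above describes: query one authoritative server for the discovery record and compare the result with the expected address, retrying on mismatch or failure. A hedged sketch using dnspython 2.x (`resolve()`; older versions use `query()`); the record name, expected address and the 15x3s retry pattern are taken from the log, the nameserver IP is a placeholder, and none of this is the cookbook's actual code.

    # Sketch of the auth-DNS check described in the cookbook output above.
    # swift-ro.discovery.wmnet, 10.2.2.27 and the 15 retries at 3s come
    # from the log; 192.0.2.1 is a placeholder nameserver address.
    import time

    import dns.exception
    import dns.resolver

    RECORD = "swift-ro.discovery.wmnet"
    EXPECTED = "10.2.2.27"
    AUTH_NS = "192.0.2.1"  # placeholder: IP of one authoritative DNS server

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [AUTH_NS]

    for attempt in range(1, 16):
        try:
            answer = resolver.resolve(RECORD, "A")
            addresses = [rr.address for rr in answer]
        except dns.exception.DNSException as exc:
            print(f"[{attempt}/15] query failed: {exc}; retrying in 3s")
            time.sleep(3)
            continue
        if EXPECTED in addresses:
            print(f"[{attempt}/15] {RECORD} -> {addresses}: matches expected")
            break
        print(f"[{attempt}/15] resolved to {addresses}, expected {EXPECTED}; retrying in 3s")
        time.sleep(3)
    else:
        raise RuntimeError(f"{RECORD} never resolved to {EXPECTED} via {AUTH_NS}")

Per the discussion above, a few mismatch retries right after pooling are expected; the oddity being reported is the loop then stopping without either succeeding or retrying.
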
[15:53:15] Yep
[15:53:24] And it repooled a few services
[15:53:28] Then hung again
[15:53:49] all OK on running authdns at least
[15:53:52] It repooled 18 services to be exact
[15:54:23] I'll restart it again, it's more important that we repool the services than anything else at that point
[15:54:28] ok
[15:54:59] The service is actually pooled too
[15:55:08] it just stops checking
[15:59:23] and again
[15:59:28] * claime grumbles
[15:59:41] I'll attach an strace right away this time
[16:01:04] Watch it become a Schrödinger's bug that disappears when observed
[16:01:50] ha
[16:02:47] I would have attributed this to DNS and what happened today (who wouldn't attribute it to DNS?) but I just don't see it
[16:02:50] Ah, new bug just dropped
[16:03:03] File "/srv/deployment/spicerack/cookbooks/sre/discovery/__init__.py", line 69, in resolve_with_client_ip
[16:03:05] logger.debug('[%s] %s -> %s TTL %d', nameserver, record, answer[0].address, answer.ttl)
[16:03:07] File "/usr/lib/python3/dist-packages/dns/resolver.py", line 277, in __getitem__
[16:03:09] raise IndexError
[16:03:11] :D
[16:03:27] better each time
[16:04:31] Great, that now happens every run
[16:04:36] I'll finish pooling manually
[16:04:40] I don't have time to debug it rb
[16:04:42] rn
[16:05:05] is that the entire traceback?
[16:06:06] No, I can't paste the entire traceback
[16:06:10] Gimme a sec to phaste it
[16:07:30] https://phabricator.wikimedia.org/P47096
[16:08:20] ok all pooled
[16:09:35] so something is happening at answer[0].address
[16:14:34] ok we're done with the services repool, I'm off
[16:14:41] enjoy!
[16:14:52] we should revisit the above at some point
[16:14:55] thanks, have a nice day
[16:14:57] Yep
[16:15:12] It's strange that it worked perfectly for depooling
[16:15:24] It's the first time I've seen it bug out in that way
[16:16:45] I've only partly skimmed the above, but yeah I'd guess a dns1002 depool relation as well
[16:17:02] because it seems a lot like the pattern of the LVS thing before, so probably some of the same thought crimes are being committed
[16:17:24] but dns1002 was also depooled when the service cookbook was used for depooling
[16:17:29] and now dns1002 is back and pooled
[16:17:31] (that some automation can do data things with the set of DNS servers from puppet and ignore their current operational states)
[16:17:34] unless something is missing in between
[16:17:42] or the problem only affects one side of the operations
[16:17:45] yeah
[16:18:08] that happens, so I also ran the agent on A:dns-auth immediately after pooling
[16:18:26] really not seeing where it is failing if it was the dns1002 depool that is causing this
[16:18:38] ok
[16:19:08] the IndexError suggests that it got an NXDOMAIN or REFUSED and thus answer[0] returns the error
[16:19:16] but I am just speculating based on the traceback :)
[16:19:35] I don't think we'd ever return NXD or REFUSED though
[16:19:46] happy to try other things to debug this. I am worried about this coming up in some other form
[16:19:49] or some other time
[16:19:51] but maybe the same IndexError can happen from a total failure of the query (e.g. at the network or host-down kind of level)
[16:20:39] sukhe: I forgot to ask: did you authdns-update dns1002 after the window it was out of authdns_servers?
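
A small illustration of the failure mode guessed at around 16:19 above: when a dnspython Answer carries no records, indexing it raises IndexError from Answer.__getitem__, which is exactly where the pasted traceback ends. This sketch only demonstrates that dnspython behaviour with a defensive guard; it is not a patch to the real cookbook, and `describe_answer` is a hypothetical helper.

    # Sketch: answer[0] on an empty dnspython Answer raises IndexError,
    # matching the traceback above. Guarding on the rrset lets a debug
    # log line report the empty case instead of raising.
    import dns.resolver


    def describe_answer(answer: dns.resolver.Answer) -> str:
        """Summarise an Answer for logging without assuming it is non-empty."""
        if answer.rrset is None or len(answer.rrset) == 0:
            # Empty answer section, so answer[0] would raise IndexError here.
            return f"{answer.qname} -> <no records>"
        # Mirrors the fields used by the cookbook's debug line in the paste.
        return f"{answer.qname} -> {answer[0].address} TTL {answer.ttl}"

Whether the empty answer came from a failed query or simply from a response with no usable records is exactly what the discussion above leaves open.
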
[16:20:44] it's also possible we missed some data updates
[16:20:48] bblack: yep
[16:20:52] ok
[16:20:53] twice now, once to pool eqiad
[16:20:59] and also to test if something was happening there
[16:21:18] hmmm
[16:21:32] now I'm trying to remember also if authdns-update will fix netbox-driven data though
[16:21:38] (if we missed netbox updates during that window)
[16:21:47] I think maybe it doesn't
[16:22:01] I can try running the dns netbox cookbook to check!
[16:22:02] doing
[16:23:25] I may have got there after your push
[16:23:51] but a quick check on dns100[12]'s actual files /etc/gdnsd/zones/netbox/ says they're all the same on both (just before my previous line)
[16:24:00] yeah, NOOP
[16:24:04] No changes to deploy.
[16:24:15] my very fancy check was like:
[16:24:17] bblack@dns1002:/etc/gdnsd/zones$ cd netbox/
[16:24:17] bblack@dns1002:/etc/gdnsd/zones/netbox$ md5sum * >/tmp/x
[16:24:17] bblack@dns1002:/etc/gdnsd/zones/netbox$ md5sum /tmp/x
[16:24:17] c19ed55b6a5608e1ac28c06c8c24c418 /tmp/x
[16:24:41] :P
[16:30:08] dns1002 looks good on the routers too. you never know if it's unreachable because of some bird/BGP misgivings
[21:14:06] jhathaway: Thanks for fixing many typos https://phabricator.wikimedia.org/P47135
[21:14:30] yup!
[21:21:41] thanks for merging it Amir1!
[21:22:19] I didn't do much, you did all the work
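
The md5sum-of-md5sums check at 16:24 above generalises to a short script. A sketch assuming only the /etc/gdnsd/zones/netbox path shown in the log; run it independently on each authdns host and compare the combined digest lines.

    # Sketch generalising the manual check above: hash every file in the
    # netbox-generated zone directory and fold the per-file digests into
    # one value, giving a single comparable line per host.
    import hashlib
    from pathlib import Path

    ZONE_DIR = Path("/etc/gdnsd/zones/netbox")  # path taken from the log

    combined = hashlib.md5()
    for path in sorted(ZONE_DIR.iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        combined.update(f"{digest}  {path.name}\n".encode())
        print(f"{digest}  {path.name}")

    print(f"combined: {combined.hexdigest()}")
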