[07:59:25] volans: gimme a sec
[07:59:56] yep, same thing, now the question is if serviceops is interested in debugging the "bad" pod before deleting it
[08:01:06] it is two in a row
[08:01:29] ok I will dig into it a bit, see if something comes up, and then I will kill it
[08:02:08] effie: is there a way to "isolate" it so that it doesn't serve any more traffic but is still around for debugging?
[08:02:48] so at least we don't serve the 5xx
[08:04:02] one thing that comes to mind is complicated, as in changing the selector on the service a bit
[08:05:08] changing the labels of the pod maybe?
[08:06:10] could be, let me have a go
[08:07:59] right now it has labels: app=mediawiki,deployment=mw-api-ext,pod-template-hash=7686884f77,release=main,routed_via=main
[08:15:12] ok the pod disappeared before we could debug anything
[08:15:22] and errors are back to zero and recovery is coming
[08:32:42] for posterity the "disappearance" was due to the train passing by...
[08:39:11] <_joe_> ah damn
[08:39:33] <_joe_> volans: I think the problem is that the pod was still responding ok to its readiness probe
[08:39:57] <_joe_> while it shouldn't have
[08:40:26] <_joe_> volans, effie you can look in prometheus for the metrics of that specific pod
[08:40:36] <_joe_> maybe they'll tell you something about why it failed
[08:40:42] indeed that's why we wanted to isolate it for debugging
[08:40:56] also check if it was on the same host as the one from yesterday
[08:41:31] <_joe_> that can still be verified looking at events I think
[08:41:39] <_joe_> volans: but I doubt the host is the issue
[08:43:27] <_joe_> uhm so we just call a php script that returns 'OK'
[08:44:05] <_joe_> maybe we need a script that loads mediawiki at the very least, and tries to do some work locally
[08:44:24] <_joe_> like accessing data in apcu
[08:44:29] <_joe_> something like that
[08:44:36] something happened to that pod around 6:26 UTC, graphs are all almost zeroed since then (thanks effie for finding the dashboard :D )
[08:44:55] <_joe_> volans: if you talk about php graphs
[08:45:05] no, it is cpu graphs
[08:45:05] <_joe_> that's because there was no worker free to report metrics probably
[08:45:08] no, pod graphs, cpu/network/etc..
[08:45:21] <_joe_> uhm that's strange indeed
[08:45:23] and they are not 0, but very low
[08:45:31] <_joe_> ah no that is normal
[08:45:37] <_joe_> if every request is in a deadlock
[08:45:53] <_joe_> have you checked the slow logs from that pod already?
[08:45:55] but we alerted at 7:57
[08:46:33] <_joe_> volans: what was the name of the pod?
[08:47:05] mw-api-ext.eqiad.main-7686884f77-ql69d
[08:47:44] https://grafana.wikimedia.org/goto/8s5YHnXIg?orgId=1
[08:47:54] <_joe_> https://logstash.wikimedia.org/goto/8c535747c43965b0339b316c85f3510a
[08:48:00] <_joe_> not that useful I fear
[08:50:09] ok we will dig deeper, _joe_ we may ping you if nothing useful comes up
[09:43:31] hello folks, as FYI I am deploying spicerack 8.8.0 on cumin2002 to test it
[09:45:22] <_joe_> elukey: how do you plan to test spicerack? running cookbooks?
[09:47:04] _joe_ that is one way, or we can use the code directly (Riccardo has a script to use single modules in repl)
[09:51:14] elukey: that was supposed to be a secret
[09:51:21] :p
[09:53:05] yes sure :D
[10:15:07] as soon as I find a way to make it safe I'll puppetize it :D
[11:50:49] any maintenance in progress in codfw?
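A minimal sketch of the pod-isolation idea discussed at 08:04–08:07: change a label that the Service selector matches so the pod drops out of rotation while it keeps running for debugging. The namespace, Service name, and replacement label value below are assumptions for illustration, not commands that were run during the incident; only the pod name and the `routed_via=main` label come from the log.

```sh
# Hedged sketch, assuming the Service selects on the "routed_via" label shown in the log.
POD=mw-api-ext.eqiad.main-7686884f77-ql69d   # pod name from the log
NS=mw-api-ext                                # namespace is an assumption
SVC=mw-api-ext                               # Service name is an assumption

# Overwrite the label so the Service selector no longer matches; the endpoints
# controller removes the pod from the Service, but the pod itself keeps running.
kubectl -n "$NS" label pod "$POD" routed_via=debug --overwrite

# If the changed label is also part of the owning ReplicaSet's selector, the
# controller will start a replacement pod, which is usually what we want here.

# Verify the pod is gone from the Service endpoints.
kubectl -n "$NS" get endpoints "$SVC" -o wide
```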
[11:51:16] I racresetted a host, doubtful that's the reason
[11:51:21] !incidents
[11:51:22] 4880 (ACKED) [3x] ProbeDown sre (ip4 probes/service codfw)
[11:51:22] 4879 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams)
[11:51:22] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged)
[11:51:23] there's usually a heads up for that though
[11:51:23] No scheduled vendor maintenance on calendar
[11:51:39] might be network
[11:51:56] I see a lot of conn timeouts
[11:51:58] which graph shall we start from then
[11:52:01] yeah probably network, gitlab2002 (also in codfw) also alerted
[11:52:15] lots of different things indeed
[11:52:18] XioNoX, topranks: anything ongoing with network in codfw?
[11:52:24] yup.. DC wide
[11:52:28] should we depool it?
[11:52:29] great
[11:52:30] depool?
[11:52:31] volans: yeah something odd
[11:52:32] yes
[11:52:44] volans: I can get the patch ready
[11:52:50] volans: +1 depool
[11:52:56] alright hang on
[11:53:07] from some of the alerts.. "Unknown server host 'db2161.codfw.wmnet" --> DNS impacted as well there
[11:54:17] https://gerrit.wikimedia.org/r/c/operations/dns/+/1055189
[11:54:38] +1
[11:54:54] looking
[11:54:56] * kamila_ updating statuspage
[11:54:59] effie: we should start an incident on the status page IMO
[11:54:59] +1ed
[11:55:04] kamila_: <3 thanks
[11:55:17] volans: yes we should
[11:55:42] so low-traffic LVS is impacted and lvs-secondary, lvs-high-traffic1 and 2 are happy though
[11:55:51] so also
[11:55:52] 14:55:11 Unable to find image 'docker-registry.wikimedia.org/releng/operations-dnslint:0.0.12-s3' locally
[11:55:52] 14:55:11 docker: Error response from daemon: received unexpected HTTP status: 503 Service Unavailable.
[11:56:02] vgutierrez: should I stop? I'm deploying right now
[11:56:03] fwiw I'm 99% sure what it is (me) and fixing
[11:56:12] topranks: <3
[11:56:20] topranks: shall we hold our horses?
[11:56:22] topranks: great
[11:56:25] clush: 3/15
[11:56:33] it's a bit late for that, but we can revert
[11:56:35] OK - authdns-update successful on all nodes!
[11:56:40] apparently on codfw's ones too
[11:57:01] volans: the patch wasn't merged?
[11:57:03] ah no sorry, I'm stupid
[11:57:13] clicked submit but not continue on the popup
[11:57:19] so now it is merged
[11:57:23] lol
[11:57:26] waiting a second to deploy
[11:57:28] might have done that a few times myself before ^^
[11:57:29] if we have a solution
[11:57:48] topranks: is it a matter of a minute or 10?
[11:57:50] topranks, what's the ETA for fixing?
[11:58:10] 2-3 mins
[11:58:25] I think that every one of us pinging him on IRC is not helping
[11:58:40] So surprisingly, codfw mw-web was still serving ~1krps
[11:58:46] We need to know which actions to take
[11:58:54] godog/o11y: rsyslog is munching 5 CPU cores in lvs2012...
[11:59:05] claime: cp@text seems to be ok there
[11:59:09] IMO depooling is still the best way. By the laws of the universe it will take 6-10 minutes
[11:59:12] or longer
[11:59:21] yeah I agree we should depool
[11:59:22] rsyslog is also working hard on the gitlab host :)
[11:59:32] unable to hit kafka?
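A hedged sketch of the depool flow referenced above (the gerrit DNS patch at 11:54:17 plus the authdns-update run at 11:56:35). It assumes the usual operations/dns discovery admin-state mechanism and an authdns host to deploy from; the dig check is illustrative and the record name is taken from the 4879 alert in the log.

```sh
# Hedged sketch of the DC-depool-via-DNS flow, under the assumptions above.
# 1. Get the depool change reviewed and merged, e.g. the patch linked above:
#    https://gerrit.wikimedia.org/r/c/operations/dns/+/1055189
#    (remember to confirm the "Continue" popup after clicking Submit).
# 2. Deploy it from an authdns host:
sudo authdns-update
# 3. Check that the discovery record no longer resolves to codfw:
dig +short mw-api-ext-ro.discovery.wmnet
```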
[11:59:33] I'm serving between 5 and 20% 5xx
[11:59:50] the impact is not enormous:
[11:59:50] https://grafana.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status?orgId=1&refresh=5m
[11:59:54] vgutierrez: ok
[12:00:07] I was about to say that it is not as bad as it looks
[12:00:17] checking
[12:00:45] ok hopefully I've undone the damage I did
[12:00:54] 🤞
[12:00:57] recoveries are coming
[12:00:57] recoveries coming up
[12:01:02] <3 topranks
[12:01:06] 200 5xx/s to mw-web is not great though
[12:01:07] saving the day!
[12:01:09] let's see it recover
[12:01:11] topranks: does that include mental damages? no ?
[12:01:14] 👍
[12:01:14] <3
[12:01:14] give it a second
[12:01:26] topranks: many thanks, let's see how things go
[12:01:27] effie: not at all, I'm only preparing for the big network migration today ffs ....
[12:01:34] rsyslog CPU recovering in lvs2012
[12:01:36] topranks: <3
[12:01:40] thanks for the fix topranks
[12:01:55] https://gerrit.wikimedia.org/r/c/operations/dns/+/1055193
[12:01:57] for the revert
[12:02:12] so - in brief - I miscalculated and added the codfw row C and D vlan interfaces to the new spines in codfw
[12:02:20] 5xx recovered
[12:02:20] +1ed volans
[12:02:43] that creates a BGP route for those networks on the new spines.
[12:03:04] my miscalculation was that I hadn't appreciated that those spines are connected to the row A spines directly
[12:03:14] volans: topranks shall we put it on an incident report or not, is the question
[12:03:31] so while the change didn't stop the CRs talking directly to row C/D, it caused rows A/B to see that route and prefer it
[12:03:34] I've merged the revert
[12:03:40] so authdns should be a noop now
[12:03:41] I'm curious if that triggered udp sat from k8s logging
[12:03:42] the C/D spines are not yet connected to the hosts in those rows so.... blackhole traffic :(
[12:03:44] sorry
[12:04:02] it logged up to 300k messages per minute just for mediawiki
[12:04:41] we got new pages
[12:04:46] 4880 (ACKED) [3x] ProbeDown sre (ip4 probes/service codfw)
[12:06:24] tons of hosts being repooled on lvs2013 :)
[12:06:25] topranks: lsw in codfw c and d still reported as down by icinga
[12:06:30] apus in codfw is a bit sad about slow heartbeats, but I'm guessing that'll recover now the network is happier again
[12:06:49] interesting, all from the host in C2
[12:07:04] but might be icinga being slow
[12:07:04] volans: they aren't in service yet - I disabled the port connecting to those spines, I can re-enable it shortly but will wait a moment
[12:07:10] ah ok
[12:07:14] no worries that's fine
[12:07:21] no it's cos I shut the CR ports connecting to all the new switches - as they aren't in use
[12:07:27] it's just harder to distinguish real issues from fake ones :D
[12:07:34] ack
[12:07:54] yeah, probably they shouldn't be in icinga yet tbh, that is also on me
[12:08:02] icinga looks better, alerts is still recovering
[12:08:27] vgutierrez: should I run an authdns-update just in case?
[12:08:41] it never really hurts
[12:08:55] k
[12:09:21] bblack: weird prompt, never happened to me
[12:09:22] there will be some bgp alerts for CRs and SSW1-A* in codfw, that's expected also and shouldn't be an issue
[12:09:24] Pulling the current revision from https://gerrit.wikimedia.org/r/operations/dns.git
[12:09:27] Reviewing 39d38739a4ec0d9d9117c9d7ae0266e04b86af0f...
[12:09:31] Merge these changes? (yes/no)?
[12:09:39] and there are 2 empty lines between my last 2 lines
[12:09:46] it's the noop right?
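A hedged sketch of how the blackholing described at 12:02–12:03 could be confirmed from the host and router side. The target host comes from the alert quoted earlier in the log; the prefix and the router command are illustrative assumptions, not steps taken during the incident.

```sh
# From a host in codfw row A/B, check whether traffic toward a row C/D host
# dies at the new spine hop (target hostname taken from the log).
mtr --report --report-cycles 5 db2161.codfw.wmnet

# On the Juniper devices, check which BGP path is preferred for the affected
# row C/D subnet (prefix below is a placeholder, not the real one):
#   > show route 10.192.32.0/22 exact
```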
[12:10:02] just checking to be sure :)
[12:10:04] yeah I assume that's because 2x commits netted out to a zero-line diff
[12:10:15] ok, proceeding
[12:11:12] all done
[12:12:12] kamila_: <3 thanks a lot for the continuous update of the status page, for now I'd leave it for a moment like that until we're confident everything is recovered
[12:12:50] volans: sure, happy to free up your hands
[12:13:30] [feedback for o11y] it would be really useful to filter alerts on AM by duration, or sort them, like in icinga, to distinguish the pre-existing ones from the outage ones
[12:13:41] all pages resolved now
[12:14:39] [feedback for everyone] it would be really useful if we kept our active alerts cleaner when we're not in an outage, so it's not such a mess when we're in one :)
[12:15:06] +1000
[12:15:16] eheheh
[12:16:28] I'm about to force a puppet run on failed hosts in codfw, might help with the recovery
[12:16:30] I'll handle the sessionstore pods that got moved because the nodes were down
[12:17:07] claime: thanks, I was about to ask
[12:17:10] about that alert
[12:17:21] just need to check the nodes are back up
[12:17:26] * volans running sudo cumin -b 30 -p 95 'A:codfw' 'run-puppet-agent -q --failed-only'
[12:20:05] has anyone created a doc yet? otherwise I'll start one and start backfilling it
[12:20:31] I can do that if you're busy
[12:20:45] sobanski: that would help, thanks a lot
[12:21:06] all done with sessionstore
[12:21:18] <3
[12:21:45] my $beverage_of_choice debts are growing fast today
[12:22:52] Can we consider that state is stable enough for deployments to restart?
[12:23:24] re-checking icinga/AM
[12:24:52] as far as mw-on-k8s is concerned, all good
[12:25:08] nice
[12:25:27] <3
[12:25:42] yeah I think all remaining things are minor, at least they look so
[12:26:10] commented in the other channel
[12:27:49] I'm gonna grab some lunch
[12:28:02] thanks everyone that intervened, it helped a lot
[12:29:34] puppet run completed
[12:29:39] all hosts successful
[12:36:33] * volans grabbing some food, will backfill the incident doc more later
[12:39:12] [apus ceph cluster HEALTH_OK again now]
[12:44:55] I think we can resolve the status page then, any objection?
[12:47:30] going once... :)
[12:48:05] +1
[12:48:58] sold :D
[12:49:24] apologies again for the drama folks
[12:49:44] thanks everyone for stepping in, esp. volans for taking charge <3
[12:50:35] I've reverted the root cause of the problem (row C/D vlan IP interfaces added on new spines in row D)
[12:50:49] no worries topranks, happened to be oncall :-P
[12:51:00] So I'll now roll back the emergency changes made to restore things (ports shut down on CRs / row A spines)
[12:51:22] I was worried you might have been having a boring day yeah :P
[12:57:17] Ok, everything rolled back to the previous state, looks ok so far
[13:06:25] * volans watching closely :D
[14:52:31] Folks I am about to commence work on T366941
[14:52:31] T366941: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941
[14:53:08] I'm confident that the plan is ok and we won't have any problems, I will be extra cautious of course given my earlier mistake
[15:08:34] hey folks, I disabled puppet on install2004 to test https://phabricator.wikimedia.org/T363576#9994708 with Papaul. Please ping me if you need puppet running in there.
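On the [feedback for o11y] note at 12:13:30: until the Alertmanager UI can sort by duration, a rough client-side workaround is to sort firing alerts by their start time. This is a hedged sketch; it assumes amtool is configured against the production Alertmanager (e.g. via --alertmanager.url or its config file), and the jq formatting is illustrative.

```sh
# Hedged workaround: list firing alerts oldest-first so pre-existing alerts
# stand out from the ones created during the outage.
amtool alert query -o json \
  | jq -r 'sort_by(.startsAt) | .[] | "\(.startsAt)  \(.labels.alertname)"'
```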
[15:11:03] nevermind, TIL about the dhcp cookbook
[17:59:23] FYI, merging DNS changes in https://gerrit.wikimedia.org/r/c/operations/dns/+/1055256 to direct appservers-ro.discovery.wmnet to failoid
[18:14:51] same deal, now merging https://gerrit.wikimedia.org/r/c/operations/dns/+/1055268 to direct api-ro to failoid
[18:28:54] FYI, appservers-ro and api-ro.discovery.wmnet now resolve to failoid. details and rollback instructions: https://phabricator.wikimedia.org/T367949#9996177
[19:56:29] I added a new option to geoip/mediawiki::common. You can now toggle whether an attempt is made to pull geoip data from "volatile" on a puppetmaster/puppetserver or not. Default is true, so nothing changed anywhere in production. But you can disable it, which means you can have deployment_server and local project puppetserver in Cloud VPS without having to set up fake volatile
[19:57:06] (popped up as a puppet issue in multiple places)
[20:07:32] yeah, it works. It finally made it possible to run puppet on a deployment_server instance in Cloud VPS, like scap deploys in cloud
[20:48:48] plugging Tyler's "sshecret" wrapper to ensure ssh-agent isn't forwarding prod keys to cloud: https://wikitech.wikimedia.org/w/index.php?title=SRE%2FProduction_access&diff=2207793&oldid=2174820
[20:49:23] https://github.com/thcipriani/sshecret/
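A hedged way to confirm the failoid redirect announced at 18:28:54. The record names are taken from the log; querying with dig from a production host using the default resolver is an assumption, and the expected answer depends on the failoid service addresses.

```sh
# Hedged check: both discovery records should now return the failoid address(es).
dig +short appservers-ro.discovery.wmnet
dig +short api-ro.discovery.wmnet
```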