[06:17:55] ^ I bet those rebuild failures are linked to the removal of `buster-backports`. The Go lang images are using them, see https://phabricator.wikimedia.org/T362518
[09:24:59] effie: I'd say roll back and understand later, no?
[09:25:19] jayme: scap is not finished, and it seems that there is one container that is actually not starting
[09:25:47] so I am not sure what to do, control-C and use the previous deployment on mw-web?
[09:26:11] which container and where?
[09:26:14] mw-web in eqiad?
[09:26:32] yes
[09:27:30] here, sorry
[09:28:13] mediawiki-main-httpd is failing its readiness probe, yes
[09:28:15] are you taking over, jayme?
[09:28:28] jynus: we are, yes
[09:28:30] so there is only 1 person applying changes
[09:28:45] jynus: status is I rolled out a change that is going poorly
[09:28:49] and we are mid-deployments
[09:29:10] jayme: claime ctrl+c scap and do a rollback?
[09:29:14] yes
[09:29:16] that is my suggestion
[09:29:19] yeah, I am aware, I am just trying to help coordinate
[09:29:29] jayme: agree?
[09:29:42] it is causing a user-visible outage
[09:29:52] jynus: thank you, we are aware
[09:29:54] please proceed, effie
[09:30:16] jayme: ping
[09:30:31] sure
[09:30:34] ok
[09:30:43] going for rollback
[09:30:55] no idea what scap does in that case - but better than this
[09:31:01] yes
[09:31:34] Yikes, I am seeing a big outage on Wikipedia, Wikivoyage and probably other hosted sites.
[09:31:44] Error: 502, Broken pipe at 2024-04-17 09:31:08 GMT
[09:31:45] Guest67: we are aware
[09:31:51] Guest67: see the status page for updates
[09:32:38] I created a status doc in case it's needed for notes: https://docs.google.com/document/d/1tGYINnf_POJ5r9YPwSrbBaKLuRYCt44MrmAsMb-Y-p4/edit#heading=h.95p2g5d67t9q
[09:33:43] effie: how goes the rollback?
[09:33:52] mw-web says happy helming
[09:33:59] I am looking at where it stands
[09:34:24] Apologies, I see Jaime already created one here: https://docs.google.com/document/d/14oZQO7RQCvQGXrVS4L59SyaBKBrtw26P3K28qXmQnqw/edit#heading=h.95p2g5d67t9q. Disregard my previous link
[09:34:33] sobanski: please don't create one, or ask the IC first
[09:34:50] ACK
[09:35:12] I will add the few bits I have witnessed to the timeline
[09:35:13] mw-api-int is dead, I'll roll it back with helm
[09:35:49] claime: jayme I think we should depool eqiad for reads
[09:36:04] claime: I am doing ext
[09:36:09] you do int
[09:36:33] mw-web is not really happy even after the rollback
[09:36:37] 4 available replicas
[09:36:51] I think we hit some resource limit
[09:36:58] rather than the change itself
[09:38:46] I'm around, let me know if I can help with anything (gets out of the way)
[09:39:10] How long will the fix take, please? We are on Europe time, many people want to visit the website
[09:39:26] Guest67: no estimate. We'll know when we know
[09:39:48] I see a lot of readiness probes failing
[09:40:18] should we depool esams?
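For the helm rollback mentioned above ("mw-api-int is dead, I'll roll it back with helm"), a rough sketch of what that looks like with bare helm on a generic cluster. Only the mw-api-int namespace comes from the log; the release name "main" is an assumption for illustration, and in this environment releases are normally driven through scap/helmfile rather than helm directly.

    # List recent revisions of the release to find the last known-good one.
    # Release name "main" is assumed, not taken from the log.
    helm history main -n mw-api-int

    # With no revision argument, helm rolls back to the previous release.
    helm rollback main -n mw-api-int

    # Watch the pods come back and readiness probes turn green again.
    kubectl get pods -n mw-api-int -w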
[09:40:37] topranks: it's not esams only
[09:40:39] topranks: this is eqiad, it wouldn't help to depool esams
[09:40:44] topranks: the issue is with the app servers, it won't change much
[09:40:52] anything that points to eqiad right now for application servers has problems
[09:40:53] ok np
[09:40:58] esams is struggling more cause it has more traffic at the moment, but drmrs has the same issues
[09:40:59] yeah brain fart
[09:41:02] the cdn is just reporting the issues on the primary dc
[09:41:11] sure
[09:41:16] 8/9 containers ready for all pods, looking at what the non-ready one is
[09:41:43] topranks: https://grafana.wikimedia.org/goto/jj37o8aSR?orgId=1 --> drmrs suffering
[09:41:49] We're going to depool eqiad from reads
[09:42:01] yep
[09:42:54] mediawiki-main-httpd terminates with Error 137
[09:43:01] so it's apache failing
[09:43:10] 137 is oom IIRC
[09:43:19] 137 is 128+9, so SIGKILL IIRC?
[09:43:20] I think it's thundering herd maybe
[09:43:38] [Wed Apr 17 09:36:49.303699 2024] [mpm_worker:error] [pid 1:tid 139839049483392] AH00286: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
[09:43:45] Is this global, or can I maybe work via VPN from America / Asia?
[09:43:48] is what I see last from a killed container
[09:44:28] claime is depooling eqiad for reads
[09:44:44] Guest67: please refer to https://www.wikimediastatus.net/ for more information. we'll be keeping it updated.
[09:44:50] but it is global for some things
[09:45:21] all mw-on-k8s read services depooled from eqiad
[09:47:11] !incidents
[09:47:11] 4614 (ACKED) ProbeDown sre (185.15.59.224 ip4 text-https:443 probes/service http_text-https_ip4 esams)
[09:47:12] 4615 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule)
[09:47:12] 4616 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[09:47:12] 4617 (ACKED) VarnishUnavailable global sre (varnish-text thanos-rule)
[09:47:12] 4618 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet)
[09:47:12] 4613 (RESOLVED) [17x] ProbeDown sre (probes/service eqiad)
[09:47:13] 4612 (RESOLVED) ProbeDown sre (10.2.1.81 ip4 mw-api-int:4446 probes/service http_mw-api-int_ip4 codfw)
[09:47:13] 4611 (RESOLVED) [2x] ProbeDown sre (kubemaster1002:6443 probes/custom eqiad)
[09:47:44] (MediaWikiHighErrorRate) resolved
[09:47:55] more resolutions
[09:48:16] just noticed coredns pods are in crashloopbackoff
[09:48:43] confirm empirically
[09:48:48] hnowlan: eqiad?
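On the exit code discussed above: 137 is 128 + 9, i.e. the container's main process was terminated with SIGKILL, which can come either from the kernel OOM killer (memory limit) or from the kubelet. A rough way to check which, with a placeholder pod name and the mw-web namespace taken from the log:

    # 137 - 128 = 9; signal 9 is SIGKILL
    kill -l 9

    # "Last State: Terminated, Reason: OOMKilled" points at the memory
    # limit; a plain Error/137 alongside probe failures points at the
    # kubelet (or the runtime) doing the killing.
    kubectl describe pod <pod-name> -n mw-web

    # Recent events for the namespace: probe failures, kills, evictions
    kubectl get events -n mw-web --sort-by=.lastTimestamp | tail -n 30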
[09:48:53] jayme: yeah
[09:48:56] just getting killed though
[09:49:15] but all are down
[09:49:31] site loading for me now
[09:49:42] same
[09:50:20] ATS is still reporting 1k 503s per second globally from mw-web-ro
[09:50:24] very slowly though
[09:50:37] (getting better though)
[09:50:57] not changing panel status yet
[09:51:16] although things look better
[09:51:40] deployments are recovering
[09:51:42] coredns had liveness probes failing, thus being killed
[09:51:44] there are some edits, but very low still
[09:51:55] we have lost memcached starting at 9:15 https://logstash.wikimedia.org/goto/294875d0bc729e92ab93766e0dca76c6
[09:52:25] resolved: (7) Service mw-api-int:4446
[09:53:13] CDN reporting fewer http errors
[09:53:30] 503s seem to have recovered across the board
[09:53:42] NEL, the same
[09:53:46] for http
[09:54:00] claime: still have some on restbase (compared to before the incident)
[09:54:05] I think we are in a stable state right now, but we haven't repooled eqiad yet
[09:54:15] vgutierrez: I was only looking at the mw-on-k8s deployment
[09:54:17] getting better too
[09:54:28] claime: ack
[09:54:37] thanks for looking at the rest <3
[09:55:09] I have pasted a few bits in the incident document but I am afraid I can't help in finding the cause or chain of events. I have long lost track of our infra :/
[09:55:22] only wikifeeds left to resolve
[09:55:32] GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[09:55:52] should I switch to "monitoring" without yet closing the issue?
[09:56:02] wikifeeds trending downwards too
[09:56:04] (on the status page) thoughts?
[09:56:17] hashar: we have an inkling of what happened
[09:56:31] I will if nobody objects
[09:56:51] Personally I'd say maybe hold off another few mins, perhaps I'm too cautious
[09:57:05] hold off a bit
[09:57:06] edits are recovering
[09:57:08] ok
[09:57:09] or wait until eqiad is repooled for reads?
[09:57:10] jynus: keep it for another 5m, we fear this might move on to codfw
[09:57:20] akosiaris: thanks, I wasn't aware
[09:58:12] mw backend response time seems quite unstable (maybe it is nothing, but seems weird)
[09:58:26] it seems CPU requests got quite overloaded since 8:47. In k8s events there are some "FailedScheduling 0/179 nodes are available: 173 Insufficient cpu" and spikes in CPU requests: https://grafana.wikimedia.org/d/pz5A-vASz/kubernetes-resources?orgId=1&var-ds=thanos&var-site=eqiad&var-prometheus=k8s&from=now-3h&to=now&viewPanel=3
[09:58:26] Requests seem to be back to normal now
[09:59:16] jelto: I ran a deployment at that time
[09:59:28] !log manually bump coredns in codfw to 6
[09:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:20] !log manually bump coredns in eqiad to 6
[10:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:26] jelto: Those are transient request spikes because of deployments
[10:00:36] ok :)
[10:00:57] how are codfw reads doing so far?
[10:01:03] mw GETs, I mean
[10:02:17] I think you resolved the issue now, but you didn't update the status page?
[10:02:42] Guest67: we are waiting to be sure the issue doesn't pop up again somewhere else
[10:02:54] Guest67: we've moved some pieces around and things are better but we're not quite ready to give things a green light
[10:03:08] It is good for users to know, you should post "Trending for recovery" as a status update?
[10:06:17] can someone update me on what you are doing right now?
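A sketch of how the coredns symptoms above (failing liveness probes, CrashLoopBackOff) and the manual bump to 6 replicas could be inspected and applied on a generic cluster. The namespace, label selector and deployment name are assumptions, and here the permanent replica count belongs in the chart/helmfile values rather than in an ad-hoc kubectl call:

    # Restart counts and CrashLoopBackOff state for the DNS pods
    # (namespace and label are assumed, not taken from the log)
    kubectl get pods -n kube-system -l k8s-app=coredns -o wide

    # Liveness/readiness probe failures surface as "Unhealthy" events
    kubectl get events -n kube-system --field-selector reason=Unhealthy

    # Emergency capacity bump (the "manually bump coredns to 6" from the
    # log); make it permanent in the deployment's values afterwards.
    kubectl scale deployment coredns --replicas=6 -n kube-system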
[10:07:27] jynus: status is that we have depooled reads from eqiad
[10:07:30] wikifeeds_cluster is still showing up as degraded
[10:07:42] it's recovering
[10:07:43] and we are working on pooling it back
[10:07:50] I will do a roll-restart to bring it back quicker
[10:07:50] thank you, effie
[10:07:58] thank you also, hnowlan
[10:08:33] will upgrade the status doc
[10:08:34] jynus: the issue is ongoing
[10:08:35] *update
[10:08:49] yep, that was my understanding
[10:09:03] doc != status page
[10:09:24] jynus: we could update the status page to some degradation
[10:09:35] we are almost fully operational
[10:09:36] We are formalizing the bump from 4 to 6 replicas for coredns
[10:09:52] jynus: yeah, move to monitoring.
[10:09:54] I will set it as lightly degraded for edits and reads
[10:09:57] oh
[10:10:20] ok, setting that, then
[10:10:23] yeah we are no longer degraded for all major workloads
[10:11:40] https://www.wikimediastatus.net/
[10:11:50] I set it to degraded performance
[10:11:53] 👍
[10:11:59] as technically it is not wrong
[10:12:07] we are running with half capacity for reads
[10:12:11] ah, yeah, true. Until we repool eqiad
[10:12:16] forgot about that, thanks
[10:12:46] sorry it is hard to make everybody happy, going for a compromise after hearing everybody :-D
[10:13:13] you are all doing a great job, btw
[10:14:50] wikifeeds finally resolved
[10:16:44] we will add things in the doc, what went on was that I pushed a change that makes MediaWiki do way more DNS resolutions than usual
[10:17:01] so coredns was not able to handle that
[10:17:36] ok - and the 1,000 ft. view of the fix is to increase the number of coredns instances?
[10:17:45] topranks: that is what we did, yes
[10:17:59] we depooled eqiad, and then we bumped coredns on codfw
[10:18:05] cool
[10:19:11] https://usercontent.irccloud-cdn.com/file/P5f74PJh/grafik.png
[10:19:26] effie: I pasted more or less your summary to the doc, but you will be able to edit it later
[10:19:44] I am adding an impact summary and that is a good start
[10:19:58] Amir1: I'll steal that
[10:20:43] now we know what happens if we push the change on https://phabricator.wikimedia.org/T360029 forward
[10:21:10] Amir1: let's not learn the wrong lesson from this
[10:21:23] xD
[10:21:41] but yep, we need to carefully consider the load involved for sure
[10:21:56] for the case of DBs, I'm sure we need at least a layer of APCu there
[10:22:46] it's not hard to implement though
[10:23:54] that's to cache locally? good idea in general yeah
[10:24:06] yeah, we already do that in many places
[10:24:24] maybe a general dns cache in mw could be implemented too
[10:24:40] we don't have a local dns cache on our hosts, traditional /etc/resolv.conf setup
[10:25:07] in APCu it should be easy
[10:25:08] could also consider something like systemd-resolved, which implements a local cache at the system level
[10:25:16] cool
[10:25:42] effie: what should we do now, in your opinion? Do you think the followup to repool eqiad will take long (i.e. should we resolve now and your team work afterwards, or wait until the repool)?
[10:26:29] jynus: we will update when we are in a solid state (pun intended)
[10:26:37] he he
[10:26:46] perfect from my side
[10:46:25] Pooling back eqiad ro mw-on-k8s
[10:46:46] (e.ffie is)
[10:48:27] thank you, updating doc
[10:57:48] jynus: we are fully operational
[10:57:57] please update the page
[10:58:15] nice!
[10:58:21] thank you so much!
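On the "local cache at the system level" idea from around 10:25: a minimal sketch, assuming a systemd-based host where switching the resolver is acceptable, of enabling systemd-resolved's caching stub and checking that it actually absorbs repeated lookups. This is not what was deployed during the incident (that was the depool plus the coredns replica bump), and the hostname queried is a made-up example.

    # Enable the caching stub resolver and point libc at it
    sudo systemctl enable --now systemd-resolved
    sudo ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf

    # Repeat lookups should now be answered from the local cache
    resolvectl query db1234.eqiad.wmnet    # hypothetical hostname
    resolvectl statistics                  # shows cache size, hits, misses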
[10:59:34] lots of <3 to the service ops team
[11:01:28] not super urgent, but should I create a ticket for potential followups?
[11:13:44] sure, we want to track the work we are already doing anyway
[11:23:04] I created https://phabricator.wikimedia.org/T362766
[11:23:43] feel free to subscribe, add child tasks, edit the summary, etc.
[11:41:38] another thing I noticed (also not urgent). I guess this is also related to the memcache work, I wonder if it can be acked? https://alerts.wikimedia.org/?q=alertname%3Dmemcached%20socket&q=team%3Dsre&q=%40receiver%3Dirc-spam
[11:44:17] hello folks, I planned to move the cassandra AQS codfw instances to PKI today, things seem stable but I can postpone if you prefer
[11:53:20] I don't think that should impact ongoing improvements on app servers, and the issue was confirmed as stable atm
[11:53:46] jynus: I acked it
[11:53:52] effie: thank you!
[11:58:53] elukey: not worried, being dev hosts, but I saw cassandra-dev2001 probes down FYI (maybe unrelated) https://alerts.wikimedia.org/?q=alertname%3DProbeDown&q=team%3Dsre&q=%40receiver%3Dirc-spam
[12:03:17] jynus: ah snap didn't see it, thanks!
[12:03:31] they are on PKI so will check what's wrong
[12:04:10] weird, the instance is up
[12:10:03] jynus: fixed! I forgot to restart an instance the other day
[12:20:44] my guess is they were used to test the change, so I was notifying in case something was wrong. Great that it was just that.
[12:26:26] yes yes, only pebcak :)
[12:26:39] for aqs we also upgraded aqs1010 some days ago, nothing reported so far
[12:48:14] on-call folks: we are going to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1019810 to add magru to Puppet
[12:48:26] should be smooth but :)
[12:48:46] topranks is already on call so he can handle the pages
[12:56:21] sukhe: let's do it :)
[12:57:02] starting!
[12:58:19] there is no selective way to roll this out (as in disabling puppet on some hosts, given the extent of this change), so we will just apply it on a few and stand by as it rolls out
[13:07:21] sukhe: while this is merged, it would be great if you could please attach the related phab task as a comment to the gerrit patch, as well as mention it on phab (if not already)
[13:07:52] effie: thanks, updating
[13:07:58] since this is such a big change
[13:08:11] yeah, we should have done that in the patch itself and we missed it
[13:08:24] it happens
[13:10:17] updated, thanks!
[13:15:00] topranks:
[13:15:00] $INCLUDE netbox/wikimedia.org-magru
[13:15:03] we don't have this yet, right?
[13:15:22] it's in the patch I prepped
[13:15:27] but not merged yet
[13:15:29] right, which then adds up
[13:15:34] ok, I will comment this for now then
[13:16:04] actually - it's not there
[13:16:07] I'll add it to my patch
[13:16:11] ok, thanks
[13:16:25] I had:
[13:16:43] $INCLUDE netbox/magru.wmnet
[13:17:02] but left out the wikimedia.org one... there may be other forward zones we do like that
[13:17:04] right, in the wmnet zone you had that
[14:40:54] topranks: fabfur: https://puppetboard.wikimedia.org/report/install1004.wikimedia.org/e1f3cfaee6868e9b0319a69d157a2b6692db2edc
[14:40:59] we forgot an entry here
[14:43:26] https://gerrit.wikimedia.org/r/1020851
[14:43:52] please review
[14:45:15] think it should be ok, wait also for topranks' review
[14:45:20] ok
[14:45:26] yep
[14:45:45] thanks, merging
[14:45:49] doing ok if that's the only one
[14:49:58] https://puppetboard.wikimedia.org/report/install1004.wikimedia.org/80b92af0951e56759a0854ae52704bb0d01ffafa looks fine now
[14:50:46] 👍
[15:27:10] sukhe, fabfur: I am going to make some of the magru IPs in Netbox live now, then run the sre.dns.netbox cookbook without updating the dns servers
[15:27:32] after which I'll merge https://gerrit.wikimedia.org/r/c/operations/dns/+/1020196
[15:27:35] topranks: sounds good but I am curious about the "not update the DNS server" part
[15:27:47] that's a flag in the cookbook to not do that?
[15:28:01] Yeah: --skip-authdns-update
[15:28:08] ah nice TIL
[15:28:13] it's made for this exact scenario (or maybe some other edge cases)
[15:28:35] after the patch merge I'll do a quick sanity check and run authdns-update the old way from one of the hosts
[15:28:54] ok
[15:28:55] thanks. once this is done, we can merge the geo-resources patch and call it a day I guess
[15:29:29] yeah, I've some other stuff I'm gonna do - merge the homer patches etc., but doesn't touch any traffic things I think
[15:52:13] I failed to create any IPv4 address objects in Netbox within the two new public vlans in magru, so I added those and am re-running the netbox cookbook
[15:52:45] 195.200.68.0/27 ?
[15:53:22] yeah
[15:53:29] and 195.200.68.32/27
[15:53:54] let's see what CI says now
[15:54:43] need I have wondered :(
[16:41:37] sukhe: I gave up on that separate file for the /29 in the end, seems to be almost there but getting this one last error
[16:41:42] maybe a second set of eyes might spot it
[16:41:54] 17:37:10 error: rfc1035: Zone 10.in-addr.arpa.: Zonefile parse error at file /tmp/dns-check.o2ckx69w/zones/10.in-addr.arpa line 2999: General parse error
[16:42:39] looking! I was also looking at why $INCLUDE netbox/128-29.58.15.185.in-addr.arpa
[16:42:56] but there is a separate include for the 185.15.58.128/29?
[16:43:04] so basically what you just did is also there for drmrs?
[16:43:36] is there?
[16:43:38] ah ffs
[16:43:39] ok
[16:43:55] I must have looked at 185.15.59.128/27 (esams) instead
[16:44:21] yeah I was looking at the drmrs one 58.15.185.in-addr.arpa
[16:44:49] nope I was looking at .58 and grepping what the cookbook made and just missed it
[16:44:52] looking at the other one now
[16:45:18] head spinning at this stage with these :P
[16:45:20] thanks
[16:49:43] oh
[16:49:49] line 682, # should be ;
[16:49:57] in templates/10.in-addr.arpa
[16:50:11] \o/
[16:50:40] hopefully this is it :)
[16:50:43] thanks man I was losing it a bit
[16:53:06] merge it :)
[16:53:27] woot!
[16:53:37] sorry, you will hate me
[16:53:39] https://gerrit.wikimedia.org/r/c/operations/dns/+/1020196/12/templates/68.200.195.in-addr.arpa
[16:53:46] you don't need your TODO comment in line 22
[16:53:51] what you did is already correct
[16:53:59] but feel free to ignore and push the red (green?) button
[16:54:56] oh shit
[16:55:09] I'll send another patch for that shortly - good spot
[16:55:51] ok, gonna run authdns-update now
[16:55:58] here's hoping :)
[16:56:20] <3
[16:57:54] seems to have gone fine!
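The parse error above came down to a "#" where RFC 1035 zone files expect ";" as the comment character. A quick way to grep for that class of mistake before running the full CI check, assuming the operations/dns checkout layout referenced in the log (templates/ holding the hand-maintained zone files):

    # Zone files use ';' for comments; a leading '#' is a parse error.
    grep -n '^[[:space:]]*#' templates/10.in-addr.arpa

    # Or sweep every zone template at once
    grep -rn '^[[:space:]]*#' templates/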
[16:59:00] cathal@officepc:~$ dig +noall +answer cr1-magru.wikimedia.org @1.1.1.1
[16:59:00] cr1-magru.wikimedia.org. 3600 IN A 195.200.68.128
[16:59:04] \o/
[16:59:16] :D
[16:59:36] 12:59:26 [sukhe@azadi ~] dig -x 2a02:ec80:700:ed1a::2:b +short
[16:59:36] upload-lb.magru.wikimedia.org.
[17:01:06] You'll find the reverse for the IPv4 address of that isn't working though
[17:01:15] I think it needs delegation at the RIPE level
[17:01:18] * topranks looking
[17:01:57] not working indeed, yep
[17:09:03] Hopefully getting better
[17:09:07] https://www.irccloud.com/pastebin/YoRKQAkO/
[17:09:21] yep looking better indeed!
[17:11:29] topranks: I am merging the geo-resources change
[17:15:01] sukhe: ok yes should be ok I think
[17:15:10] I am a little confused about the v4 PTRs though
[17:15:11] https://phabricator.wikimedia.org/P60806
[17:15:42] getting 'refused' code back from our authdns - any idea why?
[17:17:49] topranks: looking
[17:19:11] topranks: what is 195.200.58.128?
[17:19:27] should point to cr1-magru
[17:19:32] 13:19:20 [sukhe@azadi ~] dig +nsid -x 195.200.68.224 @ns0.wikimedia.org +noall +answer
[17:19:35] 224.68.200.195.in-addr.arpa. 3600 IN PTR text-lb.magru.wikimedia.org.
[17:19:43] topranks: I don't see it on netbox either?
[17:20:06] 148?
[17:20:11] cmooney@dns2005:/etc/gdnsd/zones/netbox$ cat 128-29.68.200.195.in-addr.arpa
[17:20:11] 128 1H IN PTR cr1-magru.wikimedia.org.
[17:20:23] 68.148
[17:20:26] I think you have 58.148 above?
[17:20:41] 128 sorry
[17:20:57] text-lb is working
[17:20:59] for me too
[17:21:12] so the zone is working properly
[17:21:15] let me check again
[17:21:18] right
[17:21:27] but cr1-magru is 195.200.68.128/32
[17:21:34] in the paste above, you are doing dig +nsid -x 195.200.58.128 @ns0.wikimedia.org
[17:21:50] ;; ANSWER SECTION:
[17:21:51] 128.68.200.195.in-addr.arpa. 3600 IN PTR cr1-magru.wikimedia.org.
[17:21:52] 58, 68 got it :(
[17:22:24] it's so easy to get all the numbers mixed up
[17:22:40] yep, but it looks fixed now, so thanks!
[17:22:40] we should have some system so we can use easy-to-remember, human-readable names instead
[17:22:42] lol
[17:22:45] and get the computers to translate!
[17:23:10] thanks all for the help!
[17:23:55] no, thank you!
[17:23:56] 13:23:44 [sukhe@azadi ~] dig +subnet=195.200.68.0/24 en.wikipedia.org +noall +answer
[17:23:59] en.wikipedia.org. 20338 IN CNAME dyna.wikimedia.org.
[17:24:01] dyna.wikimedia.org. 283 IN A 195.200.68.224
[17:24:10] topranks: it's easy, just do `dig +nsid -x $(dig +short A cr1-magru.wikimedia.org)`
[17:24:11] ok that's it for today right :)
[17:24:21] ;)
[17:24:55] that would require me to have not messed up the forward zone too :P
[17:25:01] 😂
[17:25:42] the answer is clearly that you're working too late in the day
[17:26:10] yeah he has been putting in v6 PTRs all day, I am surprised he still has the energy to keep going :)
[17:26:32] sukhe: I get fatigued after putting in merely half of one of those
[17:26:41] same!
[17:27:14] First place I worked I had to do fully-manual v6 reverse zones
[17:27:28] plus we did our IPAM, including for v6, in a text file (no fancy Excel or anything!)
[17:27:43] it doesn't get any easier :D
[17:28:53] topranks: at least in Excel if you made a mistake, you could undo easily, unlike you know, Netbox
[17:28:58] ok, I'll stop with the Netbox jokes now
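For the reverse-delegation question above ("I think it needs delegation at the RIPE level"), a few dig invocations that distinguish "the zone exists on our authdns" from "the parent in-addr.arpa zone actually delegates it to us":

    # Does the parent zone delegate 68.200.195.in-addr.arpa to our NSes?
    dig NS 68.200.195.in-addr.arpa +noall +answer +authority

    # Walk the chain from the root; a missing RIPE-level delegation shows
    # up as the trace dead-ending at the RIR's servers.
    dig +trace -x 195.200.68.128

    # Compare with what our authoritative server answers directly.
    dig -x 195.200.68.128 @ns0.wikimedia.org +noall +answer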