[00:28:44] serviceops, FlaggedRevs: Migrate flaggedrevs jobs to mw-cron - https://phabricator.wikimedia.org/T388535#10818941 (Scott_French) Open→Resolved The first run at 00:08 today (May 14th) completed without issue, and with very similar total elapsed time to the bare-metal case (a bit more than 6m)....
[08:02:29] o/ I'd like to deploy a quick cleanup to changeprop-jobqueue config (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1035732), how should I proceed? Is this something I can deploy myself or is it better to ask one of you to do it?
[08:19:52] dcausse: o/ I can help if you want.. usually what I do is to check the changeprop-jobqueue dashboard first, to make sure that nothing is ongoing, and then deploy to the standby DC first. Wait a bit for the dashboard to settle, and then deploy to the active one.
[08:20:45] effie: o/
[08:20:49] elukey: sounds good, thanks! I can take care of this in ~10min if you're around to assist me if something goes wrong
[08:20:57] sure!
[08:21:43] effie: any plans to merge/deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1074168 ? It is blocking another rollout that I am doing :(
[08:32:21] elukey: actually mw is being deployed, will wait for the train to finish and ping you when I start
[08:34:03] yes better :)
[08:50:06] elukey: let me understand why it is blocking you, I have been fiddling with this particular patch on and off
[08:51:46] dcausse: you can use the mediawiki Infra window if you like, I have something to add to the calendar today too, I can wait for you to go first
[08:52:26] effie: thanks!
[08:52:51] effie: I am upgrading mesh.configuration to 1.13, and sextant upgrades other things as part of it. One of them is app.job, which reverts https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1074177 for ipoid
[08:52:59] but then helm fails to render the template
[08:53:44] I can re-apply the tmp fix, but it will bite other folks in the future
[08:58:23] I have an update for the module ready, to sort it
[08:59:50] please let me take a look first
[09:00:03] yep https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1074168
[09:00:10] that was my first link :)
[09:00:22] I left a comment but it looks good even as is
[09:00:51] I will read it all, if you give me some time please
[09:05:51] effie: yes sorry it wasn't something like "I need this now", it was more "hey, let's do it in the next day or two if possible otherwise np"
[09:06:42] I am happy to sort it today actually, it is no problem
[09:07:46] elukey: no one else is using this module, I just need to verify that the update does right by ipoid
[10:15:22] o/ I'm trying to rule out calico from the db connection pile-up. If someone can confirm this, I'd appreciate it. Let me explain (warning, long text ahead): we have incidents where connections in sections randomly start to pile up. In some sections they double or triple; in x1 it's usually tenfold. Here is an example:
[10:15:22] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=$__all&var-shard=$__all&var-role=$__all&from=2025-05-11T09:17:54.766Z&to=2025-05-11T10:20:17.358Z&timezone=utc (go to connections). Usually it's a slow query, but nothing is showing up in logstash logs, traces, not even rdbms logs from before the incident. On top of that, the read handler stats in mariadb don't go up anymore: e.g.
[10:15:22] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=2025-05-11T09:17:54.000Z&to=2025-05-11T10:20:17.000Z&timezone=utc&var-job=$__all&var-server=db1179&var-port=9104&refresh=1m is normal now except for many connections that are asleep.
[10:15:22] Now the mw part. Looking at the graphs at the same time, the rate (or anything else) doesn't go up: https://grafana.wikimedia.org/d/U7JT--knk/mediawiki-on-k8s?orgId=1&from=2025-05-11T09:17:54.766Z&to=2025-05-11T10:20:17.358Z&timezone=utc&var-dc=000000017&var-namespace=mw-api-ext&var-release=main&var-container_name=$__all&var-site=arwiki&var-kubernetes_pod_name=All&refresh=1m but latency goes up out of nowhere. This corresponds with bumps
[10:15:22] in cpu usage of calico https://grafana.wikimedia.org/d/2AfU0X_Mz/calico-resource-usage?orgId=1&from=2025-05-11T09:17:54.000Z&to=2025-05-11T10:20:17.000Z&timezone=utc&var-container_name=$__all&var-site=eqiad&var-prometheus=$__all but this might be just a symptom (at least I'm sure CPU is not busy). I also see a jump in tko in memcached but can't be sure:
[10:15:22] https://grafana.wikimedia.org/d/ltSHWhHIk/mw-mcrouter?orgId=1&from=2025-05-11T09:17:54.000Z&to=2025-05-11T10:20:17.000Z&timezone=utc&var-site=eqiad&var-prometheus=k8s&var-memcached_server=$__all&var-kubernetes_namespace=mw-mcrouter&var-kubernetes_pod_name=$__all&var-container=mcrouter-main-mcrouter
[10:19:28] Amir1: I have been tracking some of this behaviour here https://phabricator.wikimedia.org/T371881
[10:21:20] but it seems that this has gotten worse than it used to be
[10:21:45] Amir1: re: calico graph: at least the dashboard you linked averages over all clusters in one site, which is probably not what you want (includes ml, staging, aux, ...)
[10:22:37] https://grafana.wikimedia.org/d/2AfU0X_Mz/calico-resource-usage?orgId=1&from=2025-05-11T09:17:54.000Z&to=2025-05-11T10:20:17.000Z&timezone=utc&var-container_name=$__all&var-site=eqiad&var-prometheus=k8s ?
[10:22:41] and we had some issues with terminated connections around ferm restarts - which is, IIRC, also what effie talks about
[10:23:00] Amir1: yeah, that's just prod wikikube in eqiad now
[10:23:26] the dip at 9:55 UTC is worth looking into, unless there was a deployment or something
[10:24:30] because there is a similar dip in the memcached traffic, and the TKOs registered may not be enough to justify it
[10:24:33] effie: that is the circuit breaking kicking in
[10:24:44] just killing stuff until it recovers
[10:25:04] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821027 (cmooney) @akosiaris is there any update on this one? If I recall correctly from our discussion at the SRE Summit the curr...
[10:28:51] Amir1: let's chat a bit today if you have time
[10:29:36] sure, I have a couple of meetings soon but afterward sure
[10:30:37] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821067 (cmooney)
[10:34:13] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821085 (Vgutierrez) @cmooney that also implies increasing MTU on the LVS host as well, right?
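A quick way to put numbers on the "many connections that are asleep" observation above is to group the idle entries in the processlist by client host and user. The sketch below is purely illustrative, assuming a PyMySQL client and placeholder host/credentials; it is not part of any existing WMF tooling.

```python
import pymysql  # assumption: PyMySQL is available; any DB-API driver works similarly

# Placeholder host and credentials -- illustrative only, not real configuration.
conn = pymysql.connect(host="db1179.example", user="readonly", password="...",
                       database="information_schema")
try:
    with conn.cursor() as cur:
        # Group idle ("Sleep") connections by client host and user, to see
        # which application layer is holding them open and for how long.
        cur.execute(
            """
            SELECT SUBSTRING_INDEX(HOST, ':', 1) AS client, USER,
                   COUNT(*) AS n, MAX(TIME) AS max_idle_seconds
            FROM PROCESSLIST
            WHERE COMMAND = 'Sleep'
            GROUP BY client, USER
            ORDER BY n DESC
            """
        )
        for client, user, n, max_idle in cur.fetchall():
            print(f"{client:40s} {user:20s} {n:6d} idle, max {max_idle}s")
finally:
    conn.close()
```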
[10:41:29] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821122 (cmooney)
[11:00:50] serviceops, Patch-For-Review: sre.discovery cookbooks: refactor use of resolve_with_client_ip - https://phabricator.wikimedia.org/T393600#10821187 (JMeybohm) Open→Resolved
[11:36:54] serviceops, DBA, Editing-team (Tracking), MW-1.44-notes (1.44.0-wmf.28; 2025-05-06), and 3 others: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, t... - https://phabricator.wikimedia.org/T393513#10821308
[11:57:02] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821401 (cmooney) >>! In T352956#10821085, @Vgutierrez wrote: > @cmooney that also implies increasing MTU on the LVS host as well,...
[12:03:46] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821441 (JMeybohm) >>! In T352956#10821401, @cmooney wrote: >>>! In T352956#10821085, @Vgutierrez wrote: >> @cmooney that also impl...
[12:07:38] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821470 (Vgutierrez) That would be enough to accommodate IPv4 and IPv6? We currently clamp at 1440 bytes for ipv4 and at 1400 bytes...
[12:22:03] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821513 (Vgutierrez) Nevermind, we only do ipv4 for low-traffic/internal services
[13:12:53] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821714 (cmooney) >>! In T352956#10821470, @Vgutierrez wrote: > That would be enough to accommodate IPv4 and IPv6? We currently cla...
[13:35:15] serviceops, DBA, Editing-team (Tracking), MW-1.44-notes (1.44.0-wmf.28; 2025-05-06), and 3 others: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. In order to protect application servers, t... - https://phabricator.wikimedia.org/T393513#10821839
[13:46:41] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821894 (akosiaris) >>! In T352956#10821027, @cmooney wrote: > @akosiaris is there any update on this one? > > If I recall correct...
[16:09:00] serviceops, MediaWiki-extensions-CentralAuth, MW-on-K8s, MediaWiki-Platform-Team (Radar), Patch-For-Review: Migrate CentralAuth maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385866#10822999 (Scott_French) The first post-migration run of purge-expired-userrights succeeded ea...
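For context on the MSS clamp values mentioned in the T352956 thread above (1440 bytes for IPv4, 1400 for IPv6): those figures are consistent with reserving room for one extra encapsulation header on the way to the realserver. The arithmetic below is a back-of-the-envelope sketch using standard header sizes, not a statement of how the clamping is actually configured.

```python
# Standard header sizes in bytes; protocol constants, not WMF configuration.
ETH_MTU = 1500   # typical edge MTU
IPV4_HDR = 20    # IPv4 header, no options
IPV6_HDR = 40    # fixed IPv6 header
TCP_HDR = 20     # TCP header, no options

# TCP MSS that still fits once the load balancer adds one extra outer header
# (IPv4-in-IPv4 for v4 traffic, IPv6-in-IPv6 for v6 traffic):
mss_v4 = ETH_MTU - IPV4_HDR - IPV4_HDR - TCP_HDR   # 1440
mss_v6 = ETH_MTU - IPV6_HDR - IPV6_HDR - TCP_HDR   # 1400

print(mss_v4, mss_v6)  # lines up with the 1440/1400 clamp values quoted above
```

The alternative raised in the thread, increasing the MTU on the LVS host and along the tunnel path by the same overhead, would avoid shrinking the MSS at all.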
[18:09:33] serviceops, MediaWiki-extensions-CentralAuth, MW-on-K8s, MediaWiki-Platform-Team (Radar): Migrate CentralAuth maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385866#10823556 (Scott_French)
[18:10:45] serviceops, MediaWiki-extensions-CentralAuth, MW-on-K8s, MediaWiki-Platform-Team (Radar): Migrate CentralAuth maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385866#10823559 (Scott_French) Open→Resolved The first post-migration hourly runs of centralauth-backfilllocal...
[20:23:19] serviceops, Scap, Release-Engineering-Team (Priority Backlog 📥), Sustainability (Incident Followup): scap should check if it is running within a tmux/screen - https://phabricator.wikimedia.org/T361724#10824085 (brennen) After discussion in a #spiderpig demo meeting, I'm noting here that this will...
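On the last item (T361724, scap checking whether it runs inside tmux/screen): the usual signal is environment variables set by the multiplexer. The snippet below is a minimal sketch of that kind of check, assuming only the standard TMUX/STY/TERM variables; it is not scap's actual implementation.

```python
import os

def running_in_multiplexer() -> bool:
    """Best-effort check for tmux or GNU screen.

    tmux sets $TMUX inside its panes and screen sets $STY; a $TERM value
    starting with "screen" or "tmux" is a weaker secondary signal.
    """
    if os.environ.get("TMUX") or os.environ.get("STY"):
        return True
    return os.environ.get("TERM", "").startswith(("screen", "tmux"))

if __name__ == "__main__":
    if not running_in_multiplexer():
        print("warning: not running inside tmux/screen; a dropped SSH session "
              "would kill this deployment")
```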