[00:48:39] (03PS1) 10Ladsgroup: 404.php: Force a redirect to /wiki/ in very obvious cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) [00:55:04] (03PS2) 10Ladsgroup: 404.php: Force a redirect to /wiki/ in very obvious cases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) [00:59:17] (03CR) 10Ladsgroup: "(I put this on mw-experimental in eqiad and tested it. it looks nice)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [01:09:42] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1288283 [01:09:42] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1288283 (owner: 10TrainBranchBot) [01:21:15] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1288283 (owner: 10TrainBranchBot) [02:00:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:09:24] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:24] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:07:08] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on 
cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:13:42] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:30:52] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:33:48] PROBLEM - MariaDB Replica Lag: m2 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 650.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:34:48] RECOVERY - MariaDB Replica Lag: m2 on db2160 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:32:01] (03PS1) 10Pppery: Allow Vector 2022 font size changes in namespace 100 for enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288370 (https://phabricator.wikimedia.org/T423766) [05:07:08] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:08:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc[2014,2024].codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance on pc4 [05:11:59] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [05:12:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [05:14:39] !log 
marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [05:14:39] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99) [05:21:15] (03PS1) 10Marostegui: mariadb: Productionize pc2024 [puppet] - 10https://gerrit.wikimedia.org/r/1288387 (https://phabricator.wikimedia.org/T418973) [05:28:40] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize pc2024 [puppet] - 10https://gerrit.wikimedia.org/r/1288387 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [05:29:38] (03CR) 10Nikerabbit: [C:03+1] ULS rewrite: Enable on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287288 (https://phabricator.wikimedia.org/T426288) (owner: 10Abijeet Patro) [05:30:22] (03CR) 10KartikMistry: [C:03+1] ULS rewrite: Enable on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287288 (https://phabricator.wikimedia.org/T426288) (owner: 10Abijeet Patro) [05:36:51] (03CR) 10Marostegui: [C:03+1] sre.mysql.major-upgrade: Support reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1285355 (https://phabricator.wikimedia.org/T425417) (owner: 10Federico Ceratto) [05:37:51] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11929240 (10Marostegui) Thank you so much! 
[05:41:53] (03PS1) 10Marostegui: db1169: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1288394 [05:42:15] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1288394 (owner: 10Marostegui) [05:42:38] (03CR) 10Marostegui: [C:03+2] db1169: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1288394 (owner: 10Marostegui) [06:00:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:08:08] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:33:03] !log installing Linux 6.12.88 on trixie hosts [06:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:27] FIRING: [43x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:44:23] !log installing openssl bugfix updates from trixie point release [06:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:43] (03PS1) 10Slyngshede: data.yaml offboarding kgraessle [puppet] - 10https://gerrit.wikimedia.org/r/1288427 [06:49:13] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1288427 (owner: 10Slyngshede) [06:49:39] !log installing glibc bugfix updates from trixie point release [06:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:20] (03CR) 10Slyngshede: [C:03+2] data.yaml offboarding kgraessle [puppet] - 10https://gerrit.wikimedia.org/r/1288427 
(owner: 10Slyngshede) [06:54:08] (03PS1) 10Marostegui: instances.yaml: Remove pc2013 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1288428 (https://phabricator.wikimedia.org/T426555) [06:54:22] !log installing Linux 6.1.172 on bookworm hosts [06:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:49] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove pc2013 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1288428 (https://phabricator.wikimedia.org/T426555) (owner: 10Marostegui) [06:57:02] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Kgraessle out of all services on: 2468 hosts [06:59:16] !log installing systemd bugfix updates from trixie point release [06:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T0700). [07:00:05] hubaishan: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:03:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove pc2013 from dbctl T426555', diff saved to https://phabricator.wikimedia.org/P92557 and previous config saved to /var/cache/conftool/dbconfig/20260518-070322-marostegui.json [07:03:26] T426555: decommission pc2013.codfw.wmnet - https://phabricator.wikimedia.org/T426555 [07:04:56] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts pc2013.codfw.wmnet [07:05:51] (03PS1) 10Marostegui: mariadb: Decommission pc2013 [puppet] - 10https://gerrit.wikimedia.org/r/1288433 (https://phabricator.wikimedia.org/T426555) [07:06:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287288 (https://phabricator.wikimedia.org/T426288) (owner: 10Abijeet Patro) [07:06:35] (03CR) 10Marostegui: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/1288433 (https://phabricator.wikimedia.org/T426555) (owner: 10Marostegui) [07:06:43] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission pc2013 [puppet] - 10https://gerrit.wikimedia.org/r/1288433 (https://phabricator.wikimedia.org/T426555) (owner: 10Marostegui) [07:08:08] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:09:39] FIRING: DiskSpace: Disk space config-master2001:9100:/ 3.226% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=config-master2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:09:40] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [07:11:00] 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for zsinger - 
https://phabricator.wikimedia.org/T426458#11929358 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF Hi @ZSinger-WMF access to the wmf group can be requests via https://idm.wikimedia.org/permissions/ [07:14:27] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc2013.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [07:14:56] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pc2013.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [07:14:56] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:14:57] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc2013.codfw.wmnet [07:23:03] (03PS1) 10MVernon: swift: restore 2 nodes to rings, drain final 2 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1288441 (https://phabricator.wikimedia.org/T354872) [07:24:34] RESOLVED: DiskSpace: Disk space config-master2001:9100:/ 3.136% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=config-master2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:25:50] !log clean up space on cloudcumin1001: apt archives and older kernels [07:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:54] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2013.codfw.wmnet - https://phabricator.wikimedia.org/T426555#11929387 (10Marostegui) a:05Marostegui→03Jhancock.wm [07:27:03] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2013.codfw.wmnet - https://phabricator.wikimedia.org/T426555#11929392 (10Marostegui) Ready for DCOps [07:30:29] (03CR) 
10Brouberol: [C:03+1] growtbook-next: New release that supports status as a filter for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288237 (https://phabricator.wikimedia.org/T421800) (owner: 10Santiago Faci) [07:30:52] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:33:48] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T426221#11929429 (10Jclark-ctr) 05Open→03Resolved [07:35:56] !log installing openssl bugfix updates from bookworm point release [07:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new line card in cr2-eqiad slot 0, move card from slot 1 to cr1-eqiad slot 0 and configure - https://phabricator.wikimedia.org/T426343#11929433 (10ayounsi) We have 2 new linecards coming, one for each router, so afaik we don't... 
[07:39:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [07:41:08] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission pc2013.codfw.wmnet - https://phabricator.wikimedia.org/T426555#11929442 (10Marostegui) [07:42:57] (03CR) 10Filippo Giunchedi: [C:03+2] Designate: move zookeeper config into hiera [puppet] - 10https://gerrit.wikimedia.org/r/1283000 (https://phabricator.wikimedia.org/T422646) (owner: 10Andrew Bogott) [07:45:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [07:46:53] !log installing systemd bugfix updates from bookworm point release [07:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:28] (03CR) 10Marostegui: [C:03+1] swift: restore 2 nodes to rings, drain final 2 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1288441 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [07:49:48] (03CR) 10MVernon: [C:03+2] swift: restore 2 nodes to rings, drain final 2 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1288441 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [07:53:14] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dsantamaria - https://phabricator.wikimedia.org/T426561 (10DSantamaria) 03NEW [07:57:39] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11929482 (10MatthewVernon) [07:59:08] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:01:56] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dsantamaria - https://phabricator.wikimedia.org/T426561#11929500 (10SLyngshede-WMF) [08:03:18] 
(03CR) 10Santiago Faci: [C:03+2] growtbook-next: New release that supports status as a filter for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288237 (https://phabricator.wikimedia.org/T421800) (owner: 10Santiago Faci) [08:04:58] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dsantamaria - https://phabricator.wikimedia.org/T426561#11929504 (10SLyngshede-WMF) @thcipriani for your approval. [08:05:32] (03Merged) 10jenkins-bot: growtbook-next: New release that supports status as a filter for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288237 (https://phabricator.wikimedia.org/T421800) (owner: 10Santiago Faci) [08:07:00] (03CR) 10Joal: "Some confirmations needed from Ben, otherwise lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [08:07:17] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dsantamaria - https://phabricator.wikimedia.org/T426561#11929507 (10SLyngshede-WMF) [08:07:25] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dsantamaria - https://phabricator.wikimedia.org/T426561#11929508 (10SLyngshede-WMF) 05Open→03In progress p:05Triage→03Medium [08:08:14] I'll be deploying gerrit:1287288, beta only. [08:09:59] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Investigate db2218 crash - https://phabricator.wikimedia.org/T426383#11929513 (10Marostegui) p:05Triage→03High [08:10:29] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Investigate db2218 crash - https://phabricator.wikimedia.org/T426383#11929514 (10Marostegui) Setting to high as this is candidate master - probably worth also switching to a different host as this one isn't stable and we don't know yet the cause of the crash. 
[08:10:48] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1287288 (https://phabricator.wikimedia.org/T426288) (owner: Abijeet Patro)
[08:11:38] (Merged) jenkins-bot: ULS rewrite: Enable on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1287288 (https://phabricator.wikimedia.org/T426288) (owner: Abijeet Patro)
[08:12:12] !log installing glibc bugfix updates from bookworm point release
[08:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet
[08:14:09] kart_, will do a quick check
[08:14:37] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply
[08:15:37] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook-next: apply
[08:19:03] SRE, Pywikibot, Traffic, Wikidata, and 3 others: Pywikibot reports maxlag retry error - https://phabricator.wikimedia.org/T421642#11929560 (Xqt) >>! In T421642#11928938, @Arcstur wrote: > @Xqt is it possible to check maxlag PER user-agent? I guess that coul help us sort thing out. I found this i...
[08:19:05] (PS1) MVernon: swift: remove 2 drained eqiad backends for reimage [puppet] - https://gerrit.wikimedia.org/r/1288458
[08:19:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet
[08:22:09] SRE, SRE-Access-Requests: Requesting access to deployment for dsantamaria - https://phabricator.wikimedia.org/T426561#11929569 (SLyngshede-WMF)
[08:23:29] SRE, SRE-Access-Requests: Requesting access to deployment for dsantamaria - https://phabricator.wikimedia.org/T426561#11929576 (SLyngshede-WMF) a:SLyngshede-WMF SSH key verified out of band.
[08:24:02] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.major-upgrade: Support reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1285355 (https://phabricator.wikimedia.org/T425417) (owner: 10Federico Ceratto) [08:26:59] (03PS4) 10Federico Ceratto: sre.mysql.major-upgrade: Support reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1285355 (https://phabricator.wikimedia.org/T425417) [08:27:06] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] sre.mysql.major-upgrade: Support reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1285355 (https://phabricator.wikimedia.org/T425417) (owner: 10Federico Ceratto) [08:29:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by javiermonton@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287366 (https://phabricator.wikimedia.org/T423920) (owner: 10JavierMonton) [08:30:58] (03PS1) 10Ayounsi: Add alerting for high/low optics power level [alerts] - 10https://gerrit.wikimedia.org/r/1288462 [08:30:59] (03Merged) 10jenkins-bot: stream: mediawiki.page_html_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287366 (https://phabricator.wikimedia.org/T423920) (owner: 10JavierMonton) [08:31:14] !log javiermonton@deploy1003 Started scap sync-world: Backport for [[gerrit:1287366|stream: mediawiki.page_html_content_change (T423920)]] [08:31:18] T423920: Streaming HTML & Edit Types - productionization checklist - https://phabricator.wikimedia.org/T423920 [08:34:04] (03PS2) 10Atsuko: opensearch-cluster: full permission for anonymous users [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287920 (https://phabricator.wikimedia.org/T426073) [08:34:40] (03CR) 10Jcrespo: [C:03+1] swift: remove 2 drained eqiad backends for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1288458 (owner: 10MVernon) [08:34:59] (03CR) 10MVernon: [C:03+2] swift: remove 2 drained eqiad backends for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1288458 (owner: 10MVernon) [08:36:59] (03CR) 10Ayounsi: 
"Once deployed, we can disable the matching librenms alerts" [alerts] - 10https://gerrit.wikimedia.org/r/1288462 (owner: 10Ayounsi) [08:39:24] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:40:32] (03PS3) 10CWilliams: mariadb: Decommission db2152 [puppet] - 10https://gerrit.wikimedia.org/r/1287414 (https://phabricator.wikimedia.org/T424344) [08:40:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1082.eqiad.wmnet with OS bullseye [08:41:05] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11929772 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1082.eq... 
[08:41:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be1082
[08:42:50] (CR) Atsuko: [C:+2] opensearch-cluster: full permission for anonymous users [deployment-charts] - https://gerrit.wikimedia.org/r/1287920 (https://phabricator.wikimedia.org/T426073) (owner: Atsuko)
[08:44:20] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[08:44:24] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:46:25] (Merged) jenkins-bot: opensearch-cluster: full permission for anonymous users [deployment-charts] - https://gerrit.wikimedia.org/r/1287920 (https://phabricator.wikimedia.org/T426073) (owner: Atsuko)
[08:47:31] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users for ArthurTaylor - https://phabricator.wikimedia.org/T424317#11929863 (ArthurTaylor) Hi @Dzahn, I can confirm that I now see data in Superset. That didn't seem to work for me before opening this ticket, so I don't know what chang...
[08:50:05] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply
[08:50:09] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply
[08:50:14] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply
[08:50:14] !log javiermonton@deploy1003 javiermonton: Backport for [[gerrit:1287366|stream: mediawiki.page_html_content_change (T423920)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:50:19] T423920: Streaming HTML & Edit Types - productionization checklist - https://phabricator.wikimedia.org/T423920 [08:50:21] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-toolhub-test: apply [08:50:57] !log javiermonton@deploy1003 javiermonton: Continuing with deployment [08:52:03] mvernon@cumin2002 reimage (PID 1520565) is awaiting input [08:52:43] (03PS1) 10Hashar: Unarchive the repository [debs/gdnsd] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1288469 [08:53:33] (03CR) 10Hashar: [V:03+2 C:03+2] Unarchive the repository [debs/gdnsd] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1288469 (owner: 10Hashar) [08:55:56] (03PS1) 10Ayounsi: Remove now gone include [dns] - 10https://gerrit.wikimedia.org/r/1288473 [08:56:20] (03CR) 10CWilliams: [C:03+2] mariadb: Decommission db2152 [puppet] - 10https://gerrit.wikimedia.org/r/1287414 (https://phabricator.wikimedia.org/T424344) (owner: 10CWilliams) [08:56:41] (03CR) 10CI reject: [V:04-1] Remove now gone include [dns] - 10https://gerrit.wikimedia.org/r/1288473 (owner: 10Ayounsi) [08:58:24] (03PS2) 10Ayounsi: Remove now gone includes [dns] - 10https://gerrit.wikimedia.org/r/1288473 [08:59:42] !log cwilliams@cumin1003 START - Cookbook sre.hosts.decommission for hosts db2152.codfw.wmnet [08:59:58] (03CR) 10Btullis: [C:03+2] "Thanks. 
I believe that we can tell that Spark 3.1 is unaffected, since all of the DAGs on https://airflow-analytics-test.wikimedia.org are" [puppet] - 10https://gerrit.wikimedia.org/r/1287837 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [09:01:08] (03CR) 10MVernon: [C:03+1] Remove now gone includes [dns] - 10https://gerrit.wikimedia.org/r/1288473 (owner: 10Ayounsi) [09:01:23] (03CR) 10Ayounsi: [C:03+2] Remove now gone includes [dns] - 10https://gerrit.wikimedia.org/r/1288473 (owner: 10Ayounsi) [09:01:41] !log ayounsi@dns1004 START - running authdns-update [09:02:49] !log javiermonton@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287366|stream: mediawiki.page_html_content_change (T423920)]] (duration: 31m 35s) [09:02:52] T423920: Streaming HTML & Edit Types - productionization checklist - https://phabricator.wikimedia.org/T423920 [09:03:16] !log ayounsi@dns1004 END - running authdns-update [09:04:33] !log cwilliams@cumin1003 START - Cookbook sre.dns.netbox [09:05:49] !log jnuche@deploy1003 Installing scap version "4.265.2" for 163 host(s) [09:06:56] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1082 - mvernon@cumin2002" [09:07:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1082 - mvernon@cumin2002" [09:07:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:07:32] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be1082.eqiad.wmnet 52.32.64.10.in-addr.arpa 2.5.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:07:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be1082.eqiad.wmnet 52.32.64.10.in-addr.arpa 
2.5.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:07:37] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1082 [09:07:56] 06SRE, 06Traffic: ASW single-point of failure for LVS VIPs at POPs - https://phabricator.wikimedia.org/T362772#11930012 (10ayounsi) 05Resolved→03Open a:05cmooney→03None Reopening as I think that should still be on Traffic's radar and up to traffic to close the task. I agree that the best long term fix... [09:08:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1082 [09:08:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be1082 [09:08:26] !log cwilliams@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2152.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwilliams@cumin1003" [09:08:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2152.codfw.wmnet decommissioned, removing all IPs except the asset tag one - cwilliams@cumin1003" [09:08:32] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:08:33] !log cwilliams@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2152.codfw.wmnet [09:10:19] !log jnuche@deploy1003 Installing scap version "4.265.2" for 1 host(s) [09:11:16] !log jnuche@deploy1003 Installation of scap version "4.265.2" completed for 1 hosts [09:13:09] !log Removing db2152.codfw.wmnet from zarcillo T424344 [09:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:12] T424344: decommission db2152.codfw.wmnet - https://phabricator.wikimedia.org/T424344 [09:13:29] 06SRE, 06Infrastructure-Foundations: Could not resolve hostname 
bast5004.wikimedia.org: nodename nor servname provided, or not known - https://phabricator.wikimedia.org/T426488#11930035 (10LSobanski) [09:13:47] (03CR) 10Brouberol: [C:03+2] stream: webrequest-page-view [puppet] - 10https://gerrit.wikimedia.org/r/1287906 (https://phabricator.wikimedia.org/T426425) (owner: 10JavierMonton) [09:18:04] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T426503#11930052 (10Jclark-ctr) a:03Jclark-ctr [09:18:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T426503#11930053 (10Jclark-ctr) ps1-c4-eqiad.mgmt.eqiad.wmnet #1: Sensor: Phase, BA:L2-L3, Active Power Value: 1.896 kW (power) Thresholds: High: 1650 [09:18:53] !log installing Java 21 security updates [09:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1086.eqiad.wmnet with OS bullseye [09:20:07] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11930061 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1086.eq... 
[09:20:33] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be1086 [09:21:09] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [09:25:06] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1086 - mvernon@cumin2002" [09:25:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1082.eqiad.wmnet with reason: host reimage [09:25:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be1086 - mvernon@cumin2002" [09:25:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:25:12] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be1086.eqiad.wmnet 18.32.64.10.in-addr.arpa 8.1.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:25:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be1086.eqiad.wmnet 18.32.64.10.in-addr.arpa 8.1.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:25:17] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1086 [09:25:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1086 [09:25:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be1086 [09:26:09] (03CR) 10Btullis: [C:03+2] Add a hadoop::spark35 profile and deploy it alongside hadoop::spark3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1287855 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [09:28:00] (03PS5) 10Btullis: Add a hadoop::spark35 profile and deploy it alongside hadoop::spark3 [puppet] - 10https://gerrit.wikimedia.org/r/1287855 
(https://phabricator.wikimedia.org/T338057) [09:28:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1082.eqiad.wmnet with reason: host reimage [09:29:09] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:29:49] !log Removing db2152.codfw.wmnet from orchestrator T424344 [09:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:52] T424344: decommission db2152.codfw.wmnet - https://phabricator.wikimedia.org/T424344 [09:30:50] (03CR) 10Btullis: [C:03+2] Add a hadoop::spark35 profile and deploy it alongside hadoop::spark3 [puppet] - 10https://gerrit.wikimedia.org/r/1287855 (https://phabricator.wikimedia.org/T338057) (owner: 10Btullis) [09:35:50] (03PS1) 10Btullis: Fix the conda-analytics-next prefix for spark35 [puppet] - 10https://gerrit.wikimedia.org/r/1288486 (https://phabricator.wikimedia.org/T338057) [09:36:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1002.eqiad.wmnet [09:37:45] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1086.eqiad.wmnet with reason: host reimage [09:37:52] !log blake@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[2332-2374].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [09:37:53] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: ResourceQuotaMemoryLimitsWarning - https://phabricator.wikimedia.org/T426589 (10LSobanski) 03NEW [09:38:02] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2332-2336].codfw.wmnet [09:38:52] (03CR) 10Btullis: [C:03+2] Fix the conda-analytics-next prefix for spark35 [puppet] - 10https://gerrit.wikimedia.org/r/1288486 (https://phabricator.wikimedia.org/T338057) (owner: 
10Btullis) [09:39:29] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2152.codfw.wmnet - https://phabricator.wikimedia.org/T424344#11930168 (10CWilliams-WMF) a:05CWilliams-WMF→03None [09:39:34] (03PS1) 10Sergio Gimeno: fix(signup.js): Do not warn about a username being available [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288487 (https://phabricator.wikimedia.org/T419401) [09:40:07] (03PS1) 10Marostegui: instances.yaml: Add pc2024 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1288488 (https://phabricator.wikimedia.org/T418973) [09:40:27] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: NodeTextfileStale (instance ganeti-test2003:9100) - https://phabricator.wikimedia.org/T424001#11930181 (10LSobanski) @MoritzMuehlenhoff I still see the alerts on https://alerts.wikimedia.org/triage/. Will they go away after the next... [09:40:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288487 (https://phabricator.wikimedia.org/T419401) (owner: 10Sergio Gimeno) [09:41:05] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2332-2336].codfw.wmnet [09:41:29] PROBLEM - Host ms-be1086 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1002.eqiad.wmnet [09:43:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1086.eqiad.wmnet with reason: host reimage [09:46:08] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2240 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1288489 (https://phabricator.wikimedia.org/T426590) [09:46:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for 
host ms-be1082.eqiad.wmnet with OS bullseye [09:46:31] RECOVERY - Host ms-be1086 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [09:46:38] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA, 13Patch-For-Review: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11930197 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for... [09:47:39] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2332-2336].codfw.wmnet [09:47:43] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2332-2336].codfw.wmnet [09:47:55] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2337-2341].codfw.wmnet [09:50:57] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2337-2341].codfw.wmnet [09:52:06] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 40 hosts with reason: Primary switchover s4 T426590 [09:52:09] T426590: Switchover s4 master (db2179 -> db2240) - https://phabricator.wikimedia.org/T426590 [09:52:19] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db2240 with weight 0 T426590', diff saved to https://phabricator.wikimedia.org/P92558 and previous config saved to /var/cache/conftool/dbconfig/20260518-095218-fceratto.json [09:57:14] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [09:57:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [09:57:39] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2337-2341].codfw.wmnet [09:57:43] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host 
wikikube-worker[2337-2341].codfw.wmnet [09:57:55] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2342-2346].codfw.wmnet [09:58:37] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2240 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1288489 (https://phabricator.wikimedia.org/T426590) (owner: 10Gerrit maintenance bot) [09:59:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1086.eqiad.wmnet with OS bullseye [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T1000) [10:00:09] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA, 13Patch-For-Review: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11930241 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for... [10:00:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:00:42] !log Starting s4 codfw failover from db2179 to db2240 - T426590 [10:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:45] T426590: Switchover s4 master (db2179 -> db2240) - https://phabricator.wikimedia.org/T426590 [10:00:58] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2342-2346].codfw.wmnet [10:02:03] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db2240 to s4 primary T426590', diff saved to https://phabricator.wikimedia.org/P92559 and previous config saved to /var/cache/conftool/dbconfig/20260518-100203-fceratto.json [10:05:11] (03PS1) 10Effie Mouzeli: site.pp add mc107[0-2]* memecached servers 
[puppet] - 10https://gerrit.wikimedia.org/r/1288493 (https://phabricator.wikimedia.org/T418263) [10:05:26] (03PS1) 10Kosta Harlan: hCaptcha: Drop addurl trigger and 100% passive mode SiteKey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288494 (https://phabricator.wikimedia.org/T426587) [10:07:10] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set correct weight T426590', diff saved to https://phabricator.wikimedia.org/P92560 and previous config saved to /var/cache/conftool/dbconfig/20260518-100710-fceratto.json [10:07:11] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [10:07:14] T426590: Switchover s4 master (db2179 -> db2240) - https://phabricator.wikimedia.org/T426590 [10:07:25] (03CR) 10Marostegui: "@cwilliams@wikimedia.org looks good?" [puppet] - 10https://gerrit.wikimedia.org/r/1288488 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [10:07:41] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2342-2346].codfw.wmnet [10:07:45] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2342-2346].codfw.wmnet [10:07:57] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2347-2351].codfw.wmnet [10:08:11] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [10:08:24] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [10:08:32] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2179 (T419635)', diff saved to https://phabricator.wikimedia.org/P92561 and previous config saved to /var/cache/conftool/dbconfig/20260518-100831-fceratto.json [10:08:35] 
T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [10:09:01] (03CR) 10Muehlenhoff: site.pp add mc107[0-2]* memecached servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1288493 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [10:09:23] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply [10:09:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply [10:11:00] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2347-2351].codfw.wmnet [10:11:10] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [10:11:51] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA, 13Patch-For-Review: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11930295 (10MatthewVernon) [10:12:18] (03CR) 10Dreamy Jazz: [C:03+1] "Seems fine to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288494 (https://phabricator.wikimedia.org/T426587) (owner: 10Kosta Harlan) [10:12:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install new line card in cr2-eqiad slot 0, move card from slot 1 to cr1-eqiad slot 0 and configure - https://phabricator.wikimedia.org/T426343#11930298 (10cmooney) >>! In T426343#11929433, @ayounsi wrote: > We have 2 new linecards comi... 
[10:14:10] (03PS1) 10MVernon: swift: restore 2 reimaged hosts, drain next 2 [puppet] - 10https://gerrit.wikimedia.org/r/1288496 (https://phabricator.wikimedia.org/T421719) [10:14:43] (03CR) 10CWilliams: [C:03+1] instances.yaml: Add pc2024 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1288488 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [10:14:50] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add pc2024 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1288488 (https://phabricator.wikimedia.org/T418973) (owner: 10Marostegui) [10:16:30] (03CR) 10Effie Mouzeli: [C:04-1] "there is an oops here" [puppet] - 10https://gerrit.wikimedia.org/r/1288493 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [10:17:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add pc2024 to dbctl T418973', diff saved to https://phabricator.wikimedia.org/P92562 and previous config saved to /var/cache/conftool/dbconfig/20260518-101714-marostegui.json [10:17:18] T418973: Productionize pc20[21-24] and pc10[21-24] - https://phabricator.wikimedia.org/T418973 [10:17:43] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2347-2351].codfw.wmnet [10:17:47] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2347-2351].codfw.wmnet [10:17:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add pc2024 to pc4 master T418973', diff saved to https://phabricator.wikimedia.org/P92563 and previous config saved to /var/cache/conftool/dbconfig/20260518-101749-marostegui.json [10:17:55] (03PS2) 10Effie Mouzeli: site.pp add mc107[0-2]* memecached servers [puppet] - 10https://gerrit.wikimedia.org/r/1288493 (https://phabricator.wikimedia.org/T418263) [10:17:59] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2352-2356].codfw.wmnet [10:18:19] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool pc2024: 
replacing hw [10:18:19] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [10:18:19] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.parsercache (exit_code=99) [10:18:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc2024: replacing hw [10:18:22] (03CR) 10Federico Ceratto: [C:03+1] "The change matches the summary" [puppet] - 10https://gerrit.wikimedia.org/r/1288496 (https://phabricator.wikimedia.org/T421719) (owner: 10MVernon) [10:18:31] (03CR) 10CI reject: [V:04-1] site.pp add mc107[0-2]* memecached servers [puppet] - 10https://gerrit.wikimedia.org/r/1288493 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [10:19:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc4', diff saved to https://phabricator.wikimedia.org/P92564 and previous config saved to /var/cache/conftool/dbconfig/20260518-101917-marostegui.json [10:21:01] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2352-2356].codfw.wmnet [10:22:47] (03PS1) 10Marostegui: pc2024: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1288498 [10:23:06] (03PS3) 10Effie Mouzeli: site.pp add mc107[0-2]* memecached servers [puppet] - 10https://gerrit.wikimedia.org/r/1288493 (https://phabricator.wikimedia.org/T418263) [10:25:01] (03CR) 10MVernon: [C:03+2] swift: restore 2 reimaged hosts, drain next 2 [puppet] - 10https://gerrit.wikimedia.org/r/1288496 (https://phabricator.wikimedia.org/T421719) (owner: 10MVernon) [10:25:05] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2152.codfw.wmnet - https://phabricator.wikimedia.org/T424344#11930352 (10CWilliams-WMF) Ready for DC-Ops [10:25:25] (03CR) 10Effie Mouzeli: site.pp add mc107[0-2]* memecached servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1288493 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [10:26:19] (03CR) 10Marostegui: [C:03+2] 
pc2024: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1288498 (owner: 10Marostegui) [10:27:49] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2352-2356].codfw.wmnet [10:27:51] (03PS1) 10Effie Mouzeli: mcrouter_wancache: add mc1070-mc1071 to production [puppet] - 10https://gerrit.wikimedia.org/r/1288500 (https://phabricator.wikimedia.org/T418263) [10:27:53] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2352-2356].codfw.wmnet [10:27:53] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[2332-2374].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [10:28:31] 06SRE, 06Infrastructure-Foundations, 10Mail, 06Product Safety and Integrity, and 2 others: yahoo rejecting our emails - https://phabricator.wikimedia.org/T426105#11930359 (10kostajh) [10:31:31] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11930367 (10MatthewVernon) [10:33:35] !log blake@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[2001-2331].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [10:33:41] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2001-2002,2005].codfw.wmnet [10:34:24] (03PS1) 10Marostegui: installserver: Remove pc2024 [puppet] - 10https://gerrit.wikimedia.org/r/1288501 [10:35:24] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2001-2002,2005].codfw.wmnet [10:36:07] (03CR) 10Blake: [C:03+1] site.pp add mc107[0-2]* memecached servers [puppet] - 10https://gerrit.wikimedia.org/r/1288493 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [10:36:34] (03PS2) 
10Marostegui: installserver: Remove pc202[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/1288501 [10:37:03] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reboot-single for host idp-test2005.wikimedia.org [10:37:13] (03CR) 10Effie Mouzeli: [C:03+2] site.pp add mc107[0-2]* memecached servers [puppet] - 10https://gerrit.wikimedia.org/r/1288493 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [10:37:18] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host idp-test2005.wikimedia.org [10:37:33] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reboot-single for host idp-test2005.wikimedia.org [10:37:38] (03PS3) 10Marostegui: installserver: Remove pc202[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/1288501 [10:38:14] (03PS1) 10CWilliams: orchestrator: Add cwilliams to orchestrator PowerAuthUsers [puppet] - 10https://gerrit.wikimedia.org/r/1288503 (https://phabricator.wikimedia.org/T426596) [10:39:15] (03PS1) 10Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288504 [10:39:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288504 (owner: 10Matthias Mullie) [10:40:14] (03CR) 10Marostegui: [C:03+2] installserver: Remove pc202[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/1288501 (owner: 10Marostegui) [10:41:22] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2005.wikimedia.org [10:41:42] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reboot-single for host idp-test1005.wikimedia.org [10:42:27] (03CR) 10Marostegui: [C:03+1] orchestrator: Add cwilliams to orchestrator PowerAuthUsers [puppet] - 10https://gerrit.wikimedia.org/r/1288503 
(https://phabricator.wikimedia.org/T426596) (owner: 10CWilliams) [10:42:40] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2001-2002,2005].codfw.wmnet [10:42:43] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2001-2002,2005].codfw.wmnet [10:42:51] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2006,2011-2012].codfw.wmnet [10:43:42] FIRING: [42x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:44:54] (03PS2) 10CWilliams: orchestrator: Add cwilliams to orchestrator PowerAuthUsers [puppet] - 10https://gerrit.wikimedia.org/r/1288503 (https://phabricator.wikimedia.org/T426596) [10:45:10] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2006,2011-2012].codfw.wmnet [10:45:36] PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100% [10:45:40] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1005.wikimedia.org [10:45:59] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reboot-single for host idp2005.wikimedia.org [10:46:04] RECOVERY - Host sretest2004 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms [10:46:35] (03CR) 10CWilliams: [C:03+2] orchestrator: Add cwilliams to orchestrator PowerAuthUsers [puppet] - 10https://gerrit.wikimedia.org/r/1288503 (https://phabricator.wikimedia.org/T426596) (owner: 10CWilliams) [10:48:26] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2005.wikimedia.org [10:50:51] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on 
A:swift-fe [10:51:53] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2006,2011-2012].codfw.wmnet [10:51:55] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2006,2011-2012].codfw.wmnet [10:51:56] (03PS1) 10Slyngshede: IDP: Host switch-over [dns] - 10https://gerrit.wikimedia.org/r/1288508 [10:52:04] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2013-2015].codfw.wmnet [10:53:48] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2013-2015].codfw.wmnet [10:54:10] (03CR) 10Slyngshede: [C:03+2] IDP: Host switch-over [dns] - 10https://gerrit.wikimedia.org/r/1288508 (owner: 10Slyngshede) [10:54:29] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1070.eqiad.wmnet with OS bullseye [10:54:44] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1071.eqiad.wmnet with OS bullseye [10:54:58] !log slyngshede@dns1004 START - running authdns-update [10:55:18] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1072.eqiad.wmnet with OS bullseye [10:56:32] !log slyngshede@dns1004 END - running authdns-update [11:00:28] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2013-2015].codfw.wmnet [11:00:30] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reboot-single for host idp1005.wikimedia.org [11:00:31] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2013-2015].codfw.wmnet [11:00:40] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2016-2018].codfw.wmnet [11:02:24] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2016-2018].codfw.wmnet [11:03:50] 06SRE, 06Infrastructure-Foundations, 10netops: 
InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11930466 (10ayounsi) @Papaul did they provide a ETA for the fix? If not is it possible to ask them for any update ? [11:04:14] (03PS1) 10Clément Goubert: data.yaml: Change cgoubert ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1288515 [11:04:28] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1005.wikimedia.org [11:05:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5005.wikimedia.org [11:05:26] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reboot-single for host idm-test1001.wikimedia.org [11:06:17] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1070.eqiad.wmnet with reason: host reimage [11:06:30] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1071.eqiad.wmnet with reason: host reimage [11:06:46] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1072.eqiad.wmnet with reason: host reimage [11:09:08] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2016-2018].codfw.wmnet [11:09:11] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2016-2018].codfw.wmnet [11:09:18] FIRING: [3x] KubernetesCalicoDown: wikikube-worker2015.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:09:20] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2033-2035].codfw.wmnet [11:09:24] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idm-test1001.wikimedia.org [11:09:53] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reboot-single for host idm2001.wikimedia.org [11:10:03] RESOLVED: [5x] 
KubernetesCalicoDown: wikikube-worker2006.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:10:09] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1070.eqiad.wmnet with reason: host reimage [11:11:04] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2033-2035].codfw.wmnet [11:14:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5005.wikimedia.org [11:14:05] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idm2001.wikimedia.org [11:14:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1071.eqiad.wmnet with reason: host reimage [11:15:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4006.wikimedia.org [11:17:31] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2033-2035].codfw.wmnet [11:17:33] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2033-2035].codfw.wmnet [11:17:42] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2036-2038].codfw.wmnet [11:17:52] (03PS1) 10Slyngshede: IDM: host switch-over [dns] - 10https://gerrit.wikimedia.org/r/1288519 [11:17:52] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1072.eqiad.wmnet with reason: host reimage [11:18:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4006.wikimedia.org [11:19:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2003.codfw.wmnet [11:19:27] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host 
wikikube-worker[2036-2038].codfw.wmnet [11:19:43] (03CR) 10Slyngshede: [C:03+2] IDM: host switch-over [dns] - 10https://gerrit.wikimedia.org/r/1288519 (owner: 10Slyngshede) [11:19:51] !log slyngshede@dns1004 START - running authdns-update [11:21:25] !log slyngshede@dns1004 END - running authdns-update [11:23:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2003.codfw.wmnet [11:24:00] !log slyngshede@cumin1003 START - Cookbook sre.hosts.reboot-single for host idm1001.wikimedia.org [11:24:05] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1070.eqiad.wmnet with OS bullseye [11:26:39] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2036-2038].codfw.wmnet [11:26:42] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2036-2038].codfw.wmnet [11:26:51] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2039,2041-2042].codfw.wmnet [11:27:58] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idm1001.wikimedia.org [11:29:13] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2039,2041-2042].codfw.wmnet [11:30:07] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1071.eqiad.wmnet with OS bullseye [11:30:16] (03PS1) 10Raymond Ndibe: write_replica_cnf: pass data via standard input [puppet] - 10https://gerrit.wikimedia.org/r/1288521 (https://phabricator.wikimedia.org/T424209) [11:30:52] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:32:53] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1072.eqiad.wmnet with OS bullseye [11:36:13] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2039,2041-2042].codfw.wmnet [11:36:16] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2039,2041-2042].codfw.wmnet [11:36:26] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2044,2046,2049].codfw.wmnet [11:38:11] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2044,2046,2049].codfw.wmnet [11:38:37] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1072.eqiad.wmnet with OS bookworm [11:39:23] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1071.eqiad.wmnet with OS bookworm [11:39:56] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1070.eqiad.wmnet with OS bookworm [11:40:47] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1288522 (https://phabricator.wikimedia.org/T426600) [11:41:20] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1069.eqiad.wmnet with OS bookworm [11:42:36] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:42:52] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2024.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2021.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2023.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:42:52] PROBLEM - PyBal 
backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2024.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2021.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2023.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2019.codfw.wmnet are marked down but pooled ht [11:42:52] kitech.wikimedia.org/wiki/PyBal [11:42:54] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [11:42:54] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [11:42:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:08] (03PS1) 10Ayounsi: Add profile::server_depool policy for k8s hosts [puppet] - 10https://gerrit.wikimedia.org/r/1288524 (https://phabricator.wikimedia.org/T327300) [11:43:10] (03PS1) 10Ayounsi: Add depool policy for all insetup roles and ml_cache::storage [puppet] - 10https://gerrit.wikimedia.org/r/1288525 (https://phabricator.wikimedia.org/T327300) [11:43:12] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [11:43:12] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [11:43:12] PROBLEM - Swift https backend on 
ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [11:43:12] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [11:43:12] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Swift [11:43:12] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [11:43:12] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [11:43:13] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.200 second response time https://wikitech.wikimedia.org/wiki/Swift [11:43:20] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:43:20] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:43:22] PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:43:26] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [11:43:36] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:43:57] FIRING: ProbeDown: 
Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:44:10] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [11:44:54] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [11:45:02] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 7.866 second response time https://wikitech.wikimedia.org/wiki/Swift [11:45:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1003.eqiad.wmnet [11:45:10] (PS1) Raymond Ndibe: handle missing kubeconfig error in replica_cnf_backend [puppet] - https://gerrit.wikimedia.org/r/1288526 (https://phabricator.wikimedia.org/T424207) [11:45:10] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [11:45:12] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [11:45:12] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [11:45:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable 
[11:45:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:45:22] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2044,2046,2049].codfw.wmnet [11:45:25] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2044,2046,2049].codfw.wmnet [11:45:33] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2050-2051,2055].codfw.wmnet [11:46:12] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift [11:46:12] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Swift [11:46:12] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [11:46:12] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/Swift [11:46:14] (CR) Cathal Mooney: Add alerting for high/low optics power level (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1288462 (owner: Ayounsi) [11:46:22] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:46:28] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time 
https://wikitech.wikimedia.org/wiki/Swift [11:46:41] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s5 T426600 [11:46:45] T426600: Switchover s5 master (db2192 -> db2213) - https://phabricator.wikimedia.org/T426600 [11:46:53] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db2213 with weight 0 T426600', diff saved to https://phabricator.wikimedia.org/P92565 and previous config saved to /var/cache/conftool/dbconfig/20260518-114652-fceratto.json [11:47:10] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [11:47:10] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Swift [11:47:12] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 573 bytes in 0.889 second response time https://wikitech.wikimedia.org/wiki/Swift [11:47:12] RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 361 bytes in 0.250 second response time https://wikitech.wikimedia.org/wiki/Swift [11:47:16] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2050-2051,2055].codfw.wmnet [11:47:20] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:47:20] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:47:56] (CR) CI reject: [V:-1] handle missing kubeconfig error in replica_cnf_backend [puppet] - https://gerrit.wikimedia.org/r/1288526 (https://phabricator.wikimedia.org/T424207) (owner: Raymond Ndibe) [11:47:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) 
#page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:48:04] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 6.411 second response time https://wikitech.wikimedia.org/wiki/Swift [11:48:10] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [11:48:12] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [11:49:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1003.eqiad.wmnet [11:49:18] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 7.500 second response time https://wikitech.wikimedia.org/wiki/Swift [11:49:20] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 7.592 second response time https://wikitech.wikimedia.org/wiki/Swift [11:49:20] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 9.060 second response time https://wikitech.wikimedia.org/wiki/Swift [11:49:28] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [11:49:28] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 362 bytes in 1.063 second response time https://wikitech.wikimedia.org/wiki/Swift [11:49:30] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 3.683 second response time https://wikitech.wikimedia.org/wiki/Swift [11:49:36] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:49:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:49:58] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.232 second response time https://wikitech.wikimedia.org/wiki/Swift [11:50:06] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:50:08] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:50:12] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [11:50:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:50:14] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 3.056 second response time https://wikitech.wikimedia.org/wiki/Swift [11:50:20] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:50:36] (CR) Federico Ceratto: [C:+2] mariadb: Promote db2213 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1288522 (https://phabricator.wikimedia.org/T426600) (owner: Gerrit maintenance bot) [11:50:39] !log jiji@cumin1003 START - Cookbook 
sre.hosts.downtime for 2:00:00 on mc1072.eqiad.wmnet with reason: host reimage [11:50:53] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1071.eqiad.wmnet with reason: host reimage [11:51:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:51:16] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 4.503 second response time https://wikitech.wikimedia.org/wiki/Swift [11:51:18] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 5.203 second response time https://wikitech.wikimedia.org/wiki/Swift [11:51:18] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 7.203 second response time https://wikitech.wikimedia.org/wiki/Swift [11:51:20] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:51:20] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:51:22] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 9.249 second response time https://wikitech.wikimedia.org/wiki/Swift [11:51:49] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1070.eqiad.wmnet with reason: host reimage [11:52:02] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 3.713 second response time https://wikitech.wikimedia.org/wiki/Swift [11:52:02] !log Starting s5 codfw failover from db2192 to db2213 - T426600 [11:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:05] T426600: Switchover s5 
master (db2192 -> db2213) - https://phabricator.wikimedia.org/T426600 [11:52:12] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 362 bytes in 0.378 second response time https://wikitech.wikimedia.org/wiki/Swift [11:52:12] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [11:52:18] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 6.188 second response time https://wikitech.wikimedia.org/wiki/Swift [11:52:18] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 7.399 second response time https://wikitech.wikimedia.org/wiki/Swift [11:52:20] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 9.335 second response time https://wikitech.wikimedia.org/wiki/Swift [11:52:26] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [11:52:31] jouncebot: nowandnext [11:52:32] No deployments scheduled for the next 1 hour(s) and 7 minute(s) [11:52:32] In 1 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T1300) [11:52:36] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:52:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ... 
[11:52:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [11:53:05] !log fceratto@cumin1003 dbctl commit (dc=all): 'Promote db2213 to s5 primary T426600', diff saved to https://phabricator.wikimedia.org/P92566 and previous config saved to /var/cache/conftool/dbconfig/20260518-115304-fceratto.json [11:53:08] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:53:08] (PS1) Slyngshede: Fix alignment of reset password page [software/bitu] - https://gerrit.wikimedia.org/r/1288529 (https://phabricator.wikimedia.org/T425552) [11:53:10] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1069.eqiad.wmnet with reason: host reimage [11:53:12] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [11:53:12] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [11:53:14] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 3.967 second response time https://wikitech.wikimedia.org/wiki/Swift [11:53:16] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 4.685 second response time https://wikitech.wikimedia.org/wiki/Swift [11:53:22] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:53:26] RECOVERY - Swift https frontend on ms-fe2017 is OK: 
HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [11:53:26] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift [11:53:38] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:53:38] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:53:42] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2050-2051,2055].codfw.wmnet [11:53:45] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2050-2051,2055].codfw.wmnet [11:53:54] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2056-2058].codfw.wmnet [11:54:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin2003.codfw.wmnet [11:54:12] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 1.588 second response time https://wikitech.wikimedia.org/wiki/Swift [11:54:12] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1072.eqiad.wmnet with reason: host reimage [11:54:20] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:54:22] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:54:28] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [11:54:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:55:12] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [11:55:14] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [11:55:14] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [11:55:14] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [11:55:28] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:55:38] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2056-2058].codfw.wmnet [11:55:41] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1288494 (https://phabricator.wikimedia.org/T426587) (owner: Kosta Harlan) [11:55:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:56:14] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [11:56:14] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.216 second response time https://wikitech.wikimedia.org/wiki/Swift [11:56:14] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Swift [11:56:14] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift [11:56:14] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 362 bytes in 0.937 second response time https://wikitech.wikimedia.org/wiki/Swift [11:56:14] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [11:56:16] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 2.299 second response time https://wikitech.wikimedia.org/wiki/Swift [11:56:18] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 3.827 second response time https://wikitech.wikimedia.org/wiki/Swift [11:56:24] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:56:24] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:56:24] PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:56:28] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [11:56:28] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [11:56:28] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [11:56:34] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 5.245 second response time https://wikitech.wikimedia.org/wiki/Swift [11:56:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [11:57:02] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 3.486 second response time https://wikitech.wikimedia.org/wiki/Swift [11:57:14] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [11:57:14] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.211 second response time https://wikitech.wikimedia.org/wiki/Swift [11:57:28] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [11:57:38] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:58:10] PROBLEM - Swift https backend on ms-fe2009 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:58:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [11:58:14] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [11:58:16] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1070.eqiad.wmnet with reason: host reimage [11:58:24] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:58:34] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 7.114 second response time https://wikitech.wikimedia.org/wiki/Swift [11:59:02] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [11:59:18] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [11:59:18] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift [11:59:18] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Swift [11:59:18] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 
1.191 second response time https://wikitech.wikimedia.org/wiki/Swift [11:59:18] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [11:59:18] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.190 second response time https://wikitech.wikimedia.org/wiki/Swift [12:00:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2003.codfw.wmnet [12:00:16] PROBLEM - Swift https backend on ms-fe2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Swift [12:00:26] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:00:26] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:00:26] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:00:30] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 1.922 second response time https://wikitech.wikimedia.org/wiki/Swift [12:00:38] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:00:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:01:16] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 
0.182 second response time https://wikitech.wikimedia.org/wiki/Swift [12:01:18] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [12:01:18] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 2.689 second response time https://wikitech.wikimedia.org/wiki/Swift [12:01:26] RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 8.509 second response time https://wikitech.wikimedia.org/wiki/Swift [12:01:26] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:01:30] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [12:01:30] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [12:01:38] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:02:18] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 2.259 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1071.eqiad.wmnet with reason: host reimage [12:02:20] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 3.077 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:24] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2056-2058].codfw.wmnet [12:02:24] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 574 
bytes in 7.435 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:24] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 8.327 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:26] !log Ran `scap remove-patch --message-body 'Dropping patch already made public' /srv/patches/next/extensions/ConfirmEdit/01-T423840.patch` [12:02:27] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2056-2058].codfw.wmnet [12:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:30] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.283 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:35] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2059-2061].codfw.wmnet [12:02:36] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 7.372 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:03:16] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:16] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:16] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:18] RECOVERY - Swift https backend on ms-fe2020 
is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.210 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:18] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:24] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 6.909 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:30] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.190 second response time https://wikitech.wikimedia.org/wiki/Swift [12:03:30] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 1.851 second response time https://wikitech.wikimedia.org/wiki/Swift [12:04:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:04:18] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [12:04:18] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [12:04:18] PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Swift [12:04:19] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2059-2061].codfw.wmnet [12:04:34] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 6.121 
second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:16] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 362 bytes in 0.351 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:16] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:18] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 573 bytes in 0.534 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:18] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 362 bytes in 1.139 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:18] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:18] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:30] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [12:06:18] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [12:06:18] PROBLEM - Swift https backend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [12:06:38] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:06:38] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Swift [12:07:03] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [12:07:08] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1069.eqiad.wmnet with reason: host reimage [12:07:13] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:07:17] RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [12:07:27] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:07:31] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [12:07:33] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 3.429 second response time https://wikitech.wikimedia.org/wiki/Swift [12:07:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:08:03] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [12:08:19] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [12:08:19] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 503 Service Unavailable - 260 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/Swift [12:08:19] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [12:08:19] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Swift [12:08:19] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [12:08:27] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:08:29] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [12:09:02] SRE, Content-Transform-Team, ServiceOps new, Wikipedia-Android-App-Backlog (Android Release - FY2025-26): Investigate Code 414 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11930706 (Raine) [12:09:11] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 8.312 second response time https://wikitech.wikimedia.org/wiki/Swift [12:09:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:09:17] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.185 second response time 
https://wikitech.wikimedia.org/wiki/Swift [12:09:17] RECOVERY - Swift https backend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift [12:09:19] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 573 bytes in 0.841 second response time https://wikitech.wikimedia.org/wiki/Swift [12:09:19] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 573 bytes in 0.831 second response time https://wikitech.wikimedia.org/wiki/Swift [12:09:19] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [12:09:21] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 2.468 second response time https://wikitech.wikimedia.org/wiki/Swift [12:09:21] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 2.638 second response time https://wikitech.wikimedia.org/wiki/Swift [12:09:24] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:09:25] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 6.679 second response time https://wikitech.wikimedia.org/wiki/Swift [12:09:29] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [12:09:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet [12:09:35] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 6.887 second response time 
https://wikitech.wikimedia.org/wiki/Swift [12:10:09] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1072.eqiad.wmnet with OS bookworm [12:10:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:10:13] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:10:19] PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:03] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:19] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:19] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:19] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/Swift [12:11:31] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2059-2061].codfw.wmnet [12:11:33] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1288042 (https://phabricator.wikimedia.org/T426526) (owner: 
Hubaishan) [12:11:34] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2059-2061].codfw.wmnet [12:11:43] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2062,2064-2065].codfw.wmnet [12:11:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:12:19] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:19] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:19] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:19] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:29] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 362 bytes in 0.438 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:29] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.167 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:29] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.192 second response time https://wikitech.wikimedia.org/wiki/Swift [12:12:45] 
SRE, Infrastructure-Foundations: Could not resolve hostname bast5004.wikimedia.org: nodename nor servname provided, or not known - https://phabricator.wikimedia.org/T426488#11930726 (SLyngshede-WMF) Open→Invalid p:Triage→Low That host no longer exists and have been replaced by bast500... [12:13:07] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 4.146 second response time https://wikitech.wikimedia.org/wiki/Swift [12:13:12] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [12:13:19] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [12:13:19] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Swift [12:13:19] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [12:13:28] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2062,2064-2065].codfw.wmnet [12:13:31] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 1.399 second response time https://wikitech.wikimedia.org/wiki/Swift [12:13:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet [12:13:43] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:13:43] PROBLEM - Host sretest2006 is DOWN: PING CRITICAL - Packet loss = 100% [12:14:03] 
PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [12:14:11] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1070.eqiad.wmnet with OS bookworm [12:14:19] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 573 bytes in 0.449 second response time https://wikitech.wikimedia.org/wiki/Swift [12:14:19] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [12:14:19] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [12:14:21] PROBLEM - Swift https frontend on ms-fe2022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [12:14:23] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 4.281 second response time https://wikitech.wikimedia.org/wiki/Swift [12:14:24] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:14:28] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:15:03] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time 
https://wikitech.wikimedia.org/wiki/Swift [12:15:03] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Swift [12:15:09] SRE, Infrastructure-Foundations: Could not resolve hostname bast5004.wikimedia.org: nodename nor servname provided, or not known - https://phabricator.wikimedia.org/T426488#11930732 (SLyngshede-WMF) Bastion host map has been updated. [12:15:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [12:15:17] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [12:15:18] SRE, Infrastructure-Foundations: Could not resolve hostname bast5004.wikimedia.org: nodename nor servname provided, or not known - https://phabricator.wikimedia.org/T426488#11930733 (SLyngshede-WMF) Invalid→Resolved [12:15:21] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 2.572 second response time https://wikitech.wikimedia.org/wiki/Swift [12:15:21] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 3.423 second response time https://wikitech.wikimedia.org/wiki/Swift [12:15:29] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [12:15:39] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:16:01] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 31.76 ms [12:16:05] PROBLEM - Swift https backend 
on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [12:16:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:16:15] RECOVERY - Host sretest2006 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms [12:16:23] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 4.560 second response time https://wikitech.wikimedia.org/wiki/Swift [12:16:33] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 3.285 second response time https://wikitech.wikimedia.org/wiki/Swift [12:16:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:17:19] RECOVERY - Swift https backend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Swift [12:17:19] RECOVERY - Swift https frontend on ms-fe2022 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [12:17:19] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Swift [12:17:23] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 4.067 second response time https://wikitech.wikimedia.org/wiki/Swift [12:17:23] RECOVERY - Swift https frontend on ms-fe2013 is 
OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 4.861 second response time https://wikitech.wikimedia.org/wiki/Swift [12:17:29] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [12:17:29] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [12:17:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor-dev2001.codfw.wmnet [12:17:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:18:05] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.197 second response time https://wikitech.wikimedia.org/wiki/Swift [12:18:19] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [12:18:19] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [12:18:19] PROBLEM - Swift https frontend on ms-fe2024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/Swift [12:18:19] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift [12:18:21] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 2.137 second response time 
https://wikitech.wikimedia.org/wiki/Swift [12:18:45] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1071.eqiad.wmnet with OS bookworm [12:19:25] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 7.524 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:31] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:31] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 1.980 second response time https://wikitech.wikimedia.org/wiki/Swift [12:19:39] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:20:05] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 2.110 second response time https://wikitech.wikimedia.org/wiki/Swift [12:20:12] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [12:20:13] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 9.908 second response time https://wikitech.wikimedia.org/wiki/Swift [12:20:15] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T419635)', diff saved to https://phabricator.wikimedia.org/P92567 and previous config saved to /var/cache/conftool/dbconfig/20260518-122014-fceratto.json [12:20:18] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:20:19] PROBLEM - Swift https backend on ms-fe2024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 
260 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/Swift [12:20:19] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift [12:20:25] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 6.715 second response time https://wikitech.wikimedia.org/wiki/Swift [12:20:27] RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 8.056 second response time https://wikitech.wikimedia.org/wiki/Swift [12:20:29] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 9.932 second response time https://wikitech.wikimedia.org/wiki/Swift [12:20:29] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:20:42] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2062,2064-2065].codfw.wmnet [12:20:45] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2062,2064-2065].codfw.wmnet [12:20:55] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2067-2069].codfw.wmnet [12:21:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:21:19] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:21:19] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes 
in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [12:21:19] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [12:21:21] (CR) Btullis: [C:+2] Add max-batches option to cap the size of a wikibase RDF dump. [dumps] - https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) (owner: Lerickson) [12:21:37] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 7.964 second response time https://wikitech.wikimedia.org/wiki/Swift [12:21:37] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 8.510 second response time https://wikitech.wikimedia.org/wiki/Swift [12:21:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor-dev2001.codfw.wmnet [12:21:59] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:21:59] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:22:03] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet [12:22:13] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:22:19] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:19] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time 
https://wikitech.wikimedia.org/wiki/Swift [12:22:19] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:19] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:19] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:19] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:19] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:20] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:20] RECOVERY - Swift https backend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:21] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:21] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:29] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:29] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [12:22:32] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host mc1069.eqiad.wmnet with OS bookworm [12:22:37] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2067-2069].codfw.wmnet [12:22:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:23:03] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [12:23:19] RECOVERY - Swift https frontend on ms-fe2024 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [12:23:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:19] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [12:24:53] ops-eqiad, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Follow up on multiple RAID / drive issues - https://phabricator.wikimedia.org/T426610 (Gehel) NEW [12:25:44] ops-eqiad, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Follow up on multiple RAID / drive issues - https://phabricator.wikimedia.org/T426610#11930788 (Gehel) [12:25:47] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11930791 (Gehel) [12:25:48] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-04-24 - 
2026-05-15): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11930790 (Gehel) [12:25:51] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T424654#11930789 (Gehel) [12:25:57] ops-eqiad, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11930792 (Gehel) [12:26:03] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11930793 (Gehel) [12:26:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet [12:26:27] PROBLEM - Host sretest1005 is DOWN: PING CRITICAL - Packet loss = 100% [12:26:51] RESOLVED: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:27:04] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11930797 (Gehel) Open→Resolved Follow up work is happening on sub task (T426610) [12:27:06] ops-eqiad, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11930803 (Gehel) Open→Resolved Follow up work is happening on sub task (T426610) [12:27:22] (CR) Btullis: Presto memory tuning, resource groups (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: Aleksandar Mastilovic) [12:27:22] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812#11930810 
(Gehel) Open→Resolved Follow up work is happening on sub task (T426610) [12:27:32] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11930816 (Gehel) Open→Resolved Follow up work is happening on sub task (T426610) [12:27:41] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T424654#11930822 (Gehel) Open→Resolved Follow up work is happening on sub task (T426610) [12:27:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet [12:27:51] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11930828 (Gehel) Open→Resolved Follow up work is happening on sub task (T426610) [12:27:57] RECOVERY - Host sretest1005 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [12:28:47] (CR) Btullis: [C:+1] opensearch: move pki::get_cert call into profile module [puppet] - https://gerrit.wikimedia.org/r/1280788 (https://phabricator.wikimedia.org/T424204) (owner: Cwhite) [12:30:23] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P92568 and previous config saved to /var/cache/conftool/dbconfig/20260518-123022-fceratto.json [12:31:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet [12:31:53] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2067-2069].codfw.wmnet [12:31:56] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2067-2069].codfw.wmnet [12:31:59] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-04-24 - 2026-05-15): Missing physical volume on an-worker1159 - 
https://phabricator.wikimedia.org/T419129#11930846 (10Jclark-ctr) a:05BTullis→03Jclark-ctr [12:32:05] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2070-2072].codfw.wmnet [12:32:18] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1288529 (https://phabricator.wikimedia.org/T425552) (owner: 10Slyngshede) [12:32:28] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11930849 (10Jclark-ctr) a:05BTullis→03Jclark-ctr [12:32:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ... [12:32:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [12:33:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-04-24 - 2026-05-15): Degraded RAID on an-worker1205 - https://phabricator.wikimedia.org/T422317#11930851 (10Jclark-ctr) a:05brouberol→03Jclark-ctr [12:34:14] (03PS2) 10Raymond Ndibe: handle missing kubeconfig error in replica_cnf_backend [puppet] - 10https://gerrit.wikimedia.org/r/1288526 (https://phabricator.wikimedia.org/T424207) [12:34:23] (03CR) 10Btullis: [C:03+1] "Agreed. Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1240730 (owner: 10Muehlenhoff) [12:34:27] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2070-2072].codfw.wmnet [12:34:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast7002.wikimedia.org [12:35:51] (03CR) 10Btullis: [C:03+1] Revert^2 "zramswap: notify service on config change" [puppet] - 10https://gerrit.wikimedia.org/r/1218805 (owner: 10CDanis) [12:36:17] (03CR) 10CI reject: [V:04-1] handle missing kubeconfig error in replica_cnf_backend [puppet] - 10https://gerrit.wikimedia.org/r/1288526 (https://phabricator.wikimedia.org/T424207) (owner: 10Raymond Ndibe) [12:38:26] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [12:40:30] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P92569 and previous config saved to /var/cache/conftool/dbconfig/20260518-124030-fceratto.json [12:40:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast7002.wikimedia.org [12:41:12] (03CR) 10Slyngshede: [C:03+2] Fix alignment of reset password page [software/bitu] - 10https://gerrit.wikimedia.org/r/1288529 (https://phabricator.wikimedia.org/T425552) (owner: 10Slyngshede) [12:41:47] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2070-2072].codfw.wmnet [12:41:50] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2070-2072].codfw.wmnet [12:41:58] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2073-2075].codfw.wmnet [12:43:32] (03PS1) 10Atsuko: admin_ng/dse-k8s: add eventstreams-internal to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288832 
(https://phabricator.wikimedia.org/T348763) [12:43:43] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2073-2075].codfw.wmnet [12:45:47] (03PS1) 10Atsuko: deployment_server: add eventstreams-internal to dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1288502 (https://phabricator.wikimedia.org/T348763) [12:45:54] (03Merged) 10jenkins-bot: Fix alignment of reset password page [software/bitu] - 10https://gerrit.wikimedia.org/r/1288529 (https://phabricator.wikimedia.org/T425552) (owner: 10Slyngshede) [12:47:01] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1288835 (owner: 10L10n-bot) [12:47:47] (03CR) 10Btullis: [C:03+1] deployment_server: add eventstreams-internal to dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1288502 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [12:48:00] (03CR) 10Btullis: [C:03+1] admin_ng/dse-k8s: add eventstreams-internal to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288832 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [12:48:40] (03CR) 10Atsuko: [C:03+2] deployment_server: add eventstreams-internal to dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1288502 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [12:49:42] (03CR) 10Raymond Ndibe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1288526 (https://phabricator.wikimedia.org/T424207) (owner: 10Raymond Ndibe) [12:50:31] (03PS3) 10Raymond Ndibe: handle missing kubeconfig error in replica_cnf_backend [puppet] - 10https://gerrit.wikimedia.org/r/1288526 (https://phabricator.wikimedia.org/T424207) [12:50:36] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2073-2075].codfw.wmnet [12:50:39] !log fceratto@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db2179 (T419635)', diff saved to https://phabricator.wikimedia.org/P92570 and previous config saved to /var/cache/conftool/dbconfig/20260518-125038-fceratto.json [12:50:39] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2073-2075].codfw.wmnet [12:50:42] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [12:50:52] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2076-2078].codfw.wmnet [12:51:31] (03PS2) 10Muehlenhoff: Remove analytics::cluster_packages spec test [puppet] - 10https://gerrit.wikimedia.org/r/1240730 [12:51:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow7002.magru.wmnet [12:52:20] (03CR) 10Atsuko: [C:03+2] admin_ng/dse-k8s: add eventstreams-internal to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288832 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [12:53:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288494 (https://phabricator.wikimedia.org/T426587) (owner: 10Kosta Harlan) [12:55:36] (03Merged) 10jenkins-bot: hCaptcha: Drop addurl trigger and 100% passive mode SiteKey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288494 (https://phabricator.wikimedia.org/T426587) (owner: 10Kosta Harlan) [12:55:57] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1288494|hCaptcha: Drop addurl trigger and 100% passive mode SiteKey (T426587)]] [12:55:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow7002.magru.wmnet [12:56:05] T426587: hCaptcha: API edits are incorrectly using the always-challenge mode for the addurl action - https://phabricator.wikimedia.org/T426587 [12:57:06] !log blake@cumin1003 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2076-2078].codfw.wmnet [12:57:16] (03PS2) 10Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288504 [12:58:24] (03PS1) 10Marostegui: mariadb: Change link for troubleshooing lag. [puppet] - 10https://gerrit.wikimedia.org/r/1288845 [12:58:33] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11930998 (10Papaul) @ayounsi no ETA was given to me but yes i can can follow up with them. [12:59:30] (03CR) 10LSobanski: [C:03+1] mariadb: Change link for troubleshooing lag. [puppet] - 10https://gerrit.wikimedia.org/r/1288845 (owner: 10Marostegui) [12:59:45] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1288494|hCaptcha: Drop addurl trigger and 100% passive mode SiteKey (T426587)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:00:03] (03CR) 10Ladsgroup: [C:03+1] mariadb: Change link for troubleshooing lag. [puppet] - 10https://gerrit.wikimedia.org/r/1288845 (owner: 10Marostegui) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T1300). [13:00:04] Daimona, Sergi0, matthiasmullie, kostajh, and hubaishan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:12] o/ [13:00:20] o/ [13:00:21] o/ [13:00:22] I’m finishing up my deploy now [13:00:49] !log kharlan@deploy1003 kharlan: Continuing with deployment [13:00:58] So we can scratch that one off the list :) [13:01:11] When https://spiderpig.wikimedia.org/jobs/2016 is done, please feel free to start the other patches [13:01:38] (03Merged) 10jenkins-bot: admin_ng/dse-k8s: add eventstreams-internal to dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288832 (https://phabricator.wikimedia.org/T348763) (owner: 10Atsuko) [13:01:58] alright, I’ll start gate-and-submit for sergi0 already [13:02:07] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288487 (https://phabricator.wikimedia.org/T419401) (owner: 10Sergio Gimeno) [13:02:39] perfect, ty Lucas_WMDE [13:02:43] sergi0: do you want to deploy your change yourself or should I? [13:03:22] If you can go ahead and I'll test 🙏 [13:03:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [13:03:26] ok [13:03:44] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2076-2078].codfw.wmnet [13:03:47] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2076-2078].codfw.wmnet [13:03:56] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2087-2089].codfw.wmnet [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:42] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2087-2089].codfw.wmnet [13:06:40] o/ (sorry was AFK) [13:07:13] hi 
Daimona [13:07:25] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1288494|hCaptcha: Drop addurl trigger and 100% passive mode SiteKey (T426587)]] (duration: 11m 27s) [13:07:28] T426587: hCaptcha: API edits are incorrectly using the always-challenge mode for the addurl action - https://phabricator.wikimedia.org/T426587 [13:07:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [13:07:38] BTW I will need a deployer for my change, but then I can run the queries myself [13:07:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288487 (https://phabricator.wikimedia.org/T419401) (owner: 10Sergio Gimeno) [13:07:42] ok [13:07:52] deploying sergi0 first, then Daimona, then matthiasmullie [13:07:59] haven’t seen hubaishan yet [13:08:15] I am here [13:09:42] ok [13:09:47] (03Merged) 10jenkins-bot: fix(signup.js): Do not warn about a username being available [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288487 (https://phabricator.wikimedia.org/T419401) (owner: 10Sergio Gimeno) [13:10:07] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1288487|fix(signup.js): Do not warn about a username being available (T419401)]] [13:10:10] T419401: Add live username validation to mobile account creation form - https://phabricator.wikimedia.org/T419401 [13:11:50] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, sgimeno: Backport for [[gerrit:1288487|fix(signup.js): Do not warn about a username being available (T419401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:12:00] * sergi0 testing [13:12:52] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2087-2089].codfw.wmnet [13:12:55] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2087-2089].codfw.wmnet [13:13:05] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2090-2092].codfw.wmnet [13:13:37] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1068.eqiad.wmnet with OS bookworm [13:14:21] Lucas_WMDE: lgtm [13:14:47] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2090-2092].codfw.wmnet [13:15:17] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, sgimeno: Continuing with deployment [13:15:20] okay, thanks! [13:16:08] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1067.eqiad.wmnet with OS bookworm [13:16:29] (03CR) 10Lucas Werkmeister (WMDE): "For the record: the previous collation is `uppercase`." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288042 (https://phabricator.wikimedia.org/T426526) (owner: 10Hubaishan) [13:18:51] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:19:25] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1288487|fix(signup.js): Do not warn about a username being available (T419401)]] (duration: 09m 18s) [13:19:29] T419401: Add live username validation to mobile account creation form - https://phabricator.wikimedia.org/T419401 [13:19:40] alright, I’ll do hubaishan next [13:19:54] ready [13:19:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288042 (https://phabricator.wikimedia.org/T426526) (owner: 10Hubaishan) [13:19:59] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:21:05] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:21:17] (03CR) 10Muehlenhoff: [C:03+2] Remove analytics::cluster_packages spec test [puppet] - 10https://gerrit.wikimedia.org/r/1240730 (owner: 10Muehlenhoff) [13:21:24] (03Merged) 10jenkins-bot: [config] Set Category Collation for arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288042 (https://phabricator.wikimedia.org/T426526) (owner: 10Hubaishan) [13:21:26] (03CR) 10Ayounsi: [C:03+2] Add profile::server_depool policy for k8s hosts [puppet] - 10https://gerrit.wikimedia.org/r/1288524 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [13:21:32] (03CR) 10Ayounsi: [C:03+2] Add depool policy for all insetup roles and ml_cache::storage [puppet] - 10https://gerrit.wikimedia.org/r/1288525 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [13:21:35] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2090-2092].codfw.wmnet [13:21:36] (03PS1) 10Brouberol: global_config: add ldap-ro-{eqiad,codfw} to external-services [puppet] - 10https://gerrit.wikimedia.org/r/1288855 (https://phabricator.wikimedia.org/T420691) [13:21:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5003.eqsin.wmnet [13:21:38] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2090-2092].codfw.wmnet [13:21:38] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1288042|[config] Set Category Collation for arwikisource (T426526)]] [13:21:43] T426526: Change Category Collation in ar.wikisource.org - https://phabricator.wikimedia.org/T426526 [13:21:46] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2093-2095].codfw.wmnet [13:21:57] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[13:22:59] (03PS1) 10Brouberol: airflow-test-k8s: add the ldap-ro task pod external services policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288856 (https://phabricator.wikimedia.org/T420691) [13:23:06] (03CR) 10Btullis: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1288855 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [13:23:22] !log lucaswerkmeister-wmde@deploy1003 hubaishan, lucaswerkmeister-wmde: Backport for [[gerrit:1288042|[config] Set Category Collation for arwikisource (T426526)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:23:31] hubaishan: please test [13:23:32] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2093-2095].codfw.wmnet [13:23:43] (though I’m not sure if this can be completely tested before I run the maintenance script) [13:23:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5003.eqsin.wmnet [13:25:44] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1068.eqiad.wmnet with reason: host reimage [13:25:59] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1288855 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [13:27:23] hubaishan: are you testing the change? 
[13:27:40] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1067.eqiad.wmnet with reason: host reimage [13:27:52] Lucas_WMDE  no change do you run updateCollation.php [13:28:06] I’ll run that after the change is fully deployed [13:28:13] if it can’t be tested until then then let’s just roll it out [13:28:51] !log lucaswerkmeister-wmde@deploy1003 hubaishan, lucaswerkmeister-wmde: Continuing with deployment [13:29:24] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1068.eqiad.wmnet with reason: host reimage [13:29:57] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for ArthurTaylor - https://phabricator.wikimedia.org/T424317#11931112 (10Dzahn) @ArthurTaylor Ok, cool, glad to hear it works:) Let me set this to resolved then. If you run into anything let us know. [13:30:04] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for ArthurTaylor - https://phabricator.wikimedia.org/T424317#11931115 (10Dzahn) 05In progress→03Resolved [13:30:12] (03PS1) 10Andrew Bogott: Magnum: refactor to allow both magnum-cluster-api and heat driver [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) [13:30:43] (03CR) 10CI reject: [V:04-1] Magnum: refactor to allow both magnum-cluster-api and heat driver [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [13:30:51] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2093-2095].codfw.wmnet [13:30:54] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2093-2095].codfw.wmnet [13:30:56] (03CR) 10Brouberol: [C:03+2] global_config: add ldap-ro-{eqiad,codfw} to external-services [puppet] - 10https://gerrit.wikimedia.org/r/1288855 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [13:31:02] !log 
blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2102-2104].codfw.wmnet [13:31:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287026 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [13:32:15] ^ that patch can be bundled with something else (or I can sync it towards the end of the window) [13:32:35] (03PS2) 10Andrew Bogott: Magnum: refactor to allow both magnum-cluster-api and heat driver [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) [13:33:02] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1288042|[config] Set Category Collation for arwikisource (T426526)]] (duration: 11m 24s) [13:33:05] T426526: Change Category Collation in ar.wikisource.org - https://phabricator.wikimedia.org/T426526 [13:33:06] (03CR) 10CI reject: [V:04-1] Magnum: refactor to allow both magnum-cluster-api and heat driver [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [13:33:10] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1067.eqiad.wmnet with reason: host reimage [13:33:14] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: updateCollation arwikisource --previous-collation=uppercase # T426526 [13:33:25] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2218'] [13:33:35] * Lucas_WMDE checks how many pages arwikisource has [13:33:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2218'] [13:33:52] 287k… okay this maintenance script will take a moment [13:34:02] but not ages either, it’s already at 22k [13:34:14] 
!log tchin@deploy1003 Started deploy [analytics/refinery@ba10fca] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ba10fcad] [13:34:18] let’s deploy Daimona and kostajh in the meantime [13:34:32] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [13:34:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CampaignEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287895 (https://phabricator.wikimedia.org/T426002) (owner: 10Daimona Eaytoy) [13:34:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287026 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [13:35:20] it is OK [13:36:08] !log tchin@deploy1003 Finished deploy [analytics/refinery@ba10fca] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ba10fcad] (duration: 01m 54s) [13:36:10] (03Merged) 10jenkins-bot: .gitignore: Add /static/hcaptcha/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287026 (https://phabricator.wikimedia.org/T403829) (owner: 10Ahmon Dancy) [13:36:20] Lucas_WMDE: thanks! 
[13:36:51] !log tchin@deploy1003 Started deploy [analytics/refinery@ba10fca]: Regular analytics weekly train [analytics/refinery@ba10fcad] [13:36:58] (03PS3) 10Andrew Bogott: Magnum: refactor to allow both magnum-cluster-api and heat driver [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) [13:37:08] sheesh, Zuul ETA for that CampaignEvents change is 11 minutes… [13:37:17] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2102-2104].codfw.wmnet [13:39:34] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [13:40:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:40:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:41:14] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [13:41:37] !log updateCollation arwikisource for T426526 finished [13:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:42] T426526: Change Category Collation in ar.wikisource.org - https://phabricator.wikimedia.org/T426526 [13:41:57] !log tchin@deploy1003 Finished deploy [analytics/refinery@ba10fca]: Regular analytics weekly train [analytics/refinery@ba10fcad] (duration: 05m 05s) [13:41:58] (turns out it went all the way up to cl_from = 304000 so I guess the statistics page count is a bit too low) [13:42:07] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[13:42:09] !log tchin@deploy1003 Started deploy [analytics/refinery@ba10fca] (thin): Regular analytics weekly train THIN [analytics/refinery@ba10fcad] [13:42:44] (03PS2) 10Ayounsi: Add alerting for high/low optics power level [alerts] - 10https://gerrit.wikimedia.org/r/1288462 [13:44:02] (03PS4) 10Andrew Bogott: Magnum: refactor to allow both magnum-cluster-api and heat driver [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) [13:44:05] !log tchin@deploy1003 Finished deploy [analytics/refinery@ba10fca] (thin): Regular analytics weekly train THIN [analytics/refinery@ba10fcad] (duration: 01m 55s) [13:44:05] (03CR) 10Btullis: [C:03+1] airflow-test-k8s: add the ldap-ro task pod external services policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288856 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [13:44:24] (03CR) 10Ayounsi: Add alerting for high/low optics power level (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1288462 (owner: 10Ayounsi) [13:44:28] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2102-2104].codfw.wmnet [13:44:30] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2102-2104].codfw.wmnet [13:44:38] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2105-2107].codfw.wmnet [13:44:49] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1068.eqiad.wmnet with OS bookworm [13:44:54] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: add the ldap-ro task pod external services policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288856 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [13:45:42] (03Merged) 10jenkins-bot: Store uncomputed references delta as null, not 0 [extensions/CampaignEvents] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287895 
(https://phabricator.wikimedia.org/T426002) (owner: 10Daimona Eaytoy) [13:45:45] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Investigate db2218 crash - https://phabricator.wikimedia.org/T426383#11931273 (10Jhancock.wm) working on this now. the bios and the firmware were out of date. will see two reboots. [13:46:03] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1287895|Store uncomputed references delta as null, not 0 (T426002)]], [[gerrit:1287026|.gitignore: Add /static/hcaptcha/ (T403829)]] [13:46:08] T426002: Set references delta to null for existing events and until proper computation logic exists - https://phabricator.wikimedia.org/T426002 [13:46:08] T403829: hCaptcha: Self-host secure-api.js code - https://phabricator.wikimedia.org/T403829 [13:46:23] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2105-2107].codfw.wmnet [13:46:24] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [13:46:31] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Investigate db2218 crash - https://phabricator.wikimedia.org/T426383#11931281 (10Jhancock.wm) @Marostegui does this need to be depooled before i do this? [13:47:18] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Investigate db2218 crash - https://phabricator.wikimedia.org/T426383#11931283 (10Marostegui) @Jhancock.wm you can go for it anytime. [13:47:51] !log lucaswerkmeister-wmde@deploy1003 daimona, lucaswerkmeister-wmde, dancy: Backport for [[gerrit:1287895|Store uncomputed references delta as null, not 0 (T426002)]], [[gerrit:1287026|.gitignore: Add /static/hcaptcha/ (T403829)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:48:00] kostajh: I’m guessing nothing to test for your .gitignore change? 
[13:48:03] Daimona: please test :) [13:48:24] Doing [13:48:39] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1067.eqiad.wmnet with OS bookworm [13:50:10] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling reboot on A:thanos-fe [13:50:59] (03PS5) 10Andrew Bogott: Magnum: refactor to allow both magnum-cluster-api and heat driver [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) [13:51:29] I actually don't know if it's testable because processing happens via the jobqueue. You can go ahead and I'll test again once it's fully live [13:51:38] (Also, in a meeting, apologies for delays etc) [13:51:43] ok [13:51:44] !log lucaswerkmeister-wmde@deploy1003 daimona, lucaswerkmeister-wmde, dancy: Continuing with deployment [13:51:59] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [13:53:02] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2105-2107].codfw.wmnet [13:53:04] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2105-2107].codfw.wmnet [13:53:14] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2108-2110].codfw.wmnet [13:54:23] (03CR) 10Marostegui: [C:03+2] mariadb: Change link for troubleshooing lag. 
[puppet] - 10https://gerrit.wikimedia.org/r/1288845 (owner: 10Marostegui) [13:54:34] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1066.eqiad.wmnet with OS bookworm [13:54:48] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1065.eqiad.wmnet with OS bookworm [13:55:18] (03PS1) 10Brouberol: global_config: add gerrit.w.o to external-services [puppet] - 10https://gerrit.wikimedia.org/r/1288864 (https://phabricator.wikimedia.org/T420691) [13:55:34] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2108-2110].codfw.wmnet [13:56:00] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287895|Store uncomputed references delta as null, not 0 (T426002)]], [[gerrit:1287026|.gitignore: Add /static/hcaptcha/ (T403829)]] (duration: 09m 57s) [13:56:05] T426002: Set references delta to null for existing events and until proper computation logic exists - https://phabricator.wikimedia.org/T426002 [13:56:05] T403829: hCaptcha: Self-host secure-api.js code - https://phabricator.wikimedia.org/T403829 [13:56:08] (03PS1) 10Brouberol: airflow-test-k8s: add the gerrit task pod external services policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288865 (https://phabricator.wikimedia.org/T420691) [13:56:19] matthiasmullie: over to you… there’s not much time left in the window but I guess you could use the 30-minute gap before Test Kitchen Experiment Deployment Window [13:56:24] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1288864 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [13:56:49] Lucas_WMDE: thanks [13:57:18] Unless anyone objects in the next couple of minutes, I will continue with the last patch in this window (and thus slightly overrun the window) [13:57:39] I would say start now [13:57:54] between gate-and-submit and the l10n cache rebuild, it’ll take long enough already :/ 
[13:58:04] should be plenty of time to abort scap if anyone objects [13:58:11] aight let's go [13:58:14] (03PS6) 10Andrew Bogott: Magnum: refactor to allow both magnum-cluster-api and heat driver [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) [13:58:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:58:35] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [13:58:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288504 (owner: 10Matthias Mullie) [13:58:52] !log bking@deploy1003 Started deploy [wdqs/wdqs@e8fb00c]: 0.3.163 [14:00:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:01:57] (03Merged) 10jenkins-bot: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288504 (owner: 10Matthias Mullie) [14:02:15] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1288504|Squashed diff to master]] [14:02:18] 06SRE, 06Content-Transform-Team, 06ServiceOps new, 06Wikipedia-Android-App-Backlog (Android Release - FY2025-26): Investigate Code 414 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11931362 (10Raine) This appears to be 
a problem with the A... [14:03:20] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2108-2110].codfw.wmnet [14:03:23] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2108-2110].codfw.wmnet [14:03:31] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2111-2113].codfw.wmnet [14:05:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:05:31] !log mlitn@deploy1003 mlitn: Backport for [[gerrit:1288504|Squashed diff to master]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:05:44] Can I go ahead and run the query to fixup the data for T426002? [14:05:44] T426002: Set references delta to null for existing events and until proper computation logic exists - https://phabricator.wikimedia.org/T426002 [14:06:16] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1066.eqiad.wmnet with reason: host reimage [14:06:22] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1065.eqiad.wmnet with reason: host reimage [14:07:58] Daimona: what is the query [14:08:24] !log mlitn@deploy1003 mlitn: Continuing with deployment [14:08:56] The one in the task description, it's an UPDATE [14:08:57] (03CR) 10Btullis: [C:03+1] global_config: add gerrit.w.o to external-services [puppet] - 10https://gerrit.wikimedia.org/r/1288864 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [14:09:20] (03CR) 10Btullis: [C:03+1] airflow-test-k8s: add the gerrit task pod external services policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288865 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [14:09:21] !log jiji@cumin1003 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1066.eqiad.wmnet with reason: host reimage [14:09:41] emapping thumbsize of 0 to 2 in all group1 wikis (T376152) [14:09:42] T376152: Evaluate feasibility of deprecating (or limiting) user media size preferences - https://phabricator.wikimedia.org/T376152 [14:09:45] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2111-2113].codfw.wmnet [14:10:01] !log mapping thumbsize of 0 to 2 in all group1 wikis (T376152) [14:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:10:28] Daimona: Put a limit of 10 first [14:10:45] run it, make sure things look correct, then run it with limit of 2K a couple of times [14:11:31] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:sessionstore: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003 [14:12:15] Okay, makes sense, I'll do it in 10-20 minutes and ask if I need help. 
And update here ofc [14:12:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:13:23] (03CR) 10Brouberol: [C:03+2] global_config: add gerrit.w.o to external-services [puppet] - 10https://gerrit.wikimedia.org/r/1288864 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [14:13:25] sounds good [14:13:27] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1065.eqiad.wmnet with reason: host reimage [14:13:31] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: add the gerrit task pod external services policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288865 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [14:13:58] !log mlitn@deploy1003 Rolling back deployment [14:15:27] ^ Logstash checker failures. 
I don't think they're related, but given we're already outside deployment window, I'm rolling back [14:16:05] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2179: Repooling after switchover [14:16:10] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2179: Repooling after switchover [14:16:40] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2111-2113].codfw.wmnet [14:16:43] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2111-2113].codfw.wmnet [14:16:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on A:swift-fe [14:16:53] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2114-2115,2124].codfw.wmnet [14:17:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:18:11] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2192: Repooling after switchover [14:18:16] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2192: Repooling after switchover [14:18:36] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2114-2115,2124].codfw.wmnet [14:19:00] (03PS1) 10Matthias Mullie: Revert "Squashed diff to master" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288871 [14:19:20] (03CR) 10Matthias Mullie: [C:03+2] Revert "Squashed diff to master" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288871 (owner: 10Matthias Mullie) [14:20:31] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Investigate db2218 crash - https://phabricator.wikimedia.org/T426383#11931453 
(10Jhancock.wm) @Marostegui everythings updated. all yours! [14:22:12] (03Merged) 10jenkins-bot: Revert "Squashed diff to master" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288871 (owner: 10Matthias Mullie) [14:22:15] RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:22:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:22:20] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1288504|Squashed diff to master]] (duration: 20m 05s) [14:22:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:22:58] (03PS1) 10CWilliams: mariadb: Decommission db2150 [puppet] - 10https://gerrit.wikimedia.org/r/1288874 (https://phabricator.wikimedia.org/T424342) [14:23:14] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [14:24:07] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [14:24:11] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dsantamaria - https://phabricator.wikimedia.org/T426561#11931467 (10ATsay-WMF) I approve this request [14:25:00] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1066.eqiad.wmnet with OS bookworm [14:25:38] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2114-2115,2124].codfw.wmnet [14:25:39] (03CR) 10Cathal Mooney: [C:03+1] "Lgtm!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1288462 (owner: 10Ayounsi) [14:25:41] (03PS1) 10CWilliams: mariabd: Decomission db2151 [puppet] - 10https://gerrit.wikimedia.org/r/1288875 (https://phabricator.wikimedia.org/T424343) [14:25:41] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2114-2115,2124].codfw.wmnet [14:25:55] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2125-2127].codfw.wmnet [14:27:11] (03CR) 10Ayounsi: [C:03+2] Add alerting for high/low optics power level [alerts] - 10https://gerrit.wikimedia.org/r/1288462 (owner: 10Ayounsi) [14:27:38] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2125-2127].codfw.wmnet [14:28:25] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:28:53] (03Merged) 10jenkins-bot: Add alerting for high/low optics power level [alerts] - 10https://gerrit.wikimedia.org/r/1288462 (owner: 10Ayounsi) [14:29:47] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1065.eqiad.wmnet with OS bookworm [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T1430) [14:30:40] !log bking@deploy1003 Finished deploy [wdqs/wdqs@e8fb00c]: 0.3.163 (duration: 31m 47s) [14:31:15] (03PS1) 10Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288877 [14:31:34] !log jiji@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1328-1384].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [14:31:41] !log jiji@cumin1003 START - 
Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1328-1330].eqiad.wmnet [14:31:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288877 (owner: 10Matthias Mullie) [14:32:08] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:32:18] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:32:22] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2143.codfw.wmnet with reason: Depooled host, will be decommissioned [14:32:38] !log cwilliams@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2149.codfw.wmnet with reason: Depooled host, will be decommissioned [14:33:04] sorry, fixed [14:33:21] !log fceratto@cumin1003 dbctl commit (dc=all): 'Set db2192 weight', diff saved to https://phabricator.wikimedia.org/P92573 and previous config saved to /var/cache/conftool/dbconfig/20260518-143320-fceratto.json [14:33:24] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1328-1330].eqiad.wmnet [14:34:15] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2125-2127].codfw.wmnet [14:34:18] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2125-2127].codfw.wmnet [14:34:27] !log blake@cumin1003 START - Cookbook 
sre.k8s.pool-depool-node depool for host wikikube-worker[2128-2130].codfw.wmnet [14:34:55] (03CR) 10Bking: [C:03+1] opensearch: move pki::get_cert call into profile module [puppet] - 10https://gerrit.wikimedia.org/r/1280788 (https://phabricator.wikimedia.org/T424204) (owner: 10Cwhite) [14:36:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:36:12] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2128-2130].codfw.wmnet [14:37:08] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:37:18] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:38:28] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1328-1330].eqiad.wmnet [14:38:29] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1328-1330].eqiad.wmnet [14:38:38] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1331-1333].eqiad.wmnet [14:40:19] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1331-1333].eqiad.wmnet [14:41:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::a6e1:1a00:1a6f:d3a3 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:42:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling reboot on A:thanos-fe [14:43:12] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2128-2130].codfw.wmnet [14:43:15] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2128-2130].codfw.wmnet [14:43:24] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2131-2133].codfw.wmnet [14:43:42] FIRING: [42x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:44] !log Running queries to fixup data for T426002 [14:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:47] T426002: Set references delta to null for existing events and until proper computation logic exists - https://phabricator.wikimedia.org/T426002 [14:45:09] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2131-2133].codfw.wmnet [14:46:23] (03PS1) 10Btullis: [airflow-sre] Add a new cephfs for data transfer purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288881 [14:47:04] (03CR) 10Andrew Bogott: [C:03+2] Magnum: refactor to allow both magnum-cluster-api and heat driver [puppet] - 10https://gerrit.wikimedia.org/r/1288858 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [14:47:07] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1331-1333].eqiad.wmnet [14:47:08] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host 
wikikube-worker[1331-1333].eqiad.wmnet [14:47:13] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:sessionstore: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003 [14:47:17] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1334-1336].eqiad.wmnet [14:47:26] (03PS2) 10Btullis: [airflow-sre] Add a new cephfs PVC for data transfer purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288881 [14:48:04] !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-logging-codfw [14:48:53] (03PS1) 10Filippo Giunchedi: alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) [14:48:57] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1334-1336].eqiad.wmnet [14:49:27] (03CR) 10CI reject: [V:04-1] alerts: Add optional pre-deploy transformations [puppet] - 10https://gerrit.wikimedia.org/r/1288883 (https://phabricator.wikimedia.org/T424814) (owner: 10Filippo Giunchedi) [14:52:24] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2131-2133].codfw.wmnet [14:52:26] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2131-2133].codfw.wmnet [14:52:35] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2134-2136].codfw.wmnet [14:52:46] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dsantamaria - https://phabricator.wikimedia.org/T426561#11931649 (10SLyngshede-WMF) [14:53:47] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1287947 (owner: 10Bking) [14:54:20] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host 
wikikube-worker[2134-2136].codfw.wmnet [14:55:44] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1334-1336].eqiad.wmnet [14:55:46] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1334-1336].eqiad.wmnet [14:55:55] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1337-1339].eqiad.wmnet [14:57:34] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1337-1339].eqiad.wmnet [15:00:00] (03PS2) 10CWilliams: mariadb: Decomission db2151 [puppet] - 10https://gerrit.wikimedia.org/r/1288875 (https://phabricator.wikimedia.org/T424343) [15:01:23] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [15:01:38] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2134-2136].codfw.wmnet [15:01:41] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2134-2136].codfw.wmnet [15:01:50] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2137-2139].codfw.wmnet [15:01:52] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1005.eqiad.wmnet [15:01:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11931742 (10herron) I don't think so ` 'kafka-logging[12]00[1-5]': - partman/standard.cfg - partman/hwraid-1dev.cfg 'kafka-logging*':... 
[15:02:04] (03PS1) 10Herron: kafka-logging: set new hosts to raid10-4dev [puppet] - 10https://gerrit.wikimedia.org/r/1288891 (https://phabricator.wikimedia.org/T418929) [15:03:35] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2137-2139].codfw.wmnet [15:04:08] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1337-1339].eqiad.wmnet [15:04:09] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1337-1339].eqiad.wmnet [15:04:18] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1340-1342].eqiad.wmnet [15:05:46] (03PS1) 10Brouberol: airflow-test-k8s: add the prometheus-pushgateway task pod external services policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288894 (https://phabricator.wikimedia.org/T420691) [15:05:57] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1340-1342].eqiad.wmnet [15:06:16] (03PS2) 10Brouberol: airflow-test-k8s: add the prometheus-pushgateway task pod external services policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288894 (https://phabricator.wikimedia.org/T420691) [15:06:48] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1288894 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [15:07:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4003.ulsfo.wmnet [15:08:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1005.eqiad.wmnet [15:08:18] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2005.codfw.wmnet [15:09:06] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: add the prometheus-pushgateway task pod external services policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1288894 (https://phabricator.wikimedia.org/T420691) (owner: 10Brouberol) [15:10:50] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2137-2139].codfw.wmnet [15:10:52] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2137-2139].codfw.wmnet [15:10:59] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1340-1342].eqiad.wmnet [15:11:01] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1340-1342].eqiad.wmnet [15:11:01] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2140-2142].codfw.wmnet [15:11:02] !log herron@cumin1003 END (FAIL) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=99) rolling reboot on A:kafka-logging-codfw [15:11:09] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1343-1345].eqiad.wmnet [15:11:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4003.ulsfo.wmnet [15:12:46] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2140-2142].codfw.wmnet [15:12:48] !log jiji@cumin1003 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1343-1345].eqiad.wmnet [15:16:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2005.codfw.wmnet [15:16:37] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1006.eqiad.wmnet [15:17:49] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1343-1345].eqiad.wmnet [15:17:51] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1343-1345].eqiad.wmnet [15:18:00] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1346-1348].eqiad.wmnet [15:19:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3004.esams.wmnet [15:19:40] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1346-1348].eqiad.wmnet [15:20:04] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2140-2142].codfw.wmnet [15:20:06] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2140-2142].codfw.wmnet [15:20:15] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2143-2145].codfw.wmnet [15:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:22:01] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2143-2145].codfw.wmnet [15:23:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
netflow3004.esams.wmnet [15:25:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1006.eqiad.wmnet [15:25:13] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2006.codfw.wmnet [15:25:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:26:16] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1346-1348].eqiad.wmnet [15:26:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1346-1348].eqiad.wmnet [15:26:28] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1349-1351].eqiad.wmnet [15:28:09] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1349-1351].eqiad.wmnet [15:29:14] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2143-2145].codfw.wmnet [15:29:17] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2143-2145].codfw.wmnet [15:29:26] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2146-2148].codfw.wmnet [15:30:05] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T1530). 
[15:31:11] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2146-2148].codfw.wmnet [15:31:59] (PS1) Andrew Bogott: Magnum: re-enable heat driver in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1288904 [15:32:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2006.codfw.wmnet [15:32:37] (PS2) Andrew Bogott: Magnum: re-enable heat driver in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1288904 (https://phabricator.wikimedia.org/T393782) [15:32:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1007.eqiad.wmnet [15:33:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2003.codfw.wmnet [15:34:46] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1349-1351].eqiad.wmnet [15:34:48] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1349-1351].eqiad.wmnet [15:34:57] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1352-1354].eqiad.wmnet [15:35:06] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1288904 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott) [15:35:09] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1288904 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott) [15:36:39] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1352-1354].eqiad.wmnet [15:37:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2003.codfw.wmnet [15:37:54] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1063.eqiad.wmnet with OS bookworm [15:37:56] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage 
for host mc1064.eqiad.wmnet with OS bookworm [15:38:21] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2146-2148].codfw.wmnet [15:38:23] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2146-2148].codfw.wmnet [15:38:32] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2149-2151].codfw.wmnet [15:38:49] !log re-mapping thumbsize of 1 to 2 in all group0 wikis (T376152) [15:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:52] T376152: Evaluate feasibility of deprecating (or limiting) user media size preferences - https://phabricator.wikimedia.org/T376152 [15:40:16] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2149-2151].codfw.wmnet [15:41:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1007.eqiad.wmnet [15:41:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2007.codfw.wmnet [15:41:10] (CR) Andrew Bogott: [C:+2] Magnum: re-enable heat driver in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1288904 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott) [15:43:13] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1352-1354].eqiad.wmnet [15:43:15] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1352-1354].eqiad.wmnet [15:43:23] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1355-1357].eqiad.wmnet [15:44:01] (CR) Muehlenhoff: [C:+2] data.yaml: Change cgoubert ssh key [puppet] - https://gerrit.wikimedia.org/r/1288515 (owner: Clément Goubert) [15:44:16] (CR) Muehlenhoff: [C:+2] "Looks good and verified out of band, merging" [puppet] 
- https://gerrit.wikimedia.org/r/1288515 (owner: Clément Goubert) [15:45:06] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1355-1357].eqiad.wmnet [15:46:01] (CR) Xcollazo: "I suspect the puppet job timed out because we need to declare a `Host:` field?" [puppet] - https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: A-pizzata) [15:46:26] SRE-tools, Infrastructure-Foundations, SRE Observability: sre.kafka.roll-restart-reboot-brokers: command-config is not a recognized option - https://phabricator.wikimedia.org/T426639 (herron) NEW [15:47:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2004.codfw.wmnet [15:47:32] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2149-2151].codfw.wmnet [15:48:27] !log blake@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[2149-2151].codfw.wmnet [15:48:36] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2152-2154].codfw.wmnet [15:48:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2007.codfw.wmnet [15:48:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1008.eqiad.wmnet [15:49:39] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1063.eqiad.wmnet with reason: host reimage [15:49:53] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1064.eqiad.wmnet with reason: host reimage [15:50:22] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2152-2154].codfw.wmnet [15:50:47] !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-logging-codfw [15:51:53] !log jiji@cumin1003 START - Cookbook 
sre.k8s.pool-depool-node pool for host wikikube-worker[1355-1357].eqiad.wmnet [15:51:55] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1355-1357].eqiad.wmnet [15:52:03] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1358-1360].eqiad.wmnet [15:52:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2004.codfw.wmnet [15:52:31] FIRING: [2x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1023:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:33] !log blake@cumin1003 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on P{wikikube-worker[2001-2331].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [15:53:44] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1358-1360].eqiad.wmnet [15:54:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1063.eqiad.wmnet with reason: host reimage [15:54:21] PROBLEM - SSH on logstash1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:54:21] PROBLEM - SSH on logstash1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:54:40] !log blake@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[2155-2331].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [15:55:07] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2155-2169].codfw.wmnet [15:56:02] !log mvernon@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host thanos-be1008.eqiad.wmnet [15:56:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2008.codfw.wmnet [15:57:16] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1064.eqiad.wmnet with reason: host reimage [15:57:31] FIRING: [4x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:00:15] RECOVERY - SSH on logstash1032 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:00:52] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1358-1360].eqiad.wmnet [16:00:53] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1358-1360].eqiad.wmnet [16:01:02] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1361-1363].eqiad.wmnet [16:02:31] FIRING: [4x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:02:43] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1361-1363].eqiad.wmnet [16:04:27] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2155-2169].codfw.wmnet [16:05:01] (PS1) JHathaway: add txt verification record for yahoo [dns] - https://gerrit.wikimedia.org/r/1288909 (https://phabricator.wikimedia.org/T426105) [16:05:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
thanos-be2008.codfw.wmnet [16:05:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1009.eqiad.wmnet [16:07:13] RECOVERY - SSH on logstash1023 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:09:18] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1361-1363].eqiad.wmnet [16:09:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1361-1363].eqiad.wmnet [16:09:24] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:28] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1364-1366].eqiad.wmnet [16:09:29] (CR) Dreamy Jazz: add txt verification record for yahoo (1 comment) [dns] - https://gerrit.wikimedia.org/r/1288909 (https://phabricator.wikimedia.org/T426105) (owner: JHathaway) [16:10:13] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1063.eqiad.wmnet with OS bookworm [16:11:10] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1364-1366].eqiad.wmnet [16:12:06] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1064.eqiad.wmnet with OS bookworm [16:12:31] RESOLVED: [4x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:13:02] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2155-2169].codfw.wmnet [16:13:07] 
(CR) JHathaway: [C:+1] "@cdanis@wikimedia.org looks good overall, any concerns with this running as root?" [puppet] - https://gerrit.wikimedia.org/r/1270971 (owner: CDanis) [16:13:11] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2155-2169].codfw.wmnet [16:13:40] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2170-2179,2184-2188].codfw.wmnet [16:14:16] (CR) CDanis: [C:-1] "Yeah actually especially given the usage of simdjson-go we should make this run as some other user, good point. I'll update the patch." [puppet] - https://gerrit.wikimedia.org/r/1270971 (owner: CDanis) [16:14:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1009.eqiad.wmnet [16:14:22] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2009.codfw.wmnet [16:14:24] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:18:56] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1364-1366].eqiad.wmnet [16:18:58] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1364-1366].eqiad.wmnet [16:19:06] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1367-1369].eqiad.wmnet [16:20:48] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1367-1369].eqiad.wmnet [16:22:23] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2170-2179,2184-2188].codfw.wmnet [16:22:30] !log mvernon@cumin2002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2009.codfw.wmnet [16:22:59] !log contint1003 - rebooting [16:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:15] (CR) JHathaway: add txt verification record for yahoo (1 comment) [dns] - https://gerrit.wikimedia.org/r/1288909 (https://phabricator.wikimedia.org/T426105) (owner: JHathaway) [16:25:51] PROBLEM - Host contint1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:17] !log people.wikimedia.org backend - rebooting [16:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:23] RECOVERY - Host contint1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:27:28] (Abandoned) Ladsgroup: mysql: Allow for multiinstance clone [cookbooks] - https://gerrit.wikimedia.org/r/1099668 (owner: Ladsgroup) [16:27:47] SRE, Content-Transform-Team, ServiceOps new, Wikipedia-Android-App-Backlog (Android Release - FY2025-26): Investigate Code 414 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11932371 (Raine) It appears that REST API redirects are... 
[16:28:11] (CR) Dreamy Jazz: add txt verification record for yahoo (1 comment) [dns] - https://gerrit.wikimedia.org/r/1288909 (https://phabricator.wikimedia.org/T426105) (owner: JHathaway) [16:28:24] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1367-1369].eqiad.wmnet [16:28:25] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1367-1369].eqiad.wmnet [16:28:32] !log contint2003 - new jenkins - reboot for kernel upgrade [16:28:34] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1370-1372].eqiad.wmnet [16:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:45] PROBLEM - Host contint2003 is DOWN: PING CRITICAL - Packet loss = 100% [16:31:13] RECOVERY - Host contint2003 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [16:31:25] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2170-2179,2184-2188].codfw.wmnet [16:31:34] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2170-2179,2184-2188].codfw.wmnet [16:31:57] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2189-2202].codfw.wmnet [16:32:52] !log zuul[12]00[123] / zuul* - rebooting [16:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:24] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:47] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1370-1372].eqiad.wmnet [16:37:36] !log herron@cumin1003 END (PASS) - Cookbook 
sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-logging-codfw [16:40:41] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2189-2202].codfw.wmnet [16:41:22] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1370-1372].eqiad.wmnet [16:41:23] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1370-1372].eqiad.wmnet [16:41:31] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1373-1374].eqiad.wmnet [16:42:39] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1373-1374].eqiad.wmnet [16:44:22] (PS1) Jasmine: kafka-main2006: apply host-level override in advance of trixie upgrade [0] [puppet] - https://gerrit.wikimedia.org/r/1288917 (https://phabricator.wikimedia.org/T419216) [16:47:08] (PS1) Jasmine: kafka-main2007: apply host-level override in advance of trixie upgrade [0] [puppet] - https://gerrit.wikimedia.org/r/1288918 (https://phabricator.wikimedia.org/T419216) [16:48:28] (PS1) Jasmine: kafka-main2008: apply host-level override in advance of trixie upgrade [0] [puppet] - https://gerrit.wikimedia.org/r/1288919 (https://phabricator.wikimedia.org/T419216) [16:49:09] !log jiji@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1373-1374].eqiad.wmnet [16:49:10] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1373-1374].eqiad.wmnet [16:49:10] !log jiji@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[1328-1384].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [16:50:00] (PS1) Jasmine: kafka-main2009: apply host-level override in advance of trixie upgrade [0] [puppet] - 
https://gerrit.wikimedia.org/r/1288920 (https://phabricator.wikimedia.org/T419216) [16:51:39] (PS1) Jasmine: kafka-main2010: apply host-level override in advance of trixie upgrade [0] [puppet] - https://gerrit.wikimedia.org/r/1288921 (https://phabricator.wikimedia.org/T419216) [16:54:46] SRE, Infrastructure-Foundations, Mail, Product Safety and Integrity, and 3 others: yahoo rejecting our emails - https://phabricator.wikimedia.org/T426105#11932496 (Ponor) For the record, a few more complaints on enwiki: [[https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#c-Obi2can... [16:55:52] herron@cumin1003 roll-restart-reboot-brokers (PID 747373) is awaiting input [16:57:09] (CR) Marostegui: [C:+1] mariadb: Decomission db2151 [puppet] - https://gerrit.wikimedia.org/r/1288875 (https://phabricator.wikimedia.org/T424343) (owner: CWilliams) [16:57:52] (CR) Marostegui: [C:+1] mariadb: Decommission db2150 [puppet] - https://gerrit.wikimedia.org/r/1288874 (https://phabricator.wikimedia.org/T424342) (owner: CWilliams) [16:59:01] (CR) Bking: [C:+2] bking: Replace my non-FIDO SSH key with a backup FIDO-backed key [puppet] - https://gerrit.wikimedia.org/r/1287947 (owner: Bking) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T1700) [17:00:05] ryankemper: OwO what's this, a deployment window?? Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T1700). 
nyaa~ [17:01:29] PROBLEM - Host wikikube-worker2190 is DOWN: PING CRITICAL - Packet loss = 100% [17:02:03] FIRING: [10x] KubernetesCalicoDown: wikikube-worker2190.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:03:59] PROBLEM - Host phab2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:35] RECOVERY - Host phab2002 is UP: PING OK - Packet loss = 0%, RTA = 33.07 ms [17:04:50] !log contint2002, phab2002 - rebooting [17:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:13] FIRING: [2x] ProbeDown: Service phab2002:25 has failed probes (tcp_phabricator_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab2002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:05:33] !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-logging-eqiad [17:06:45] PROBLEM - Host contint2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:56] (PS1) Ebernhardson: Include xff in search logs [extensions/CirrusSearch] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1288924 (https://phabricator.wikimedia.org/T407432) [17:07:13] RECOVERY - Host contint2002 is UP: PING OK - Packet loss = 0%, RTA = 30.45 ms [17:07:22] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [extensions/CirrusSearch] (wmf/1.47.0-wmf.2) - https://gerrit.wikimedia.org/r/1288924 (https://phabricator.wikimedia.org/T407432) (owner: Ebernhardson) [17:07:42] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host o11ytest1001.eqiad.wmnet [17:07:53] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host 
o11ytest2001.codfw.wmnet [17:08:14] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host mwlog2003.codfw.wmnet [17:10:07] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host webperf2003.codfw.wmnet [17:10:13] RESOLVED: [2x] ProbeDown: Service phab2002:25 has failed probes (tcp_phabricator_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#phab2002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:10:28] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host arclamp2001.codfw.wmnet [17:10:33] PROBLEM - Swift https backend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [17:10:33] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.189 second response time https://wikitech.wikimedia.org/wiki/Swift [17:10:33] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Swift [17:10:41] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:10:46] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on etherpad1004.eqiad.wmnet with reason: T426563 [17:10:55] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [17:10:55] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [17:10:57] FIRING: ProbeDown: 
Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:11:09] !log etherpad - rebooting backends [17:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:13] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [17:11:13] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/Swift [17:11:23] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:11:33] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [17:11:33] PROBLEM - Swift https frontend on ms-fe2023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [17:11:33] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [17:11:33] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2021.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal 
[17:11:33] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2023.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2022.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:11:35] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1062.eqiad.wmnet with OS bookworm [17:11:40] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1061.eqiad.wmnet with OS bookworm [17:11:41] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:11:41] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:11:57] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 4.312 second response time https://wikitech.wikimedia.org/wiki/Swift [17:12:00] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host o11ytest1001.eqiad.wmnet [17:12:01] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 7.818 second response time https://wikitech.wikimedia.org/wiki/Swift [17:12:13] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [17:12:33] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 1.810 second response time https://wikitech.wikimedia.org/wiki/Swift [17:12:39] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 7.270 second response time https://wikitech.wikimedia.org/wiki/Swift [17:13:09] !log restarted gnmic on netflow3004 as series missing for cr2-esams 
[17:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:13] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [17:13:19] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 6.038 second response time https://wikitech.wikimedia.org/wiki/Swift [17:13:31] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.212 second response time https://wikitech.wikimedia.org/wiki/Swift [17:13:31] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift [17:13:33] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:13:37] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 574 bytes in 5.868 second response time https://wikitech.wikimedia.org/wiki/Swift [17:13:41] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:13:48] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on doc1004.eqiad.wmnet with reason: T426563 [17:13:50] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host o11ytest2001.codfw.wmnet [17:13:54] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2003.codfw.wmnet [17:14:16] !log doc.wikimedia.org - rebooting backends [17:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:18] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on doc2003.codfw.wmnet with reason: T426563 [17:14:31] RECOVERY - Swift https frontend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 
200 OK - 360 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [17:14:31] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [17:14:31] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [17:14:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:14:41] PROBLEM - Swift https frontend on ms-fe2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:15:03] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:15:19] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog2003.codfw.wmnet [17:15:31] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [17:15:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ... 
[17:15:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [17:15:53] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [17:15:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:16:03] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:16:15] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp2001.codfw.wmnet [17:16:31] RECOVERY - Swift https backend on ms-fe2023 is OK: HTTP OK: HTTP/1.1 200 OK - 572 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [17:16:41] RECOVERY - Swift https frontend on ms-fe2021 is OK: HTTP OK: HTTP/1.1 200 OK - 363 bytes in 9.055 second response time https://wikitech.wikimedia.org/wiki/Swift [17:16:45] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host webperf1003.eqiad.wmnet [17:16:53] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 360 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [17:16:54] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host arclamp1001.eqiad.wmnet [17:18:32] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host mwlog1003.eqiad.wmnet [17:20:51] RESOLVED: 
TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, ... [17:20:51] 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=codfw+prometheus%2Fops&var-device=cr1-codfw:9804&var-interface=xe-1%2F1%2F1%3A0 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [17:21:00] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1003.eqiad.wmnet [17:22:23] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp1001.eqiad.wmnet [17:23:16] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host graphite2004.codfw.wmnet [17:23:16] (03CR) 10RLazarus: "FYI the tests at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/httpbb/app" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [17:23:24] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1062.eqiad.wmnet with reason: host reimage [17:23:30] (03PS2) 10Seddon: Rolling back the MediaProvenance strings from commons due to iOS app crashes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288925 [17:23:31] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1061.eqiad.wmnet with reason: host reimage [17:24:09] 06SRE, 06Infrastructure-Foundations, 10Mail, 06Product Safety and Integrity, and 3 others: yahoo rejecting our emails - https://phabricator.wikimedia.org/T426105#11932687 (10jhathaway) All mail to yahoo.com is currently being deferred, the current message is: ` 421 4.7.0 [TSS04] Messages from 208.80.154.5... 
[17:25:11] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog1003.eqiad.wmnet [17:27:32] (03CR) 10Ladsgroup: [C:04-2] "It can't be merged yet. I'm still updating values" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287441 (https://phabricator.wikimedia.org/T426328) (owner: 10Jdlrobson) [17:27:51] FIRING: [3x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [17:28:38] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1062.eqiad.wmnet with reason: host reimage [17:30:34] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2004.codfw.wmnet [17:30:46] jouncebot: nowandnext [17:30:46] For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T1700) [17:30:47] In 2 hour(s) and 29 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T2000) [17:31:32] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host graphite1005.eqiad.wmnet [17:32:07] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [17:32:51] RESOLVED: [3x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [17:32:59] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1061.eqiad.wmnet with reason: 
host reimage [17:36:09] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet [17:37:10] !incidents [17:37:51] !log stewards* - rebooting [17:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:25] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1005.eqiad.wmnet [17:38:33] (03PS1) 10Ladsgroup: swift: Insert the auth file on all frontend hosts [puppet] - 10https://gerrit.wikimedia.org/r/1288929 (https://phabricator.wikimedia.org/T379942) [17:39:50] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1288929 (https://phabricator.wikimedia.org/T379942) (owner: 10Ladsgroup) [17:40:01] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host grafana1002.eqiad.wmnet [17:43:48] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1062.eqiad.wmnet with OS bookworm [17:44:12] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [17:44:42] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host alert2002.wikimedia.org [17:44:43] !log herron@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2002.wikimedia.org [17:44:47] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana1002.eqiad.wmnet [17:45:28] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host alert2002.wikimedia.org [17:45:29] !log herron@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2002.wikimedia.org [17:45:58] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab2003.codfw.wmnet with reason: T426563 [17:46:17] !log rebooting alert2002 [17:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:24] !log dzahn@cumin2002 
DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab1005.eqiad.wmnet with reason: T426563 [17:48:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:48:37] (03PS2) 10Ladsgroup: swift: Insert the auth file on all frontend hosts [puppet] - 10https://gerrit.wikimedia.org/r/1288929 (https://phabricator.wikimedia.org/T379942) [17:48:57] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1061.eqiad.wmnet with OS bookworm [17:50:15] (03PS3) 10Ladsgroup: swift: Insert the auth file on all frontend hosts [puppet] - 10https://gerrit.wikimedia.org/r/1288929 (https://phabricator.wikimedia.org/T379942) [17:50:37] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1002.eqiad.wmnet [17:50:45] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet [17:51:13] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1288929 (https://phabricator.wikimedia.org/T379942) (owner: 10Ladsgroup) [17:51:42] 10ops-codfw, 06DC-Ops: Too low optic power on - pfw1-codfw:xe-7/2/0 (Core: cr2-codfw:xe-0/0/1:0 {#122503}) - https://phabricator.wikimedia.org/T426671 (10ayounsi) 03NEW p:05Triage→03Medium [17:53:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 10.192.16.35 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:56:41] !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-logging-eqiad [17:56:56] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1002.eqiad.wmnet [17:58:21] FIRING: SLOBudgetBurn: Search update lag is below 95% target in eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [17:59:29] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet [17:59:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/0/6 (Peering: Equinix (21958836-A 111916-DC6-IX-02, ... [17:59:51] MAC filter) {#2009}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr1-eqiad:9804&var-interface=xe-3%2F0%2F6 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [18:01:55] PROBLEM - Host logging-hd2001 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:30] !log Deployed patch for T426631 [18:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:55] PROBLEM - Host logging-sd2001 is DOWN: PING CRITICAL - Packet loss = 100% [18:03:21] FIRING: [3x] SLOBudgetBurn: Search update lag is below 95% target in eqiad - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [18:03:47] RECOVERY - Host logging-hd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms [18:04:17] !log blake@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker[2155-2331].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [18:04:23] RECOVERY - Host logging-sd2001 is UP: PING OK - Packet loss = 0%, RTA = 31.55 ms [18:06:02] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1003.eqiad.wmnet [18:06:38] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1004.eqiad.wmnet [18:07:14] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host kafkamon2003.codfw.wmnet [18:08:21] FIRING: [5x] SLOBudgetBurn: Search update lag is below 95% target in codfw - 
https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [18:11:15] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon2003.codfw.wmnet [18:11:31] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host kafkamon1003.eqiad.wmnet [18:13:21] FIRING: [6x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [18:13:25] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1004.eqiad.wmnet [18:13:50] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2002.codfw.wmnet [18:13:55] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon1003.eqiad.wmnet [18:14:35] PROBLEM - Host logging-hd2002 is DOWN: PING CRITICAL - Packet loss = 100% [18:14:41] PROBLEM - Host logging-sd2002 is DOWN: PING CRITICAL - Packet loss = 100% [18:14:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/0/6 (Peering: Equinix (21958836-A 111916-DC6-IX-02, ... [18:14:51] MAC filter) {#2009}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/throughput-network-device-interfaces?var-site=eqiad+prometheus%2Fops&var-device=cr1-eqiad:9804&var-interface=xe-3%2F0%2F6 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [18:15:13] (03CR) 10Bartosz Dziewoński: [C:03+1] "This basically reverts https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1269441, related to T414338." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288925 (owner: 10Seddon) [18:16:25] RECOVERY - Host logging-sd2002 is UP: PING OK - Packet loss = 0%, RTA = 32.99 ms [18:16:54] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on releases2003.codfw.wmnet with reason: T426563 [18:16:58] !log releases.wikimedia.org - rebooting backends [18:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:15] RECOVERY - Host logging-hd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [18:18:54] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.07 ms [18:19:02] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2189.codfw.wmnet [18:19:03] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2189.codfw.wmnet [18:20:18] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2002.codfw.wmnet [18:20:33] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2003.codfw.wmnet [18:22:34] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2191-2202].codfw.wmnet [18:22:42] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2191-2202].codfw.wmnet [18:23:23] !incidents [18:23:23] 7993 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [18:23:23] 7992 (RESOLVED) [3x] TransitPeeringTransportOutSaturation network sre (gnmi) [18:23:23] 7991 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [18:23:23] 7990 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service 
http_swift-https_ip4 codfw) [18:23:24] 7978 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [18:23:24] 7980 (RESOLVED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [18:23:24] 7988 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [18:23:24] 7987 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [18:23:24] 7986 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [18:23:25] 7985 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [18:23:25] 7984 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [18:23:26] 7981 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [18:23:27] 7983 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [18:23:27] 7982 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [18:23:27] 7979 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [18:23:28] 7977 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [18:23:28] 7974 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [18:23:29] 7976 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [18:23:29] 7975 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [18:23:30] 7973 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [18:23:55] PROBLEM - Host logging-hd2003 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:55] PROBLEM - Host logging-sd2003 is DOWN: PING CRITICAL - Packet loss = 100% [18:25:17] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:25:23] RECOVERY - Host 
logging-hd2003 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms [18:25:25] RECOVERY - Host logging-sd2003 is UP: PING OK - Packet loss = 0%, RTA = 32.01 ms [18:26:34] !log swfrench@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[2203-2331].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [18:26:47] !log rebooting alert1002 [18:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:55] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:30:57] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet [18:30:58] !log herron@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host centrallog1002.eqiad.wmnet [18:31:34] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2004.codfw.wmnet [18:31:36] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host gitlab-runner2004.codfw.wmnet [18:32:50] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2004.codfw.wmnet [18:32:51] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host gitlab-runner2004.codfw.wmnet [18:35:06] FIRING: [6x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [18:35:51] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2203-2215,2242].codfw.wmnet [18:38:21] FIRING: [6x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [18:38:55] !log dzahn@cumin2002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 0:15:00 on gitlab2003.wikimedia.org with reason: T426563 [18:41:33] (03PS1) 10Gergő Tisza: Add CommonsFinder to $wgUrlProtocols [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288953 (https://phabricator.wikimedia.org/T426614) [18:43:01] RECOVERY - Host gitlab-runner2004 is UP: PING OK - Packet loss = 0%, RTA = 33.06 ms [18:44:12] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:44:55] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2203-2215,2242].codfw.wmnet [18:45:05] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2203-2215,2242].codfw.wmnet [18:45:30] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2243,2248-2260].codfw.wmnet [18:45:55] FIRING: [42x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:46:02] FIRING: [6x] KubernetesCalicoDown: wikikube-worker2190.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:46:07] (03CR) 10JHathaway: [C:03+2] add txt verification record for yahoo [dns] - 10https://gerrit.wikimedia.org/r/1288909 (https://phabricator.wikimedia.org/T426105) (owner: 10JHathaway) [18:46:42] !log jhathaway@dns1004 START - running authdns-update [18:48:21] !log jhathaway@dns1004 END - running authdns-update [18:49:12] RESOLVED: JobUnavailable: Reduced availability 
for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:54:16] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2243,2248-2260].codfw.wmnet [19:02:24] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2243,2248-2260].codfw.wmnet [19:02:33] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2243,2248-2260].codfw.wmnet [19:03:04] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2261-2274].codfw.wmnet [19:06:02] FIRING: [4x] KubernetesCalicoDown: wikikube-worker2190.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:12:22] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2261-2274].codfw.wmnet [19:20:40] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2261-2274].codfw.wmnet [19:20:49] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2261-2274].codfw.wmnet [19:21:02] FIRING: [3x] KubernetesCalicoDown: wikikube-worker2190.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:21:13] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2275-2288].codfw.wmnet [19:21:45] (03PS3) 10Krinkle: Revert "Enable wgTrackMediaRequestProvenance on Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288925 
(https://phabricator.wikimedia.org/T414338) (owner: 10Seddon) [19:23:32] 10ops-codfw, 06SRE, 06DC-Ops: Too low optic power on - pfw1-codfw:xe-7/2/0 (Core: cr2-codfw:xe-0/0/1:0 {#122503}) - https://phabricator.wikimedia.org/T426671#11933193 (10cmooney) This shows it went bad on Sept 22nd last year, but I suspect that this was when the old firewalls were replaced, so it's always be... [19:23:39] (03PS4) 10Krinkle: Revert "Enable wgTrackMediaRequestProvenance on Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288925 (https://phabricator.wikimedia.org/T414338) (owner: 10Seddon) [19:29:54] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2275-2288].codfw.wmnet [19:34:46] 10ops-codfw, 06DC-Ops, 06ServiceOps new: wikikube-worker2190.codfw.wmnet failure at reboot - https://phabricator.wikimedia.org/T426683 (10Scott_French) 03NEW [19:35:15] PROBLEM - Host logstash2033 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:43] RECOVERY - Host logstash2033 is UP: PING OK - Packet loss = 0%, RTA = 33.28 ms [19:36:08] (03CR) 10Herron: [C:03+1] corto: set default visibility to WMF-NDA [puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) (owner: 10Hnowlan) [19:39:42] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2275-2288].codfw.wmnet [19:39:51] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2275-2288].codfw.wmnet [19:40:16] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2289-2302].codfw.wmnet [19:41:02] FIRING: [4x] KubernetesCalicoDown: wikikube-worker2190.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:41:25] FIRING: SystemdUnitFailed: 
wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:43:34] (03CR) 10Ladsgroup: "it might sound weird, but we still serve 404. I think I should eventually fix that but very likely (fingers crossed) it won't affect the t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [19:46:20] (03CR) 10RLazarus: "Oh yeah, of course! Never mind, thanks. :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288274 (https://phabricator.wikimedia.org/T129433) (owner: 10Ladsgroup) [19:48:27] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2289-2302].codfw.wmnet [19:49:33] (03CR) 10Bartosz Dziewoński: [C:03+1] Add CommonsFinder to $wgUrlProtocols [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288953 (https://phabricator.wikimedia.org/T426614) (owner: 10Gergő Tisza) [19:50:34] 10ops-esams, 06SRE, 06Commons, 06DC-Ops, and 3 others: ESAMS and others serving older revisions of overwritten files - https://phabricator.wikimedia.org/T425216#11933315 (10AlexisJazz) https://commons.wikimedia.org/w/index.php?title=Commons:Help_desk&diff=prev&oldid=1216270150 was quite possibly also this. 
[19:56:02] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2289-2302].codfw.wmnet [19:56:02] FIRING: [3x] KubernetesCalicoDown: wikikube-worker2190.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:56:10] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2289-2302].codfw.wmnet [19:56:17] FIRING: [5x] KubernetesCalicoDown: wikikube-worker2190.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:56:37] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2303-2316].codfw.wmnet [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T2000). [20:00:05] ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:19] only one, that's new :) I can ship it [20:00:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [extensions/CirrusSearch] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288924 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [20:02:13] PROBLEM - Host logstash2034 is DOWN: PING CRITICAL - Packet loss = 100% [20:02:43] RECOVERY - Host logstash2034 is UP: PING OK - Packet loss = 0%, RTA = 32.94 ms [20:03:21] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:05:06] FIRING: [10x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:05:20] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2303-2316].codfw.wmnet [20:07:39] PROBLEM - Host logstash2035 is DOWN: PING CRITICAL - Packet loss = 100% [20:08:21] FIRING: [15x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:09:43] RECOVERY - Host logstash2035 is UP: PING OK - Packet loss = 0%, RTA = 31.78 ms [20:10:00] (03Merged) 10jenkins-bot: Include xff in search logs [extensions/CirrusSearch] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288924 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [20:10:06] FIRING: [15x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:12:41] PROBLEM - Host logstash2037 is DOWN: PING CRITICAL - Packet loss = 100% [20:13:25] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2303-2316].codfw.wmnet [20:13:34] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2303-2316].codfw.wmnet [20:14:00] !log 
swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2317-2330].codfw.wmnet [20:14:20] Bleh. [20:14:36] spiderpig found un-deployed change in .2 from: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReaderExperiments/+/1288871 [20:14:43] RECOVERY - Host logstash2037 is UP: PING OK - Packet loss = 0%, RTA = 33.04 ms [20:14:50] ebernhardson: Normal protocol is to abort deploys until the person that broke prod fixes it. There's a patch scheduled for tomorrow to (?) fix it. [20:14:59] But waiting until tomorrow feels bad. [20:15:18] seems reasonable, this isn't a huge rush, it's collecting additional debug data to understand why something didn't work like it's supposed to. I can wait for tomorrow i suppose [20:15:26] :-( [20:16:02] FIRING: [3x] KubernetesCalicoDown: wikikube-worker2190.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:16:04] it doesn't look like spiderpig reverts the patch on cancel? [20:16:21] No. [20:16:27] ok, will do myself [20:16:30] <3 [20:16:33] I'll complain. [20:16:39] thanks! 
[20:16:57] (03PS1) 10Ebernhardson: Revert "Include xff in search logs" [extensions/CirrusSearch] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288976 [20:17:07] (03CR) 10Ebernhardson: [C:03+2] Revert "Include xff in search logs" [extensions/CirrusSearch] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288976 (owner: 10Ebernhardson) [20:17:41] (03CR) 10Jforrester: "This is not good; leaving production in (apparently) an undeployable state (presumably given the comment on the original squash patch, hen" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288877 (owner: 10Matthias Mullie) [20:22:50] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2317-2330].codfw.wmnet [20:28:00] 06SRE, 06Content-Transform-Team, 06ServiceOps new, 06Wikipedia-Android-App-Backlog: Investigate Code 414 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11933410 (10cooltey) [20:28:21] FIRING: [10x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:30:06] FIRING: [10x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:30:16] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2317-2330].codfw.wmnet [20:30:25] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2317-2330].codfw.wmnet [20:30:26] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[2203-2331].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [20:31:02] FIRING: [4x] KubernetesCalicoDown: wikikube-worker2190.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:32:15] PROBLEM - Host logstash2036 is DOWN: PING CRITICAL - Packet loss = 100% [20:32:43] RECOVERY - Host logstash2036 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [20:33:21] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:34:13] (03Merged) 10jenkins-bot: Revert "Include xff in search logs" [extensions/CirrusSearch] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288976 (owner: 10Ebernhardson) [20:35:06] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:35:39] ebernhardson: Let me know when you're done, I've got a late entry config change to (possibly) unbreak POTD in Wikipedia mobile app [20:35:46] Krinkle: i'm done [20:36:07] https://gerrit.wikimedia.org/r/1288976 just merged a second ago. Is that not being deployed? [20:36:10] Oh I see that's a revert. My bad. [20:36:22] Okay. 
[20:36:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288925 (https://phabricator.wikimedia.org/T414338) (owner: 10Seddon) [20:37:49] PROBLEM - Host logging-hd1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:38:21] FIRING: [15x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:39:01] PROBLEM - Host logging-sd1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:39:49] RECOVERY - Host logging-sd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [20:40:42] RECOVERY - Host logging-hd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [20:41:21] (03Merged) 10jenkins-bot: Revert "Enable wgTrackMediaRequestProvenance on Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288925 (https://phabricator.wikimedia.org/T414338) (owner: 10Seddon) [20:43:21] FIRING: [15x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:44:50] ebernhardson: eh. so what's the state of prod right now? I'm of course also pulling in undeployed changes right now. [20:45:03] Krinkle: doh, i should have mentioned ya wmf.2 isn't deployable [20:45:12] config changes go to all versions [20:45:23] we can't revert the undeployed patch as a no-op? [20:45:30] what is the undeployed patch [20:45:51] I'm not entirely sure, possibly? 
The patch is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReaderExperiments/+/1288871 [20:46:11] i suppose i thought config would only pull mw-config like when manually deploying, hadn't looked at that yet [20:46:35] (03PS1) 10Ebernhardson: Revert^2 "Include xff in search logs" [extensions/CirrusSearch] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288983 [20:46:39] yeah, not anymore since the switch from baremetal to kubernetes [20:46:51] we now build a complete docker image from wmf branches + config + security patches [20:47:00] and then ask the deployer to approve the diff [20:47:19] for me that diff includes: [20:47:19] +++ b/extensions/ReaderExperiments/resources/experiments/shareHighlight/components/ShareQuoteDialog.vue [20:47:19] - const summaryTitle = computed( () => ( props.open && needsSummary.value ? props.title : null ) ); [20:47:19] + const summaryTitle = computed( () => ( needsSummary.value ? props.title : null ) ); [20:48:15] Which suggests, at this point in time, the undeployed part is removing `props.open` from that line. This is what https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReaderExperiments/+/1288871 does, which is a "revert". [20:48:16] hmm, i hadn't got anything from spiderpig so ran `git log HEAD..origin/wmf/1.47.0-wmf.2` from php-1.47.0-wmf.2 to get that [20:49:07] This implies that the squash was deployed, but I'm guessing that's not actually true, and instead Scap is basing this diff on the last attempted deployment, not the last completed deployment. I don't know that for sure, but I'll see if I can verify what's actually deployed via Special:Version and then confirm via mwdebug/mwexperimental [20:49:36] https://en.wikipedia.org/wiki/Special:Version ReaderExperiments – (beaa2fe) "Squashed diff to master" suggests it was in fact deployed.
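The direction of the pending change quoted at [20:47:19] can be read off mechanically from its `-` and `+` lines. The following is a minimal Python sketch, assuming the diff is available as plain text. The function name `pending_changes` is hypothetical, and this is not how Scap itself computes or presents its diff.

```python
def pending_changes(diff_text: str) -> tuple[list[str], list[str]]:
    """Split a unified-diff fragment into (removed, added) content lines.

    File headers ('---' / '+++') are skipped. Caveat: a removed content
    line that itself starts with '--' would be mistaken for a header.
    """
    removed, added = [], []
    for line in diff_text.splitlines():
        if line.startswith(("---", "+++")):
            continue  # file header, not a content change
        if line.startswith("-"):
            removed.append(line[1:].strip())
        elif line.startswith("+"):
            added.append(line[1:].strip())
    return removed, added


# Diff fragment as quoted in the log at [20:47:19].
DIFF = """\
+++ b/extensions/ReaderExperiments/resources/experiments/shareHighlight/components/ShareQuoteDialog.vue
- const summaryTitle = computed( () => ( props.open && needsSummary.value ? props.title : null ) );
+ const summaryTitle = computed( () => ( needsSummary.value ? props.title : null ) );
"""

removed, added = pending_changes(DIFF)
# The pending (undeployed) change drops `props.open`: it appears only on
# the removed side of the diff.
drops_props_open = any("props.open" in l for l in removed) and not any(
    "props.open" in l for l in added
)
```

Here `drops_props_open` being true matches the reading given at [20:48:15]: the undeployed part is the revert, not the squash.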
[20:50:06] FIRING: [15x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:50:38] https://sal.toolforge.org/production?p=0&q=mlitn&d= says: [20:50:49] PROBLEM - SSH on logging-hd1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:51:13] PROBLEM - Host logging-sd1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:51:47] * 14:02 Started scap sync-world: Backport for [[gerrit:1288504|Squashed diff to master]] [20:51:47] * 14:05 Backport … synced to the testservers … Changes can now be verified there. [20:51:47] * 14:08 Continuing with deployment [20:51:47] * 14:14 Rolling back deployment [20:51:47] * 14:22 Finished scap sync-world [20:52:16] That's a confusing timeline. Does this mean scap "finished" the rollback? It doesn't look like it. It finished the actual deployment. [20:52:39] RECOVERY - SSH on logging-hd1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:52:41] RECOVERY - Host logging-sd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [20:52:52] yea i tried to look over it but also found it confusing what exactly was going on. [20:52:54] I'm guessing one of those was not written by scap but a manual !log [20:52:59] * Krinkle checks IRC log [20:53:21] FIRING: [13x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:54:08] !log mlitn@deploy1003 Rolling back deployment [20:54:08] ^ Logstash checker failures. I don't think they're related, but given we're already outside deployment window, I'm rolling back [20:54:37] OK, so definitely from scap. Weird. Well, whatever the case, it was in fact not rolled back. And we now have an undeployed "rollback" patch that was merged but not deployed. [20:54:48] I'll verify on mw-experimental to be sure but I think that's what we've got.
[20:55:06] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:57:04] I've gotta run, but it seems like you have a handle on this [20:57:08] thx [20:59:41] So my options now are: 1) Say Yes to scap, which will deploy the rollback that Matthias thought he completed and thus restores status quo from before today's botched deployment, but it does mean I'm making a change in ReaderExperiments. Or, 2) Prepare a revert-revert that re-applies today's "Squashed diff" patch, so that the wmf branch matches production. Then merge and deploy that revert-revert which will be a no-op because it [20:59:42] is already live in production. I don't have to actually prepare that patch because Matthias has done so already https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReaderExperiments/+/1288877 with the intention to deploy that tomorrow, because he believed it was rolled back and wants to re-deploy it, even though it was never rolled back. Or 3) Postpone unbreaking the iOS app and leave this mess overnight. [21:00:04] alexsanford, Reedy, sbassett, Maryum, and manfredi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T2100). [21:02:20] ``` [21:02:20] krinkle@wikikube-worker-exp1001:~$ fgrep summaryTitle /srv/mediawiki/php-1.47.0-wmf.2/extensions/ReaderExperiments/resources/experiments/shareHighlight/components/ShareQuoteDialog.vue [21:02:20] const summaryTitle = computed( () => ( props.open && needsSummary.value ? props.title : null ) ); [21:02:20] ```
[21:03:21] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:03:41] and https://en.wikipedia.org/w/extensions/ReaderExperiments/resources/experiments/shareHighlight/init.js indeed shows the content of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReaderExperiments/+/1288877/1/resources/experiments/shareHighlight/init.js including ` shareButton.style.display = 'none';` [21:04:17] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: wikikube-worker2190 System Configuration Check error - https://phabricator.wikimedia.org/T423175#11933480 (10Jhancock.wm) 05Open→03Declined got forced on another ticket. no need for this one [21:04:17] RECOVERY - Host wikikube-worker2190 is UP: PING OK - Packet loss = 0%, RTA = 33.04 ms [21:04:41] Also, to scap's credit, it was correctly diffing against what's really in prod. Nice. [21:05:03] Sorry for doubting you, Scap (and dancy, et al.). Great work. [21:05:15] OK, I think that's as confident as we can be that the patch is in fact deployed and so option 2 makes the most sense. Deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReaderExperiments/+/1288877 which is a no-op to sync wmf with what's actually deployed.
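The reasoning above (check whether the lines introduced by the squash patch are present in what production serves, then pick whichever option leaves production unchanged) can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `squash_is_live` and `noop_option` are hypothetical, the file contents are passed in as strings rather than read from a real host, and none of this is part of Scap or spiderpig.

```python
# Marker strings taken from the log: lines that exist only when the
# "Squashed diff to master" patch is live.
MARKERS = (
    "props.open && needsSummary.value",
    "shareButton.style.display = 'none';",
)


def squash_is_live(deployed_files: dict[str, str], markers=MARKERS) -> bool:
    """Return True if every marker occurs somewhere in the deployed files."""
    blob = "\n".join(deployed_files.values())
    return all(marker in blob for marker in markers)


def noop_option(live: bool) -> str:
    """Pick the option from the discussion that leaves production unchanged.

    Option 1: deploy the merged revert (a no-op only if the squash was
    really rolled back). Option 2: deploy the revert-revert (a no-op only
    if the squash is still live).
    """
    return "option 2" if live else "option 1"


# Contents as observed in the log (hypothetical in-memory stand-ins for
# the files on the experimental host and the public init.js URL).
observed = {
    "ShareQuoteDialog.vue": (
        "const summaryTitle = computed( () => "
        "( props.open && needsSummary.value ? props.title : null ) );"
    ),
    "init.js": "shareButton.style.display = 'none';",
}

choice = noop_option(squash_is_live(observed))
```

With the contents observed in the log, `choice` comes out as option 2, which agrees with the conclusion reached at [21:05:15].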
[21:05:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288877 (owner: 10Matthias Mullie) [21:08:21] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:09:11] PROBLEM - Host logging-sd1004 is DOWN: PING CRITICAL - Packet loss = 100% [21:09:13] PROBLEM - Host logging-hd1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:09:59] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gerrit1003.wikimedia.org with reason: T426563 [21:10:06] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:10:08] (03CR) 10CI reject: [V:04-1] Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288877 (owner: 10Matthias Mullie) [21:10:24] !log gerrit-replica.wikimedia.org, gerrit-spare.wikimedia.org - rebooting backends [21:10:25] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2190.codfw.wmnet failure at reboot - https://phabricator.wikimedia.org/T426683#11933487 (10Jhancock.wm) a:03Jhancock.wm HWC8010: The System Configuration Check operation resulted in the following issue: Comm Error: Backplane 0. UEFI011... 
[21:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:37] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gerrit2002.wikimedia.org with reason: T426563 [21:10:42] RECOVERY - Host logging-sd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [21:10:42] RECOVERY - Host logging-hd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [21:11:02] RESOLVED: KubernetesCalicoDown: wikikube-worker2190.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2190.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:13:21] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:14:18] !log dzahn@cumin2002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:15:00 on gerrit-replica.wikimedia.org with reason: T426563 [21:14:49] (03CR) 10JHathaway: [C:03+1] corto: set default visibility to WMF-NDA [puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) (owner: 10Hnowlan) [21:15:40] (03CR) 10Krinkle: "TLDR: This no-op now that syncs wmf branch with what's already deployed. The rollback was in fact never deployed, so patch which is intend" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288877 (owner: 10Matthias Mullie) [21:16:31] !log gerrit-replica.wikimedia.org back online [21:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:39] (03CR) 10Krinkle: [C:03+2] "Flaky CI?" 
[extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288877 (owner: 10Matthias Mullie) [21:17:32] (03CR) 10Dzahn: "Just had to reboot backends of gerrit-replica.wikimedia.org - it's possible this has affected you, sorry for the inconvenience" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288877 (owner: 10Matthias Mullie) [21:18:21] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:18:22] (03CR) 10Dzahn: [C:03+1] corto: set default visibility to WMF-NDA [puppet] - 10https://gerrit.wikimedia.org/r/1287424 (https://phabricator.wikimedia.org/T426137) (owner: 10Hnowlan) [21:19:06] (03CR) 10Dzahn: "recheck" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288877 (owner: 10Matthias Mullie) [21:20:06] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:23:34] (03Merged) 10jenkins-bot: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288877 (owner: 10Matthias Mullie) [21:24:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11933539 (10VRiley-WMF) @BCornwall is this okay to assign to you? [21:27:07] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2190.codfw.wmnet failure at reboot - https://phabricator.wikimedia.org/T426683#11933545 (10Scott_French) Thank you very much, @Jhancock.wm! Looks good - feel free to close this out. I'll repool the host shortly.
[21:27:37] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2190.codfw.wmnet failure at reboot - https://phabricator.wikimedia.org/T426683#11933546 (10Jhancock.wm) 05Open→03Resolved [21:27:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Revert lvs1017 Mellanox NIC to Broadcom - https://phabricator.wikimedia.org/T421421#11933547 (10BCornwall) a:05VRiley-WMF→03BCornwall Sure thing! [21:27:46] I'm waiting for the automatic submodule to happen... [21:28:21] FIRING: [13x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:28:21] OK, let's retry. The diff should now be empty [21:28:25] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2190.codfw.wmnet [21:28:26] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2190.codfw.wmnet [21:28:47] It is not an empty diff [21:29:23] But it is now the reverse diff. It asks whether I want to add this line: [21:29:55] +++ b/extensions/ReaderExperiments/resources/experiments/shareHighlight/components/ShareQuoteDialog.vue [21:29:55] + const summaryTitle = computed( () => ( props.open && needsSummary.value ? props.title : null ) ); [21:29:55] +++ b/extensions/ReaderExperiments/resources/experiments/shareHighlight/init.js [21:29:55] + shareButton.style.display = 'none'; [21:30:06] FIRING: [13x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:30:09] So we've managed to confuse Scap into no longer having a hold on what's deployed. 
[21:30:21] but this is indeed the lines we know are deployed and were never rolled back [21:30:44] Which I confirm once more by loading https://en.wikipedia.org/w/extensions/ReaderExperiments/resources/experiments/shareHighlight/init.js and by grepping mw-experiments [21:31:09] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1288925|Revert "Enable wgTrackMediaRequestProvenance on Commons" (T414338 T425580)]] [21:31:14] T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338 [21:31:15] T425580: [Spike] [BUG] POTD Gallery doesn't load, crashes upon share - https://phabricator.wikimedia.org/T425580 [21:31:34] Hey Krinkle - just FYI, another patch for T426631 will be riding the sync-world you just started [21:31:46] for ext:timeline [21:32:13] PROBLEM - Host logging-hd1004 is DOWN: PING CRITICAL - Packet loss = 100% [21:32:13] PROBLEM - Host logging-sd1005 is DOWN: PING CRITICAL - Packet loss = 100% [21:32:42] RECOVERY - Host logging-sd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [21:32:42] RECOVERY - Host logging-hd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [21:32:56] !log krinkle@deploy1003 seddon, krinkle: Backport for [[gerrit:1288925|Revert "Enable wgTrackMediaRequestProvenance on Commons" (T414338 T425580)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:33:21] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:34:40] sbassett: OK. Do you want to test it on staging? [21:35:06] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:35:13] Krinkle: no, it’s really just a minor follow-up patch that should be pretty low-risk. Just additional hardening. 
[21:36:02] I’m going to monitor mediawiki-errors, but that’s really all I’d be concerned about. And I think it’s very low probability that it would introduce a serious bug. [21:38:18] !log krinkle@deploy1003 seddon, krinkle: Continuing with deployment [21:38:21] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:40:58] Krinkle: I just noticed my earlier "rolled back" (or so I thought) deployment messed things up? [21:41:18] Did I get it right you got things sorted out already? Or is there anything I can help with? [21:41:33] still in the middle of it, mid deploy right now [21:42:39] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1288925|Revert "Enable wgTrackMediaRequestProvenance on Commons" (T414338 T425580)]] (duration: 11m 29s) [21:42:44] T414338: FY25-26 WE5.4.12: Identify the provenance of image requests - https://phabricator.wikimedia.org/T414338 [21:42:44] T425580: [Spike] [BUG] POTD Gallery doesn't load, crashes upon share - https://phabricator.wikimedia.org/T425580 [21:42:56] matthiasmullie: TLDR: Your deployment was completed and not actually rolled back. So the "revert" that you merged in Gerrit, did not sync wmf branch with prod, but rather an undeployed patch that was merged in Gerrit but not deployed. [21:43:21] FIRING: [13x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:43:22] matthiasmullie: so when Erik and I separately tried to deploy something, we were asked by Scap whether we want to deploy your revert, and it was unclear what to do there. 
!log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-in2001.wikimedia.org with reason: T426563 [21:43:50] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-out2001.wikimedia.org with reason: T426563 [21:44:01] I am now deploying the patch you scheduled for tomorrow, because that will be a no-op, syncing wmf branch in Gerrit with what's already in prod. [21:44:33] If you know the patch to have been rolled back, do let me know. I don't know the actual functionality or anything, just looked as best I can to see if it is deployed or not. [21:44:58] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-in1001.wikimedia.org with reason: T426563 [21:45:21] Deployed or rolled back doesn't matter; either works, and I'll have to follow things up tomorrow anyway [21:45:22] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-out1001.wikimedia.org with reason: T426563 [21:46:32] It may be worth looking through the IRC and scap logs to figure out what went wrong. I'm assuming you pressed "n" on the staging step in scap and were (rightfully) under the impression that it rolled back your deployment. I don't know why it ended up deployed. [21:46:45] Just to make sure I didn't mess up, and/or know what to do next time: scap told me there was (afaict unrelated) logspam; I told it to rollback; it told me it did, and that I needed to merge a revert (which I didn't deploy) - is that correct, or did I misunderstand something there? [21:47:00] Nope, that all makes sense to me.
https://sal.toolforge.org/production?p=0&q=mlitn&d= [21:47:47] Scap output is here: https://spiderpig.wikimedia.org/jobs/2020 [21:47:51] Notice there that ten minutes after it acknowledged your rollback intent, it says it finished deploying it anyway [21:48:32] Kubernetes deployment summary: [21:48:33] - testservers: rolled-back [21:48:33] - canaries: rolled-back [21:48:33] - production: rolled-back [21:48:33] Result: rolled back [21:49:15] Yeah, 10 minutes sounds about right - took a couple of minutes to "roll back", then a couple more to get the revert merged before I released the lock [21:49:38] I assumed the "Finished scap sync-world" meant "finished backport" [21:49:39] odd :p [21:49:47] finished rollback* rather [21:49:49] Yeah, that seems reasonable. [21:49:56] https://sal.toolforge.org/production?p=0&q=scap&d= [21:50:06] FIRING: [11x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:50:22] I see no other scap start between yours and mine, but maybe something else ended up promoting the docker image you built. [21:50:41] Anyway I'd say reach out to RelEng folks and have them investigate. Might be a bug in spiderpig, or maybe something else promoted it out of band. [21:50:53] OK. my deploy is done. [21:51:03] sbassett: feel free to continue anything else you might have [21:51:08] otherwise, I'm off o/ [21:51:23] I'll ping them tomorrow. Thanks for cleaning up the mess, Krinkle [21:51:38] np, thanks for checking in. [21:53:21] FIRING: [10x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:54:30] Yep, that’s the only sec patch I’m going to attempt today, thanks.
[21:54:58] !log swfrench@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2149-2154].codfw.wmnet [21:55:02] !log swfrench@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2149-2154].codfw.wmnet [21:55:06] FIRING: [10x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [21:57:36] (03PS1) 10Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1288994 [21:57:41] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dsantamaria - https://phabricator.wikimedia.org/T426561#11933674 (10thcipriani) >>! In T426561#11929503, @SLyngshede-WMF wrote: > @thcipriani for your approval. Approved. @DSantamaria if you need deployment for MediaWiki backports, you'll also... [22:00:06] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:00:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:00:47] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:01:13] PROBLEM - Host logstash1033 is DOWN: PING CRITICAL - Packet loss = 100% [22:02:42] RECOVERY - Host logstash1033 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [22:03:21] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:05:14] (03PS1) 
10Kimberly Sarabia: Make image browsing available in Beta and TestWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288996 (https://phabricator.wikimedia.org/T421019) [22:05:45] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:05:47] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:06:07] !log swfrench@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker2331.codfw.wmnet [22:08:21] FIRING: [8x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:09:41] PROBLEM - Host logstash1034 is DOWN: PING CRITICAL - Packet loss = 100% [22:11:34] !log swfrench@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker2331.codfw.wmnet [22:11:41] RECOVERY - Host logstash1034 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [22:15:13] PROBLEM - Host logstash1035 is DOWN: PING CRITICAL - Packet loss = 100% [22:16:42] RECOVERY - Host logstash1035 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [22:18:21] FIRING: [8x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:19:13] (03PS1) 10Kimberly Sarabia: Make image browsing available in Beta and TestWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288996 (https://phabricator.wikimedia.org/T421019) [22:20:29] (03CR) 10Kimberly Sarabia: "Please ignore the above. 
Just realized we still have the guard in Hooks.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288996 (https://phabricator.wikimedia.org/T421019) (owner: 10Kimberly Sarabia) [22:22:13] PROBLEM - Host logstash1036 is DOWN: PING CRITICAL - Packet loss = 100% [22:22:38] (03PS1) 10SBassett: Explicitly set wgCSPUseReportURIDirective and not wmgCSPUseReportURIDirective to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288999 (https://phabricator.wikimedia.org/T424058) [22:22:42] RECOVERY - Host logstash1036 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [22:23:24] (03CR) 10SBassett: [C:04-1] "Hold for config deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288999 (https://phabricator.wikimedia.org/T424058) (owner: 10SBassett) [22:23:36] (03PS6) 10Jdlrobson: Remove MinervaNightMode config after skin cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T426689) (owner: 10HakanIST) [22:24:28] (03PS2) 10SBassett: Explicitly set wgCSPUseReportURIDirective and not wmgCSPUseReportURIDirective to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288999 (https://phabricator.wikimedia.org/T424058) [22:25:36] (03PS3) 10SBassett: Explicitly set wgCSPUseReportURIDirective and not wmgCSPUseReportURIDirective to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1288999 (https://phabricator.wikimedia.org/T424058) [22:25:52] (03CR) 10CI reject: [V:04-1] Remove MinervaNightMode config after skin cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T426689) (owner: 10HakanIST) [22:27:13] PROBLEM - Host logstash1037 is DOWN: PING CRITICAL - Packet loss = 100% [22:27:33] (03PS7) 10Jdlrobson: Remove MinervaNightMode config after skin cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1285523 (https://phabricator.wikimedia.org/T426689) (owner: 10HakanIST) [22:27:42] RECOVERY - Host logstash1037 is UP: PING OK - Packet 
loss = 0%, RTA = 0.31 ms [22:28:21] FIRING: [7x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:30:10] FIRING: [2x] SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:32:13] PROBLEM - Host logging-sd1002 is DOWN: PING CRITICAL - Packet loss = 100% [22:33:42] RECOVERY - Host logging-sd1002 is UP: PING OK - Packet loss = 0%, RTA = 3.49 ms [22:34:43] 10ops-eqiad, 06SRE, 06DC-Ops: Work on storage room cleanup - https://phabricator.wikimedia.org/T423227#11933772 (10VRiley-WMF) 05Open→03Resolved Threw away cardboard and was able to sort a few things. [22:40:06] FIRING: [8x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:43:21] FIRING: [10x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:44:18] 06SRE, 06Infrastructure-Foundations, 10Mail, 06Product Safety and Integrity, and 2 others: yahoo rejecting our emails - https://phabricator.wikimedia.org/T426105#11933806 (10jhathaway) Someone from Yahoo was kind enough to reach out to me directly and modify the IP reputation, so emails are flowing again! 
[22:45:06] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:46:24] FIRING: [42x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:48:21] FIRING: [9x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:48:21] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1059.eqiad.wmnet with OS bookworm [22:48:42] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1060.eqiad.wmnet with OS bookworm [22:50:06] FIRING: [10x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:53:21] FIRING: [11x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:55:06] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:56:13] (03PS1) 10Ladsgroup: Remove wgThumbnailStepsRatio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289000 [22:57:54] (03CR) 10Jforrester: [C:03+1] Remove wgThumbnailStepsRatio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1289000 (owner: 10Ladsgroup) [22:58:21] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [22:59:56] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1059.eqiad.wmnet with reason: host reimage [23:00:05] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T2300) [23:00:24] !log jiji@cumin1003 START - Cookbook 
sre.hosts.downtime for 2:00:00 on mc1060.eqiad.wmnet with reason: host reimage [23:03:25] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1059.eqiad.wmnet with reason: host reimage [23:06:47] (PS6) Cwhite: logstash/thanos-qfe: add event.start [puppet] - https://gerrit.wikimedia.org/r/1287827 (owner: Tiziano Fogli) [23:07:15] (CR) Catrope: [C:+1] Explicitly set wgCSPUseReportURIDirective and not wmgCSPUseReportURIDirective to true [mediawiki-config] - https://gerrit.wikimedia.org/r/1288999 (https://phabricator.wikimedia.org/T424058) (owner: SBassett) [23:07:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1060.eqiad.wmnet with reason: host reimage [23:08:21] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [23:08:38] jouncebot: nowandnext [23:08:38] For the next 0 hour(s) and 51 minute(s): Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260518T2300) [23:08:38] In 2 hour(s) and 51 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260519T0200) [23:09:17] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1289000 (owner: Ladsgroup) [23:09:24] (CR) Cwhite: [C:+1] logstash/thanos-qfe: add event.start (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1287827 (owner: Tiziano Fogli) [23:10:54] (Merged) jenkins-bot: Remove wgThumbnailStepsRatio [mediawiki-config] - https://gerrit.wikimedia.org/r/1289000 (owner: Ladsgroup) [23:11:09] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1289000|Remove wgThumbnailStepsRatio]] [23:12:56] !log ladsgroup@deploy1003 ladsgroup: 
Backport for [[gerrit:1289000|Remove wgThumbnailStepsRatio]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:13:21] FIRING: [13x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [23:13:48] !log ladsgroup@deploy1003 ladsgroup: Continuing with deployment [23:18:02] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1289000|Remove wgThumbnailStepsRatio]] (duration: 06m 52s) [23:18:21] FIRING: [13x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [23:19:05] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1059.eqiad.wmnet with OS bookworm [23:20:06] FIRING: [13x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [23:23:17] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1060.eqiad.wmnet with OS bookworm [23:23:21] FIRING: [13x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [23:25:06] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [23:25:43] SRE, Infrastructure-Foundations, Mail, Product Safety and Integrity, and 2 others: yahoo rejecting our emails - https://phabricator.wikimedia.org/T426105#11933904 (Xaosflux) Thank you, I've sent a few emails, and some emailauth users have reported back that they are now getting their codes. 
[23:25:54] SRE, Infrastructure-Foundations, Mail, Product Safety and Integrity, and 2 others: yahoo rejecting our emails - https://phabricator.wikimedia.org/T426105#11933906 (Xaosflux) a:jhathaway [23:28:21] FIRING: [14x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [23:37:57] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on lists2001.wikimedia.org with reason: T426563 [23:38:21] FIRING: [12x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [23:39:43] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1289003 [23:39:43] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1289003 (owner: TrainBranchBot) [23:41:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:44:06] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on vrts1004.eqiad.wmnet with reason: T426563 [23:44:39] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host vrts1004.eqiad.wmnet [23:48:05] (PS1) Jforrester: IS: Drop wgGraphDefaultVegaVer, never used any more [mediawiki-config] - https://gerrit.wikimedia.org/r/1289005 [23:48:25] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1004.eqiad.wmnet [23:52:35] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1289003 (owner: 
TrainBranchBot) [23:52:44] (PS1) Jforrester: IS: Drop wgEnableSpecialMute, ignored since MW 1.46 [mediawiki-config] - https://gerrit.wikimedia.org/r/1289006 [23:56:38] (PS1) Jforrester: IS: Drop wgDiscussionTools_visualenhancements_*, ignored since 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1289007 [23:57:34] (PS1) Ladsgroup: ThumbLimits: Harmonize svwiki large size with the rest of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1289008 (https://phabricator.wikimedia.org/T376152) [23:58:23] (CR) DLynch: [C:+2] "Yup, these were merged into the remaining `DiscussionTools_visualenhancements` config in I10aa495e9fda4676fe2bf6592ce3a6802b8cf802." [mediawiki-config] - https://gerrit.wikimedia.org/r/1289007 (owner: Jforrester) [23:58:48] Kemayo: Let's not C+2 prod config patches unless we're deploying them right now. :-) [23:59:33] (CR) DLynch: [C:+1] "Ahem. I mean." [mediawiki-config] - https://gerrit.wikimedia.org/r/1289007 (owner: Jforrester) [23:59:57] James_F: Oops, didn't occur to me which repo that was in.