[00:00:53] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2262 [00:01:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2262 [00:01:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2262.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:01:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2262.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:15:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1119.eqiad.wmnet with OS bullseye [00:15:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10672063 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host elastic11... 
[00:15:58] !log restarting varnishkafka on cp7009 [00:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:18] !log [correction]restarting varnishkafka-webrequest on cp7009 [00:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:22:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:25:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1119.eqiad.wmnet with reason: host reimage [00:28:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1119.eqiad.wmnet with reason: host reimage [00:35:05] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [00:38:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1130763 [00:38:35] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1130763 (owner: 10TrainBranchBot) [00:40:02] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2262 to codfw - jhancock@cumin2002" [00:41:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2262 to codfw - jhancock@cumin2002" [00:41:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:41:52] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2262 [00:41:56] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2272 [00:42:03] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2273 [00:42:07] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2274 [00:42:10] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2275 [00:42:14] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2276 [00:42:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2262 [00:42:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker2273 [00:42:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2272 [00:42:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker2275 [00:42:29] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2274 [00:42:32] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2275 [00:42:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2276 [00:42:38] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2273 [00:42:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2275 [00:42:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:42:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2273 [00:43:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:43:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1119.eqiad.wmnet with OS bullseye [00:43:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10672105 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host elastic1119.e... 
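The cookbook `END (PASS)` / `END (FAIL)` entries above follow a regular shape, so a run of retries can be tallied mechanically. The following is a minimal sketch only: the regular expression is an assumption inferred from the entries in this log, not an official grammar for the Server Admin Log, and the helper names `summarize_cookbook_runs` and `retried_after_failure` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern, inferred from the END entries in this log.
# The "for host ..." part is optional because some cookbooks
# (e.g. sre.dns.netbox) do not report a host.
END_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] !log (?P<who>\S+) "
    r"END \((?P<status>PASS|FAIL)\) - Cookbook (?P<cookbook>\S+) "
    r"\(exit_code=(?P<code>\d+)\)(?: for host (?P<host>\S+))?"
)


def summarize_cookbook_runs(text):
    """Map (cookbook, host) to the list of END statuses, in log order."""
    runs = defaultdict(list)
    # finditer is used instead of per-line matching because several
    # messages can share one physical line in this transcript.
    for match in END_RE.finditer(text):
        runs[(match["cookbook"], match["host"])].append(match["status"])
    return dict(runs)


def retried_after_failure(runs):
    """Return the keys that failed at least once and ended on a PASS."""
    return [
        key
        for key, statuses in runs.items()
        if "FAIL" in statuses and statuses[-1] == "PASS"
    ]


if __name__ == "__main__":
    sample = (
        "[00:42:23] !log jhancock@cumin2002 END (FAIL) - Cookbook "
        "sre.network.configure-switch-interfaces (exit_code=99) "
        "for host wikikube-worker2273 "
        "[00:42:47] !log jhancock@cumin2002 END (PASS) - Cookbook "
        "sre.network.configure-switch-interfaces (exit_code=0) "
        "for host wikikube-worker2273"
    )
    for cookbook, host in retried_after_failure(summarize_cookbook_runs(sample)):
        print(f"{cookbook} on {host}: failed, then passed on retry")
```

Matching only `END` entries keeps the sketch short; pairing each `END` with its `START` to compute durations would need extra bookkeeping, since several runs overlap in time here.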
[00:43:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:45:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2262.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:45:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2272.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:45:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:45:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2274.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:45:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2275.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:45:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2276.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:45:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:45:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2262.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:46:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2262.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:46:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:47:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
wikikube-worker2262.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:51:19] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1130763 (owner: 10TrainBranchBot) [00:56:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2276.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:56:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2274.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:56:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2275.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:56:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2272.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:57:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2272.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:57:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:58:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [00:59:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2274.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:00:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2275.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:00:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2276.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:03:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) 
for host wikikube-worker2272.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:04:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:04:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2274.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:06:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2275.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:06:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2276.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:08:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1130768 [01:08:23] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1130768 (owner: 10TrainBranchBot) [01:14:06] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [01:18:12] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2263 to codfw - jhancock@cumin2002" [01:18:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2263 to codfw - jhancock@cumin2002" [01:18:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:24:11] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [01:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672139 (10phaultfinder) [01:28:41] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1130768 (owner: 
10TrainBranchBot) [01:31:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2263 to codfw - jhancock@cumin2002" [01:31:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2263 to codfw - jhancock@cumin2002" [01:31:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:32:58] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2263 [01:33:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2263 [01:33:10] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2264 [01:33:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2264 [01:33:35] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2265 [01:33:41] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2266 [01:33:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2265 [01:33:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2266 [01:33:52] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2267 [01:33:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2267 [01:34:26] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2268 [01:34:34] !log jhancock@cumin2002 
END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2268 [01:34:37] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2269 [01:34:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2269 [01:35:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:36:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:36:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:36:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2264.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:36:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2264.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:37:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:37:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2264.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:37:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:37:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2265.mgmt.codfw.wmnet with chassis set 
policy FORCE_RESTART [01:37:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2266.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:37:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2267.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:37:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2268.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:38:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2264.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:38:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2266.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:38:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2268.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:39:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2266.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:39:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2268.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:41:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2269.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:48:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2267.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:49:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2265.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:50:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2268.mgmt.codfw.wmnet with chassis set policy 
FORCE_RESTART [01:50:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2266.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:51:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2269.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:52:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2269.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [01:58:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2269.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T0200) [02:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672199 (10phaultfinder) [02:08:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.22 [core] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1130769 (https://phabricator.wikimedia.org/T386217) [02:08:29] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.22 [core] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1130769 (https://phabricator.wikimedia.org/T386217) (owner: 10TrainBranchBot) [02:20:58] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.22 [core] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1130769 (https://phabricator.wikimedia.org/T386217) (owner: 10TrainBranchBot) [02:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672217 (10phaultfinder) [02:50:55] (03PS1) 10Jdlrobson: Web features should not be ambiguously configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) [03:00:05] 
Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T0300) [03:09:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672267 (10phaultfinder) [03:34:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672292 (10phaultfinder) [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T0400) [04:04:57] !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.22, 1.43.0-wmf.23, 1.43.0-wmf.24, 1.43.0-wmf.25, 1.43.0-wmf.26, 1.43.0-wmf.27, 1.43.0-wmf.28, 1.44.0-wmf.1, 1.44.0-wmf.2, 1.44.0-wmf.3, 1.44.0-wmf.4, 1.44.0-wmf.5, 1.44.0-wmf.6, 1.44.0-wmf.8, 1.44.0-wmf.11, 1.44.0-wmf.12, 1.44.0-wmf.13, 1.44.0-wmf.14, 1.44.0-wmf.15, 1.44.0-wmf.16, 1.44.0-wmf.17, 1.44.0-wmf.18, 1.44.0-wmf.19 (duration: 04m 48s) [04:05:48] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672344 (10phaultfinder) [04:21:36] FIRING: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [04:31:36] RESOLVED: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - 
https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [04:38:36] FIRING: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [04:51:33] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [04:53:36] RESOLVED: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [05:00:33] (03CR) 10Kevin Bazira: [C:03+2] changeprop: add liftwing revertrisk-language-agnostic stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130349 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [05:02:20] (03Merged) 10jenkins-bot: changeprop: add liftwing revertrisk-language-agnostic stream to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130349 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:43] 10ops-eqiad, 
06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672397 (10phaultfinder) [05:24:28] (03PS3) 10KartikMistry: MinT: staging: Increase liveness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889) [05:35:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:54:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672414 (10phaultfinder) [05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T0600) [06:00:05] marostegui, Amir1, and federico3: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T0600). nyaa~ [06:12:54] 10ops-codfw, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389913 (10phaultfinder) 03NEW [06:17:56] Deploying MinT on staging. [06:18:53] marostegui OK to do it? Primary DC switchover is listed in the Deployment window. [06:19:26] ah. That's DB, not DC. 
[06:20:05] (03CR) 10Brouberol: [C:03+2] Add airflow.discovery.wmnet to the airflow-main x509 SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130705 (owner: 10Brouberol) [06:20:16] kart_: you can go ahead [06:26:11] Thanks [06:26:47] (03CR) 10KartikMistry: [C:03+2] MinT: staging: Increase liveness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889) (owner: 10KartikMistry) [06:28:11] (03Merged) 10jenkins-bot: MinT: staging: Increase liveness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128067 (https://phabricator.wikimedia.org/T386889) (owner: 10KartikMistry) [06:29:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672463 (10phaultfinder) [06:32:16] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply [06:32:22] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [06:34:03] hmm. Not sure why diff not showing changes in `values-staging.yaml` [06:39:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672481 (10phaultfinder) [06:53:50] (03PS1) 10Marostegui: db1151,db2144: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1130882 (https://phabricator.wikimedia.org/T387332) [06:54:15] (03CR) 10Marostegui: [C:03+2] db1151,db2144: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1130882 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [07:00:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:00:28] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:19:35] Why I didn't see changes I did in: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1128067 in the `helmfile -e staging diff --context 5` - what am I missing here? 
[07:20:58] (03CR) 10Arnaudb: [C:03+1] gerrit: raise heap limit from 32g to 64g [puppet] - 10https://gerrit.wikimedia.org/r/1130597 (https://phabricator.wikimedia.org/T387223) (owner: 10Hashar) [07:29:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672533 (10phaultfinder) [07:42:46] good morning [07:44:07] the train presync failed over night [07:46:37] (03PS1) 10Muehlenhoff: Create insetup role for SRE IF with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1130944 (https://phabricator.wikimedia.org/T389825) [07:47:00] (03CR) 10CI reject: [V:04-1] Create insetup role for SRE IF with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1130944 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [07:54:08] (03PS2) 10Muehlenhoff: Create insetup role for SRE IF with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1130944 (https://phabricator.wikimedia.org/T389825) [07:54:32] (03CR) 10CI reject: [V:04-1] Create insetup role for SRE IF with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1130944 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [07:57:39] (03PS3) 10Muehlenhoff: Create insetup role for SRE IF with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1130944 (https://phabricator.wikimedia.org/T389825) [07:59:08] abijeet: Are you self-deploying? If not, I'm happy to help. [07:59:49] awight, thanks. kart_ will help deploy [08:00:04] abijeet: Great, have fun! [08:00:04] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T0800). [08:00:04] abijeet and hashar: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:05] andre and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T0800). [08:00:15] awight: abijeet please hold [08:00:23] ack [08:00:29] the train automatic tasks failed over night :) [08:00:38] let me check the state [08:00:41] kart_: ^ also FYI [08:02:15] yeah it is fine to proceed. The wmf.22 branch has been cut, simply the new Docker image has NOT been built [08:02:22] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10672585 (10MoritzMuehlenhoff) [08:02:43] and I am not sure what will happen with the first backport, hopefully that is not going to build the full image else that takes 40/50 minutes iirc [08:02:49] anyway, you can deploy [08:03:09] Thanks hashar [08:03:55] abijeet: deploying first change [08:04:09] ok [08:04:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130345 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [08:05:15] (03Merged) 10jenkins-bot: AX: Disable automatic translation entrypoints before release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130345 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [08:05:25] hashar: eh, thanks [08:05:29] !log Shifted MediaWiki train UTC-0 version window by one hour to avoid conflict with backport window. [08:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:32] jouncebot: refresh [08:05:33] I refreshed my knowledge about deployments. 
[08:05:35] jouncebot: now [08:05:35] For the next 0 hour(s) and 54 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T0800) [08:05:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672588 (10phaultfinder) [08:05:39] jouncebot: next [08:05:39] In 0 hour(s) and 54 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T0900) [08:05:40] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1130345|AX: Disable automatic translation entrypoints before release (T389176)]] [08:05:42] \o/ [08:05:44] T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176 [08:05:58] andre: good morning, I have moved the train one hour later (well back at the usual 10am time :b ) [08:06:04] and fixed the security patch that was faulty [08:06:20] I thought that might block the backport window :b [08:06:31] yeah I saw confusion Daylight Time Confusion around [08:06:44] uh thanks for the patch fix, hadn't checked yet but pinged on that task yesterday [08:07:41] the scap-presync job failed over night, and I don't think we have a way to run it (we lack the sudo permissions iirc) [08:08:45] !log installing jinja2 security updates [08:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:13] !log kartik@deploy1003 kartik, abi: Backport for [[gerrit:1130345|AX: Disable automatic translation entrypoints before release (T389176)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:11:17] T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176 [08:11:38] !log joal@deploy1003 Started deploy [analytics/refinery@2f09783]: Regular analytics weekly train [analytics/refinery@2f097836] [08:12:11] abijeet: possible to test on testserver? 
[08:12:56] kart_, I'm seeing a separate issue: https://phabricator.wikimedia.org/T389920 that blocks QA of this. [08:13:55] ah. Should we go ahead and fix or cancel + revert? [08:14:13] !log joal@deploy1003 Finished deploy [analytics/refinery@2f09783]: Regular analytics weekly train [analytics/refinery@2f097836] (duration: 02m 35s) [08:15:14] I'm quite confident that our patch should not cause an issue. The previous issue is probably caused due to SX roll-out. I'm checking. [08:15:54] Most likely, seems unrelated. You can test it without testserver and confirm :) [08:16:56] kart_, this caused that issue: 1130169: Enable Section Translation and Unified Dashboard on all wikipedias | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1130169 ... I can submit a fix we will have to backport that. [08:17:24] !log joal@deploy1003 Started deploy [analytics/refinery@2f09783] (thin): Regular analytics weekly train THIN [analytics/refinery@2f097836] [08:17:57] abijeet: sure. Let's continue with this as of now? [08:18:17] !log joal@deploy1003 Finished deploy [analytics/refinery@2f09783] (thin): Regular analytics weekly train THIN [analytics/refinery@2f097836] (duration: 00m 52s) [08:18:39] !log joal@deploy1003 Started deploy [analytics/refinery@2f09783] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2f097836] [08:19:19] !log joal@deploy1003 Finished deploy [analytics/refinery@2f09783] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2f097836] (duration: 00m 39s) [08:19:20] internet issues. not sure if my last message made through: [08:19:21] kart_, this caused that issue: 1130169: Enable Section Translation and Unified Dashboard on all wikipedias | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1130169 ... I can submit a fix we will have to backport that.  [08:19:48] Yes. I got the msg. [08:20:01] abijeet_: Should we continue with current deployment? [08:21:12] give me a moment. 
[08:22:53] kart_, yes, lets go ahead. I was just verifying the code once. [08:23:17] (03PS8) 10Superpes15: [pswiki] Change the wordmark and the tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031963 (https://phabricator.wikimedia.org/T360851) [08:23:36] !log kartik@deploy1003 kartik, abi: Continuing with sync [08:29:36] Is there any time for deploying a simple config patch during this window? [08:29:54] (03PS1) 10Hashar: Allow releng to resume train related systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) [08:30:54] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130345|AX: Disable automatic translation entrypoints before release (T389176)]] (duration: 25m 13s) [08:30:58] T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176 [08:31:34] abijeet_: let's go with second patch? [08:31:56] Superpes: yes please add it to the window :) [08:32:09] kart_, sure [08:32:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886) (owner: 10Abijeet Patro) [08:32:30] Done thanks hashar :) [08:33:13] (03Merged) 10jenkins-bot: AX: Add quick survey for MinT for Wikireaders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126617 (https://phabricator.wikimedia.org/T381886) (owner: 10Abijeet Patro) [08:33:26] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1126617|AX: Add quick survey for MinT for Wikireaders (T381886)]] [08:33:30] T381886: Show survey to users of MinT for Wiki Readers - https://phabricator.wikimedia.org/T381886 [08:34:03] Superpes: I will deploy your patch once they have finished deploy their patch :) [08:35:35] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.3.0 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1130949 [08:37:07] (03CR) 
10Muehlenhoff: [C:03+2] dynamicproxy::api: Install python3-flask-sqlalchemy from "main" component [puppet] - 10https://gerrit.wikimedia.org/r/1130078 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [08:37:09] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v1.3.0 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1130949 (owner: 10Volans) [08:37:24] !log joal@deploy1003 Started deploy [airflow-dags/analytics@001332b]: Regular analytics weekly train [airflow-dags/analytics@001332b5] [08:37:58] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@001332b]: Regular analytics weekly train [airflow-dags/analytics@001332b5] (duration: 00m 33s) [08:38:24] !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1126617|AX: Add quick survey for MinT for Wikireaders (T381886)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:38:53] abijeet: testing on the testserver please. [08:41:47] kart_, on it [08:41:57] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.3.0 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1130949 (owner: 10Volans) [08:44:56] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Make choice of firewall stack in insetup roles specific / Add nftables variants - https://phabricator.wikimedia.org/T389825#10672680 (10ayounsi) If I understand correctly, even if it's a long term goal, we're going towards nftables. So should the nftab... [08:45:28] (03CR) 10Muehlenhoff: [C:03+2] Fix relforge Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1130104 (owner: 10Muehlenhoff) [08:46:09] abijeet: testing? [08:46:48] kart_, attending meeting and testing is difficult [08:46:56] kart_, but looks good. we can roll it out. [08:47:00] :) [08:47:07] cool. 
[08:47:08] !log joal@deploy1003 Started deploy [airflow-dags/analytics@324a662]: Regular analytics weekly train [airflow-dags/analytics@324a6629] [08:47:13] !log kartik@deploy1003 abi, kartik: Continuing with sync [08:47:38] !log joal@deploy1003 Finished deploy [airflow-dags/analytics@324a662]: Regular analytics weekly train [airflow-dags/analytics@324a6629] (duration: 00m 30s) [08:50:15] (03CR) 10DCausse: [C:04-1] WIP: wdqs: Add alerts for no lag metrics reported (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1130730 (https://phabricator.wikimedia.org/T389859) (owner: 10Bking) [08:50:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130611 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:52:05] hashar: few more minutes.. [08:52:46] I am still around and my guess is Superpes is around as well :b [08:53:04] andre: I'd like to extend a bit to get the Korean logo updated :) [08:53:14] hashar: sure sure [08:53:47] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for acooper - https://phabricator.wikimedia.org/T389924 (10acooper) 03NEW [08:54:26] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1126617|AX: Add quick survey for MinT for Wikireaders (T381886)]] (duration: 20m 59s) [08:54:31] T381886: Show survey to users of MinT for Wiki Readers - https://phabricator.wikimedia.org/T381886 [08:55:54] hashar: done. Sorry for the delay :) [08:56:17] (03PS1) 10Brouberol: Add monitoring over the mediawiki dumps legacy CephFS PVC available space [alerts] - 10https://gerrit.wikimedia.org/r/1130952 (https://phabricator.wikimedia.org/T389762) [08:56:19] no worries! 
[08:56:27] !log depooling & restarting blazegraph on wdqs1013 (deadlocked) [08:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:30] (03PS2) 10Brouberol: Add monitoring over the mediawiki dumps legacy CephFS PVC available space [alerts] - 10https://gerrit.wikimedia.org/r/1130952 (https://phabricator.wikimedia.org/T389762) [08:56:34] there is no rush in deploying those patches and they do require some verifications :) [08:57:03] Superpes: I am doing your patch :) [08:57:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130147 (https://phabricator.wikimedia.org/T389631) (owner: 10Superpes15) [08:57:51] Yep thanks :) [08:57:53] kart_, thanks! [08:58:09] (03CR) 10CI reject: [V:04-1] Add monitoring over the mediawiki dumps legacy CephFS PVC available space [alerts] - 10https://gerrit.wikimedia.org/r/1130952 (https://phabricator.wikimedia.org/T389762) (owner: 10Brouberol) [08:58:25] (03Merged) 10jenkins-bot: [kowikiquote] Change the logo and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130147 (https://phabricator.wikimedia.org/T389631) (owner: 10Superpes15) [08:58:36] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1130147|[kowikiquote] Change the logo and wordmark (T389631)]] [08:58:40] T389631: Setting new logo & wordmark for kowikiquote - https://phabricator.wikimedia.org/T389631 [08:59:39] !log repooling wdqs1018 [08:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:57] (03CR) 10Muehlenhoff: [C:03+2] postgresl/osm_master: Make postgresql listen on ipv4 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1130611 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:00:05] andre and jnuche: MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T0900). Please do the needful. 
[09:00:30] RESOLVED: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:01:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 1.234s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:01:42] (03PS1) 10Volans: Upstream release v1.3.0 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1130953 [09:02:00] (03CR) 10Volans: [C:03+2] Upstream release v1.3.0 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1130953 (owner: 10Volans) [09:02:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:03:41] !log hashar@deploy1003 superpes, hashar: Backport for [[gerrit:1130147|[kowikiquote] Change the logo and wordmark (T389631)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:03:45] T389631: Setting new logo & wordmark for kowikiquote - https://phabricator.wikimedia.org/T389631 [09:03:47] Testing! 
[09:05:50] 8) [09:06:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 1.128s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:06:31] Looks fine thanks :) [09:07:04] (03Merged) 10jenkins-bot: Upstream release v1.3.0 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1130953 (owner: 10Volans) [09:08:22] Superpes: thank you for the test! [09:08:25] !log hashar@deploy1003 superpes, hashar: Continuing with sync [09:09:34] (03PS13) 10Tiziano Fogli: sre.puppet.sync-netbox-hiera: add rack/row to network_devices [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) [09:12:35] (03PS7) 10Tiziano Fogli: netbox-hiera: adding pdu type [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) [09:13:16] I will deploy the patches to remove unused configs later [09:13:28] (03PS3) 10Brouberol: Add monitoring over the mediawiki dumps legacy CephFS PVC available space [alerts] - 10https://gerrit.wikimedia.org/r/1130952 (https://phabricator.wikimedia.org/T389762) [09:15:10] (03CR) 10Federico Ceratto: "Tested during https://phabricator.wikimedia.org/T389790" [cookbooks] - 10https://gerrit.wikimedia.org/r/1127071 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [09:15:57] (03CR) 10Filippo Giunchedi: "Yes the general idea LGTM, left some comments inline!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1130689 (https://phabricator.wikimedia.org/T383963) (owner: 10Cwhite) [09:15:59] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130147|[kowikiquote] Change the logo and wordmark (T389631)]] (duration: 17m 22s) [09:16:03] T389631: Setting new logo & wordmark for kowikiquote - https://phabricator.wikimedia.org/T389631 [09:16:18] !log uploaded python3-wmflib_1.3.0 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia [09:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:40] Thanks for your time and assistance hashar :) [09:16:50] !log fabfur@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cp4047.ulsfo.wmnet with reason: HW errors [09:16:50] (03CR) 10Filippo Giunchedi: [C:03+1] Promote some network alerts from warning to critical [alerts] - 10https://gerrit.wikimedia.org/r/1130632 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [09:16:56] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10672785 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ad64e6e6-2b04-43d8-9182-2119e76abaed) set by fabfur@cumin1002 for 3 days, 0:00:00 on 1 host(s) and their services with reason:... [09:20:01] (03PS8) 10Tiziano Fogli: netbox-hiera: adding pdu type [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) [09:20:06] Superpes: thank you for the patch and the testing! [09:20:16] andre: the logo is updated! All your [09:20:21] thanks! [09:20:45] (03PS18) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) [09:21:11] hashar: I assume I need to first deploy to testwikis (group-1) and then group0? 
[09:21:19] asking as that job failed [09:21:47] hmm [09:21:56] yeah correct [09:22:07] alright, will do. thanks [09:22:09] the thing that failed was the security patch [09:22:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.191s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:22:16] (03PS38) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [09:22:33] and that is going to build the 9GBytes image which would take a while to roll [09:22:34] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130958 (https://phabricator.wikimedia.org/T386217) [09:22:35] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130958 (https://phabricator.wikimedia.org/T386217) (owner: 10TrainBranchBot) [09:22:39] + all the caches etc [09:22:48] I say it is probably going to take an hour [09:22:58] uh, right [09:23:28] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130958 (https://phabricator.wikimedia.org/T386217) (owner: 10TrainBranchBot) [09:23:52] !log aklapper@deploy1003 Started scap sync-world: testwikis to 1.44.0-wmf.22 refs T386217 [09:23:56] T386217: 1.44.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T386217 [09:24:32] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Make choice of firewall stack in insetup roles specific / Add nftables variants - https://phabricator.wikimedia.org/T389825#10672823 (10MoritzMuehlenhoff) >>! 
In T389825#10672680, @ayounsi wrote: > If I understand correctly, even if it's a long term go... [09:27:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.191s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:29:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672829 (10phaultfinder) [09:32:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:34:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672861 (10phaultfinder) [09:39:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672871 (10phaultfinder) [09:42:19] (03CR) 10Fabfur: haproxy: using tmpfs directory for private tls material (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [09:43:26] (03PS18) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [09:43:45] (03CR) 10Filippo Giunchedi: "LGTM overall, left some comments inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [09:45:02] !log repooling wdqs1013 [09:45:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:27] (03PS1) 10Ladsgroup: Bump thumbnail steps to 40% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130962 (https://phabricator.wikimedia.org/T360589) [09:45:40] jouncebot: nowandnext [09:45:40] For the next 1 hour(s) and 14 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T0900) [09:45:40] In 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T1000) [09:48:36] (03PS1) 10Slyngshede: Alert when mirrors become out of date [alerts] - 10https://gerrit.wikimedia.org/r/1130964 (https://phabricator.wikimedia.org/T350694) [09:51:20] (03PS2) 10Ayounsi: Promote some network alerts from warning to critical [alerts] - 10https://gerrit.wikimedia.org/r/1130632 (https://phabricator.wikimedia.org/T384052) [09:51:20] (03PS1) 10Ayounsi: Add "scope: network" to network related alerts [alerts] - 10https://gerrit.wikimedia.org/r/1130965 (https://phabricator.wikimedia.org/T384052) [09:51:39] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [09:52:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:54:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10672944 (10phaultfinder) [09:54:53] (03CR) 10Muehlenhoff: "Looks great, but I believe two things are missing (unless they changed in other ways between 7.0 and 7.1), see comments inline" 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1111636 (owner: 10Slyngshede) [09:55:01] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, thanks for clarifying what happens if we break the stats :)" [alerts] - 10https://gerrit.wikimedia.org/r/1130632 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [09:56:31] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1130944 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [09:57:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T389381 [09:57:32] T389381: Switchover s5 master (db2192 -> db2213) - https://phabricator.wikimedia.org/T389381 [09:57:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2213 with weight 0 T389381', diff saved to https://phabricator.wikimedia.org/P74371 and previous config saved to /var/cache/conftool/dbconfig/20250325-095741-fceratto.json [09:57:59] (03CR) 10Ayounsi: [C:03+2] Promote some network alerts from warning to critical [alerts] - 10https://gerrit.wikimedia.org/r/1130632 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [09:58:03] (03CR) 10Cathal Mooney: [C:03+1] "Makes sense" [alerts] - 10https://gerrit.wikimedia.org/r/1130965 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [09:58:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool ms1 T387332', diff saved to https://phabricator.wikimedia.org/P74372 and previous config saved to /var/cache/conftool/dbconfig/20250325-095817-marostegui.json [09:58:22] T387332: Set up ms1, ms2, ms3 db clusters - https://phabricator.wikimedia.org/T387332 [09:59:13] (03Merged) 10jenkins-bot: Promote some network alerts from warning to critical [alerts] - 10https://gerrit.wikimedia.org/r/1130632 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [09:59:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T388236#10672986 (10phaultfinder) [09:59:48] (03CR) 10Filippo Giunchedi: "Neat, LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1130965 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [09:59:52] (03CR) 10Filippo Giunchedi: [C:03+1] Add "scope: network" to network related alerts [alerts] - 10https://gerrit.wikimedia.org/r/1130965 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [10:00:05] andre and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T0900). [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T1000) [10:00:24] (03CR) 10Ayounsi: [C:03+2] Add "scope: network" to network related alerts [alerts] - 10https://gerrit.wikimedia.org/r/1130965 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [10:00:51] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2142.codfw.wmnet [10:00:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2213 from API/vslow/dump T389381', diff saved to https://phabricator.wikimedia.org/P74373 and previous config saved to /var/cache/conftool/dbconfig/20250325-100055-fceratto.json [10:01:02] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1152.eqiad.wmnet [10:01:33] Still deploying 1.44.0-wmf.22 to testwikis (pre-train; group minus1). 
Afterwards tackling group0 [10:01:38] (03Merged) 10jenkins-bot: Add "scope: network" to network related alerts [alerts] - 10https://gerrit.wikimedia.org/r/1130965 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [10:04:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2142.codfw.wmnet,db1152.eqiad.wmnet with reason: Maintenance in ms1 [10:05:01] (03CR) 10Marostegui: [C:03+1] mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1129307 (https://phabricator.wikimedia.org/T389381) (owner: 10Gerrit maintenance bot) [10:05:09] (03CR) 10Federico Ceratto: [C:03+1] mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1129307 (https://phabricator.wikimedia.org/T389381) (owner: 10Gerrit maintenance bot) [10:05:12] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1129307 (https://phabricator.wikimedia.org/T389381) (owner: 10Gerrit maintenance bot) [10:05:24] (03CR) 10Marostegui: [C:03+1] mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1129310 (https://phabricator.wikimedia.org/T389383) (owner: 10Gerrit maintenance bot) [10:06:18] !log aklapper@deploy1003 Finished scap sync-world: testwikis to 1.44.0-wmf.22 refs T386217 (duration: 42m 25s) [10:06:22] T386217: 1.44.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T386217 [10:07:12] !log Starting s5 codfw failover from db2192 to db2213 - T389381 [10:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:16] T389381: Switchover s5 master (db2192 -> db2213) - https://phabricator.wikimedia.org/T389381 [10:07:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2142.codfw.wmnet [10:08:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1152.eqiad.wmnet [10:08:16] !incidents [10:08:16] 5786 
(UNACKED) db2142 (paged)/MariaDB Replica IO: ms1 (paged) [10:08:16] 5780 (RESOLVED) db2179 (paged)/MariaDB Replica Lag: s4 (paged) [10:08:17] 5779 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [10:08:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2213 to s5 primary T389381', diff saved to https://phabricator.wikimedia.org/P74374 and previous config saved to /var/cache/conftool/dbconfig/20250325-100825-fceratto.json [10:08:32] !ack 5786 [10:08:32] 5786 (ACKED) db2142 (paged)/MariaDB Replica IO: ms1 (paged) [10:08:41] I'm now going to deploy 1.44.0-wmf.22 to group0 [10:09:12] !incidents [10:09:13] 5786 (RESOLVED) db2142 (paged)/MariaDB Replica IO: ms1 (paged) [10:09:13] 5780 (RESOLVED) db2179 (paged)/MariaDB Replica Lag: s4 (paged) [10:09:13] 5779 (RESOLVED) GatewayBackendErrorsHigh sre (rate_limit_cluster api-gateway eqiad) [10:09:51] jelto: db2142 paged? [10:10:06] yep and then resolved after 30 seconds. I guess this is your maintenance? [10:10:13] jelto: yes, but: [11:04:20] <+logmsgbot> !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2142.codfw.wmnet,db1152.eqiad.wmnet with reason: Maintenance in ms1 [10:10:20] Another downtime that was lost [10:11:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool ms1 T387332', diff saved to https://phabricator.wikimedia.org/P74375 and previous config saved to /var/cache/conftool/dbconfig/20250325-101101-marostegui.json [10:11:06] T387332: Set up ms1, ms2, ms3 db clusters - https://phabricator.wikimedia.org/T387332 [10:11:46] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" [alerts] - 10https://gerrit.wikimedia.org/r/1130964 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:12:04] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130969 (https://phabricator.wikimedia.org/T386217) [10:12:05] (03CR) 10TrainBranchBot: [C:03+2] group0 to 
1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130969 (https://phabricator.wikimedia.org/T386217) (owner: 10TrainBranchBot) [10:12:17] "Another downtime that was lost" :/ [10:12:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2192 T389381', diff saved to https://phabricator.wikimedia.org/P74376 and previous config saved to /var/cache/conftool/dbconfig/20250325-101222-fceratto.json [10:12:23] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1130195 (https://phabricator.wikimedia.org/T388388) (owner: 10JHathaway) [10:12:27] T389381: Switchover s5 master (db2192 -> db2213) - https://phabricator.wikimedia.org/T389381 [10:12:43] Amir1: Unfortunately, I am seeing it very often lately [10:12:54] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130969 (https://phabricator.wikimedia.org/T386217) (owner: 10TrainBranchBot) [10:12:58] ack thanks for the info :) [10:13:06] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1130138 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [10:16:44] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2192.codfw.wmnet [10:16:53] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db2192.codfw.wmnet [10:17:20] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2192.codfw.wmnet [10:17:28] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db2192.codfw.wmnet [10:18:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove hosts from x2 T387332', diff saved to https://phabricator.wikimedia.org/P74377 and previous config saved to /var/cache/conftool/dbconfig/20250325-101805-marostegui.json [10:18:10] T387332: Set up ms1, ms2, ms3 db clusters - https://phabricator.wikimedia.org/T387332 [10:18:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and 
cr1-drmrs (185.15.58.142) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:18:44] (03CR) 10Stevemunene: [C:03+2] Remove docker related referrences on dse-k8s worker and master [puppet] - 10https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [10:18:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:et-0/0/3 (Core: asw1-b12-drmrs:et-0/0/49 {#D0100}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:19:28] (03PS1) 10Marostegui: conftool: Remove x2 from allowed sections [puppet] - 10https://gerrit.wikimedia.org/r/1130970 (https://phabricator.wikimedia.org/T387332) [10:20:57] marostegui: btw I checked and it doesn't look like the downtime was processed by icinga, i.e. no trace /srv/icinga-logs/icinga-03-25-2025-00.log even though spicerack does a best effort attempt at checking (i.e. T309447) [10:20:57] T309447: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 [10:21:45] godog: Then the output is sort of a lie? cause I got: Created silence ID a808be2d-3347-4f1c-9f16-96c07d273c16 [10:22:12] marostegui: that is the alertmanager id, did you get anything for icinga ? 
[10:22:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:22:31] (03PS2) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [10:22:42] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2192.codfw.wmnet [10:22:43] godog: This is the full output: https://phabricator.wikimedia.org/P74378 [10:23:03] godog: Should we maybe change the output to make it clearer that a downtime may have not been processed? [10:23:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:et-0/0/3 (Core: asw1-b12-drmrs:et-0/0/49 {#D0100}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:23:59] godog: marostegui "Maybe the page was triggered by an email to vops sent directly from Icinga, so the silence in Alertmanager is not effective [10:23:59] marostegui: I'm reading spicerack/icinga.py and that should be already the case when the downtime is not found, not sure what's going on tbh [10:24:15] (03CR) 10Ladsgroup: [C:03+1] conftool: Remove x2 from allowed sections [puppet] - 10https://gerrit.wikimedia.org/r/1130970 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [10:24:18] (03CR) 10Marostegui: [C:03+2] conftool: Remove x2 from allowed sections [puppet] - 10https://gerrit.wikimedia.org/r/1130970 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [10:24:38] (03PS3) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 
10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [10:24:39] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db2192.codfw.wmnet [10:24:47] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2192.codfw.wmnet [10:24:52] 07SRE-Unowned, 10Maps: Build and import imposm 0.14.1 plus latest bugfix - https://phabricator.wikimedia.org/T389780#10673065 (10TheDJ) [10:24:57] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2192 - Upgrading db2192.codfw.wmnet - fceratto@cumin1002 [10:25:05] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2192 - Upgrading db2192.codfw.wmnet - fceratto@cumin1002 [10:25:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:26:12] why we didn't get a page on IRC/ [10:26:12] ? [10:26:13] ^^ it'll be fine, we're just running on the limit of alerting because one-DC [10:26:32] volans: i think because of the silence on alertmanager [10:26:33] tappof: in theory the downtime should have been both icinga and alertmanager according to https://phabricator.wikimedia.org/P74378 [10:26:40] (03CR) 10Ayounsi: [C:03+1] "thx for the clarification on the task." 
[puppet] - 10https://gerrit.wikimedia.org/r/1130944 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [10:26:50] (03PS1) 10Marostegui: sections.yaml: Remove x2 [puppet] - 10https://gerrit.wikimedia.org/r/1130972 (https://phabricator.wikimedia.org/T387332) [10:26:52] Mar 25 10:07:25 alert1002 icinga[4514]: EXTERNAL COMMAND: DEL_DOWNTIME_BY_HOST_NAME;db2142 [10:27:27] (03CR) 10Ladsgroup: [C:03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/1130972 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [10:27:43] godog: ^^^ [10:27:54] (03CR) 10Marostegui: [C:03+2] sections.yaml: Remove x2 [puppet] - 10https://gerrit.wikimedia.org/r/1130972 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [10:28:17] it was set at 10:04:19 [10:28:40] volans: hah, I stand corrected, didn't realize not all icinga output makes it to /srv/icinga-logs :( [10:28:47] marostegui: which script/cookbook were you running? [10:28:56] cookbook sre.hosts.downtime [10:29:20] after that [10:29:23] volans: https://phabricator.wikimedia.org/P74378 [10:29:30] because something did remove the downtime [10:29:41] volans: The upgrade script [10:29:50] link? [10:29:55] volans: Sorry, sre.mysql.upgrade [10:30:04] k [10:30:15] I'm looking into icinga-wm in the meantime [10:30:20] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/mysql/upgrade.py#56 [10:30:23] this removes the downtime [10:30:23] So that cookbook downtimes, then I added some additional one via the above, and then sre.mysql.upgrade finished. But why did it page BEFORE it finished? 
[10:30:30] and icinga doesn't have a concept of my downtime or your downtime [10:30:35] (03CR) 10CI reject: [V:04-1] upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [10:30:40] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2192 slowly with 10 steps - Upgrade of db2192.codfw.wmnet completed - fceratto@cumin1002 [10:30:49] marostegui: you can add a wait for icinga green [10:30:50] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.22 refs T386217 [10:30:51] before exiting [10:30:54] volans: It removes all downtime? [10:30:54] T386217: 1.44.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T386217 [10:30:58] so that it doesn't remove the downtime [10:31:01] volans: Not its own created downtime? [10:31:04] no [10:31:07] icinga limitation [10:31:09] :-/ [10:31:09] from the API [10:31:27] that's my cue to say: please consider working on migrating pages to AM [10:31:54] andre: can you ping me when done with the train? [10:32:19] claime: done with deploying. Crossing fingers I don't need to roll back :) [10:32:24] For some reason, the deployment calendar has the train and the infra window start times separated by 1h, but jouncebot got confused I guess? [10:32:44] it's those three weeks of Daylight Confusion Time [10:32:47] yeah [10:32:56] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932 (10Joe) 03NEW [10:32:58] ok well, gonna deploy https://gerrit.wikimedia.org/r/c/1127882/ [10:33:04] godog: ^ [10:33:08] claime: ack [10:33:16] do it! 
:) [10:33:22] (03PS1) 10Marostegui: valid_section.pp: Remove x2 [puppet] - 10https://gerrit.wikimedia.org/r/1130974 (https://phabricator.wikimedia.org/T387332) [10:33:30] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Change kafka topic for rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert) [10:33:35] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10673130 (10Joe) [10:33:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:et-0/0/3 (Core: asw1-b12-drmrs:et-0/0/49 {#D0100}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:34:02] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10673134 (10Joe) [10:34:11] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10673136 (10FCeratto-WMF) Related to https://phabricator.wikimedia.org/T388127 [10:34:18] (03CR) 10Ladsgroup: [C:03+1] "It's crazy we have so many places for valid sections *end of rant*" [puppet] - 10https://gerrit.wikimedia.org/r/1130974 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [10:34:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10673141 (10phaultfinder) [10:35:52] (03CR) 10Marostegui: [C:03+2] valid_section.pp: Remove x2 [puppet] - 10https://gerrit.wikimedia.org/r/1130974 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [10:36:06] (03Merged) 10jenkins-bot: mediawiki: Change kafka topic for rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127882 (https://phabricator.wikimedia.org/T384335) (owner: 10Clément Goubert) 
[10:36:33] godog: if you want to add an additional check to downtime there is also wait_for_downtimed() [10:37:32] !log cgoubert@deploy1003 Started scap sync-world: 1127882: mediawiki: Change kafka topic for rsyslog - T384335 [10:37:36] T384335: Move rsyslog-generated mediawiki logs within k8s to their own kafka topics - https://phabricator.wikimedia.org/T384335 [10:38:30] !log cgoubert@deploy1003 cgoubert: 1127882: mediawiki: Change kafka topic for rsyslog - T384335 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:39:37] !log bounce ircecho on alert1002 - exceptions in journal [10:39:38] (03PS1) 10Marostegui: db2243: Productionize [puppet] - 10https://gerrit.wikimedia.org/r/1130976 (https://phabricator.wikimedia.org/T388684) [10:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:03] volans: I'm checking the code and it looks like downtime() does try to wait for it ? [10:40:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10673179 (10phaultfinder) [10:41:13] (03CR) 10Ladsgroup: db2243: Productionize (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1130976 (https://phabricator.wikimedia.org/T388684) (owner: 10Marostegui) [10:41:25] godog: right, sorry, we do that in spicerack itself, had forgot [10:41:45] (03PS2) 10Marostegui: db2243: Productionize [puppet] - 10https://gerrit.wikimedia.org/r/1130976 (https://phabricator.wikimedia.org/T388684) [10:41:48] (03CR) 10Marostegui: db2243: Productionize (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1130976 (https://phabricator.wikimedia.org/T388684) (owner: 10Marostegui) [10:42:53] (03CR) 10Fabfur: "Late to comment that this will be removed when varnishkafka will not be used anymore (and haproxykafka will become the standard). 
ATM anyw" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130681 (owner: 10BCornwall) [10:43:19] (03CR) 10Ladsgroup: [C:03+1] db2243: Productionize [puppet] - 10https://gerrit.wikimedia.org/r/1130976 (https://phabricator.wikimedia.org/T388684) (owner: 10Marostegui) [10:43:31] (03CR) 10Marostegui: [C:03+2] db2243: Productionize [puppet] - 10https://gerrit.wikimedia.org/r/1130976 (https://phabricator.wikimedia.org/T388684) (owner: 10Marostegui) [10:44:19] godog: I don't have any errors in rsyslog, but I don't see any messages in k8s-mw-eqiad [10:44:27] Anything important going on? I'd like to run a script on Flow boards in dry-run mode [10:44:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T389383 [10:44:35] T389383: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T389383 [10:44:46] zip: deploying some stuff but shouldn't impact running a script [10:44:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2218 with weight 0 T389383', diff saved to https://phabricator.wikimedia.org/P74380 and previous config saved to /var/cache/conftool/dbconfig/20250325-104452-fceratto.json [10:45:09] claime: great, thank you [10:45:21] claime: checking [10:45:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2218 from API/vslow/dump T389383', diff saved to https://phabricator.wikimedia.org/P74381 and previous config saved to /var/cache/conftool/dbconfig/20250325-104526-fceratto.json [10:45:53] claime: if you're deploying, can you let me know when I'm in the clear to deploy a change? [10:45:55] godog: I've only deployed to mw-debug for now [10:46:00] Amir1: sure [10:46:04] Thanks! 
[10:46:15] (03PS2) 10Ayounsi: Add transit/peering in/out port saturation alert - try 2 [alerts] - 10https://gerrit.wikimedia.org/r/1130625 (https://phabricator.wikimedia.org/T384052) [10:46:17] godog: I'm gonna let httpbb run in a loop to generate traffic [10:46:26] claime: ack, makes sense thank you [10:46:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P74382 and previous config saved to /var/cache/conftool/dbconfig/20250325-104630-root.json [10:46:39] (03CR) 10Ayounsi: "Not sure why the test is not passing. Did I get my math (and series) wrong ?" [alerts] - 10https://gerrit.wikimedia.org/r/1130625 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [10:46:57] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 78240 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:47:32] (03PS1) 10Volans: sre.mysql.upgrade: wait to remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1130977 [10:47:53] (03CR) 10Muehlenhoff: [C:03+2] Create insetup role for SRE IF with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1130944 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [10:48:45] (03CR) 10Volans: "Quick suggestion to prevent unwanted pages as a follow up from the IRC chat in operations few moments ago." [cookbooks] - 10https://gerrit.wikimedia.org/r/1130977 (owner: 10Volans) [10:49:37] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2181.codfw.wmnet with reason: Cloning db2243 [10:50:01] godog: maybe the synthetic traffic from httpbb just doesn't generate messages that would have gone to udp_localhost-* ? 
[10:50:10] !log Starting s7 codfw failover from db2220 to db2218 - T389383 [10:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:14] T389383: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T389383 [10:50:48] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10673228 (10phaultfinder) [10:50:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1248 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74384 and previous config saved to /var/cache/conftool/dbconfig/20250325-105049-root.json [10:50:51] claime: that's indeed possible too, I would have expected at least some messages from mw-debug heh [10:51:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2218 to s7 primary T389383', diff saved to https://phabricator.wikimedia.org/P74385 and previous config saved to /var/cache/conftool/dbconfig/20250325-105108-fceratto.json [10:51:12] claime: not sure what's the best way to trigger something on demand, maybe an exception of some kind ? [10:51:18] (03PS2) 10Volans: sre.mysql.upgrade: wait to remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1130977 [10:51:21] godog: prolly fatal.php [10:51:38] (03CR) 10Fabfur: "Ready for review and eventually merge" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [10:52:45] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2181.codfw.wmnet onto db2243.codfw.wmnet [10:53:19] (03PS1) 10Muehlenhoff: Remove urldownloader[12]00[12] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1130978 [10:53:32] claime: ah yeah, what's the easiest way to trigger that ? 
[10:53:46] godog: i don't remember lolsob [10:54:08] proceeding with flow script [10:54:11] (03PS27) 10Fabfur: haproxy: using tmpfs directory for private tls material [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) [10:54:35] (03CR) 10Marostegui: [C:03+1] "Thank you" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130977 (owner: 10Volans) [10:54:45] godog: Ah, I triggered an "Error: password not recognized." error [10:55:29] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [10:56:09] (03CR) 10Btullis: [C:03+2] New ferm rule to permit HDFS data flows and mark as low-prio for qos [puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) (owner: 10Cathal Mooney) [10:57:29] glorious victory /s https://phabricator.wikimedia.org/P74386 [10:58:01] claime: I am not seeing k8s-mw topic still mmhhh [10:58:18] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2220.codfw.wmnet [10:58:28] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2220 - Upgrading db2220.codfw.wmnet - fceratto@cumin1002 [10:58:31] (03CR) 10Muehlenhoff: [C:03+2] Remove urldownloader[12]00[12] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1130978 (owner: 10Muehlenhoff) [10:59:14] (03PS1) 10Marostegui: clone.py: Fixed typo [cookbooks] - 10https://gerrit.wikimedia.org/r/1130980 [10:59:57] (03CR) 10Ladsgroup: [C:03+2] clone.py: Fixed typo [cookbooks] - 10https://gerrit.wikimedia.org/r/1130980 (owner: 10Marostegui) [11:00:18] (03PS2) 10Slyngshede: Alert when mirrors become out of date [alerts] - 10https://gerrit.wikimedia.org/r/1130964 (https://phabricator.wikimedia.org/T350694) [11:00:50] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff 
https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:00:56] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 694820336 and 40 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:01:22] federico3: that is db2220, is that yours? ^ [11:01:42] yes, doing the depooling and the script is failing to commit [11:01:48] Why so? [11:01:51] one sec [11:02:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Configure db2220 T389383', diff saved to https://phabricator.wikimedia.org/P74389 and previous config saved to /var/cache/conftool/dbconfig/20250325-110217-fceratto.json [11:02:21] fixed [11:02:22] T389383: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T389383 [11:02:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P74390 and previous config saved to /var/cache/conftool/dbconfig/20250325-110222-root.json [11:03:24] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) db2220 - Upgrading db2220.codfw.wmnet - fceratto@cumin1002 [11:03:30] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2220 - Upgrading db2220.codfw.wmnet - fceratto@cumin1002 [11:03:45] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db2192 slowly with 10 steps - Upgrade of db2192.codfw.wmnet completed - fceratto@cumin1002 [11:03:47] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.depool (exit_code=97) db2220 - Upgrading db2220.codfw.wmnet - fceratto@cumin1002 [11:04:56] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 57576 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:05:00] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl 
configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:05:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2220 T389383', diff saved to https://phabricator.wikimedia.org/P74392 and previous config saved to /var/cache/conftool/dbconfig/20250325-110505-fceratto.json [11:05:45] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db2220.codfw.wmnet [11:05:48] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2220.codfw.wmnet [11:05:50] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:05:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1248 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74393 and previous config saved to /var/cache/conftool/dbconfig/20250325-110554-root.json [11:05:58] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2220 - Upgrading db2220.codfw.wmnet - fceratto@cumin1002 [11:06:05] (03PS2) 10Fabfur: First proposal to commit vendored dependencies [debs/benthos] - 10https://gerrit.wikimedia.org/r/1130141 (https://phabricator.wikimedia.org/T388261) [11:06:08] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2220 - Upgrading db2220.codfw.wmnet - fceratto@cumin1002 [11:06:26] (03Merged) 10jenkins-bot: clone.py: Fixed typo [cookbooks] - 10https://gerrit.wikimedia.org/r/1130980 (owner: 10Marostegui) [11:06:52] haHA [11:07:03] well I generated an exception [11:07:21] (03PS1) 10Muehlenhoff: Reimage maps-test2001/2002 to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1130982 (https://phabricator.wikimedia.org/T381565) [11:07:23] (03PS1) 10Muehlenhoff: Apply maps master role to maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1130983 (https://phabricator.wikimedia.org/T381565) [11:07:24] (03PS1) 10Muehlenhoff: 
Apply maps/replica role to maps-test2002 [puppet] - 10https://gerrit.wikimedia.org/r/1130984 (https://phabricator.wikimedia.org/T381565) [11:08:32] and yet no k8s-mw-* topics AFAICS ? siiigh [11:08:50] (03CR) 10Slyngshede: Alert when mirrors become out of date (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1130964 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:10:00] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:10:42] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db2192.codfw.wmnet [11:11:17] godog: yeah [11:11:36] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2220.codfw.wmnet [11:11:47] godog: can we see that exception in another topic [11:11:49] is the question [11:12:04] good question indeed [11:12:42] godog: it's in logstash [11:13:03] asadfsd it went to mw-web [11:13:06] which isn't deployed [11:13:36] XWD disengaged [11:13:38] I'm done with my dry-runs [11:13:41] ok, re-engaged [11:13:47] regenerated exceptions [11:13:50] still no topic [11:14:09] )o) [11:14:38] ecs-mediawiki-1-1.11.0-7-2025.13 [11:14:41] is the index [11:14:44] (03PS1) 10Muehlenhoff: Create insetup role for SRE Collab with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1130985 (https://phabricator.wikimedia.org/T389825) [11:14:47] but it should be in syslog somewhere [11:14:48] yeah I'm looking at tags [11:14:50] tags [11:14:50] input-kafka-rsyslog-udp-localhost, rsyslog-udp-localhost, kafka, es, es [11:15:21] claime: I'm looking at id CaEBzZUBF1zH-YDau-xq [11:15:23] to be clear [11:16:08] !log installing Python 3.11 security updates [11:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 10%: Repooling', diff 
saved to https://phabricator.wikimedia.org/P74397 and previous config saved to /var/cache/conftool/dbconfig/20250325-111727-root.json [11:17:51] Next up: any objections if I do a real run that is expected to have this consequence: https://phabricator.wikimedia.org/P74388 [11:18:19] ie moving two Flow pages in fiwikimedia [11:21:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1248 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74398 and previous config saved to /var/cache/conftool/dbconfig/20250325-112059-root.json [11:22:35] !log cgoubert@deploy1003 Sync cancelled. [11:23:30] (03PS1) 10Clément Goubert: mediawiki: Bump Chart.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130987 [11:23:45] zip: go ahead [11:23:47] (03PS4) 10Anzx: knwikisource, tcywikisource: add translate namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130981 (https://phabricator.wikimedia.org/T388955) [11:24:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130981 (https://phabricator.wikimedia.org/T388955) (owner: 10Anzx) [11:25:17] (03PS2) 10Muehlenhoff: Create insetup role for SRE Collab with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1130985 (https://phabricator.wikimedia.org/T389825) [11:26:22] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Bump Chart.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130987 (owner: 10Clément Goubert) [11:27:35] grand, thank you [11:28:45] (03Merged) 10jenkins-bot: mediawiki: Bump Chart.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130987 (owner: 10Clément Goubert) [11:31:29] all done [11:32:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74400 and previous config 
saved to /var/cache/conftool/dbconfig/20250325-113233-root.json [11:33:10] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:33:10] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10673629 (10cmooney) Things are looking good after the application of the change, an-worker nodes are correctly... [11:33:18] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:33:42] !log cgoubert@deploy1003 Started scap sync-world: 1127882: mediawiki: Change kafka topic for rsyslog - T384335 [11:33:46] T384335: Move rsyslog-generated mediawiki logs within k8s to their own kafka topics - https://phabricator.wikimedia.org/T384335 [11:34:28] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:35:38] (03PS1) 10Zoe: Archive user talk pages even if the userpage doesn't exist [extensions/Flow] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1130989 (https://phabricator.wikimedia.org/T380911) [11:35:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10673644 (10phaultfinder) [11:36:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1248 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74401 and previous config saved to /var/cache/conftool/dbconfig/20250325-113604-root.json [11:36:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/Flow] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1130989 
(https://phabricator.wikimedia.org/T380911) (owner: 10Zoe) [11:38:27] !log cgoubert@deploy1003 cgoubert: 1127882: mediawiki: Change kafka topic for rsyslog - T384335 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:39:12] godog: ^^ you should see messages, including at least one exception, on the new topic [11:39:32] tell me if I can proceed with deploying the change to the rest of prod [11:39:54] claime: checking on logstash [11:40:28] tags [11:40:29] input-kafka-k8s, rsyslog-shipper, kafka, es [11:40:42] for an exception i generated [11:40:55] yeah totally, LGTM to go claime [11:41:03] !log cgoubert@deploy1003 cgoubert: Continuing with sync [11:42:34] FYI I'm watching this guy https://grafana.wikimedia.org/goto/5xNkdvTHR?orgId=1 [11:43:26] godog: my kafkacat is flooding so I think it works x) [11:44:13] lolz [11:44:25] PROBLEM - MariaDB Replica Lag: s7 #page on db2182 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3113.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:44:29] PROBLEM - MariaDB Replica Lag: s7 #page on db2150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3117.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:44:30] PROBLEM - MariaDB Replica Lag: s7 #page on db2222 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3117.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:44:40] !incidents [11:44:40] 5787 (UNACKED) db2182 (paged)/MariaDB Replica Lag: s7 (paged) [11:44:40] 5788 (UNACKED) db2222 (paged)/MariaDB Replica Lag: s7 (paged) [11:44:41] 5789 (UNACKED) db2150 (paged)/MariaDB Replica Lag: s7 (paged) [11:44:41] 5786 (RESOLVED) db2142 (paged)/MariaDB Replica IO: ms1 (paged) [11:44:41] PROBLEM - MariaDB Replica Lag: s7 #page on db2221 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3130.37 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:44:41] 5780 (RESOLVED) db2179 (paged)/MariaDB Replica Lag: s4 (paged) [11:44:55] PROBLEM - MariaDB Replica Lag: s7 #page on db2159 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3144.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:44:59] PROBLEM - MariaDB Replica Lag: s7 #page on db2208 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3146.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:45:00] PROBLEM - MariaDB Replica Lag: s7 #page on db2218 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3146.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:45:05] PROBLEM - MariaDB Replica Lag: s7 #page on db2168 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3153.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:45:37] What?? 
[11:45:44] (03PS1) 10Muehlenhoff: Record LDAP access for kevmon [puppet] - 10https://gerrit.wikimedia.org/r/1130994 [11:45:54] federico3: ^ [11:46:05] !incidents [11:46:05] 5787 (ACKED) db2182 (paged)/MariaDB Replica Lag: s7 (paged) [11:46:05] 5788 (ACKED) db2222 (paged)/MariaDB Replica Lag: s7 (paged) [11:46:05] 5789 (ACKED) db2150 (paged)/MariaDB Replica Lag: s7 (paged) [11:46:06] 5790 (ACKED) db2221 (paged)/MariaDB Replica Lag: s7 (paged) [11:46:06] 5791 (ACKED) db2159 (paged)/MariaDB Replica Lag: s7 (paged) [11:46:06] 5792 (ACKED) db2208 (paged)/MariaDB Replica Lag: s7 (paged) [11:46:06] 5793 (ACKED) db2218 (paged)/MariaDB Replica Lag: s7 (paged) [11:46:07] 5794 (ACKED) db2168 (paged)/MariaDB Replica Lag: s7 (paged) [11:46:07] 5786 (RESOLVED) db2142 (paged)/MariaDB Replica IO: ms1 (paged) [11:46:07] 5780 (RESOLVED) db2179 (paged)/MariaDB Replica Lag: s4 (paged) [11:46:19] I acked all pages, all s7 replication lag in codfw [11:46:30] ack jelto [11:47:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74402 and previous config saved to /var/cache/conftool/dbconfig/20250325-114738-root.json [11:47:44] Something wrong with pt-heartbeat I think [11:48:22] !log cgoubert@deploy1003 Finished scap sync-world: 1127882: mediawiki: Change kafka topic for rsyslog - T384335 (duration: 15m 00s) [11:48:26] T384335: Move rsyslog-generated mediawiki logs within k8s to their own kafka topics - https://phabricator.wikimedia.org/T384335 [11:48:28] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:48:29] Amir1: all good, you can deploy [11:48:55] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for kevmon [puppet] - 10https://gerrit.wikimedia.org/r/1130994 (owner: 10Muehlenhoff) 
[11:49:12] I found the issue [11:49:19] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1129310 (https://phabricator.wikimedia.org/T389383) (owner: 10Gerrit maintenance bot) [11:49:48] I spot-checked some of the hosts metrics and can't find any significant replication lag in grafana [11:50:26] I am merging the above patch, which should fix it [11:50:34] ack thanks [11:51:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1248 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74403 and previous config saved to /var/cache/conftool/dbconfig/20250325-115109-root.json [11:52:38] For what is worth, the lag isn't real [11:52:44] (Also codfw is depooled, so no user impact) [11:52:55] RECOVERY - MariaDB Replica Lag: s7 #page on db2159 is OK: OK slave_sql_lag Replication lag: 0.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:52:59] RECOVERY - MariaDB Replica Lag: s7 #page on db2208 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:53:00] RECOVERY - MariaDB Replica Lag: s7 #page on db2218 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:53:05] RECOVERY - MariaDB Replica Lag: s7 #page on db2168 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:53:25] RECOVERY - MariaDB Replica Lag: s7 #page on db2182 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:53:27] !incidents [11:53:28] 5787 (ACKED) db2182 (paged)/MariaDB Replica Lag: s7 (paged) [11:53:28] 5788 (ACKED) db2222 (paged)/MariaDB Replica Lag: s7 (paged) [11:53:28] 5789 (ACKED) db2150 (paged)/MariaDB Replica Lag: s7 (paged) 
[11:53:29] RECOVERY - MariaDB Replica Lag: s7 #page on db2150 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:53:29] 5790 (ACKED) db2221 (paged)/MariaDB Replica Lag: s7 (paged) [11:53:29] 5794 (RESOLVED) db2168 (paged)/MariaDB Replica Lag: s7 (paged) [11:53:29] 5793 (RESOLVED) db2218 (paged)/MariaDB Replica Lag: s7 (paged) [11:53:29] 5792 (RESOLVED) db2208 (paged)/MariaDB Replica Lag: s7 (paged) [11:53:29] RECOVERY - MariaDB Replica Lag: s7 #page on db2222 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:53:30] 5791 (RESOLVED) db2159 (paged)/MariaDB Replica Lag: s7 (paged) [11:53:30] 5786 (RESOLVED) db2142 (paged)/MariaDB Replica IO: ms1 (paged) [11:53:31] 5780 (RESOLVED) db2179 (paged)/MariaDB Replica Lag: s4 (paged) [11:53:41] RECOVERY - MariaDB Replica Lag: s7 #page on db2221 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:54:09] all pages are resolved [11:54:30] (03CR) 10Elukey: [C:03+1] Reimage maps-test2001/2002 to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1130982 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:54:47] (03CR) 10Elukey: [C:03+1] Apply maps master role to maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1130983 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:54:58] (03CR) 10Elukey: [C:03+1] Apply maps/replica role to maps-test2002 [puppet] - 10https://gerrit.wikimedia.org/r/1130984 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:55:11] (03CR) 10Muehlenhoff: [C:03+2] Reimage maps-test2001/2002 to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1130982 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:55:31] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 
10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10673685 (10toni.stoev) >>! In T214998#10552803, @toni.stoev wrote: > Shall a committee be formed? In orde... [11:55:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10673692 (10phaultfinder) [11:58:40] (03PS1) 10Elukey: role::ml_k8s::worker: move ml-serve2006 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1130995 (https://phabricator.wikimedia.org/T387854) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T1200) [12:00:17] sorry I missed all the fun [12:01:21] Amir1: in case it got lost in the alertsauce, I'm done deploying so you can go ahead [12:01:27] Thanks! [12:01:32] (sorry I was in a meeting [12:01:37] np [12:02:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74404 and previous config saved to /var/cache/conftool/dbconfig/20250325-120244-root.json [12:04:57] 10ops-esams, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389874#10673718 (10phaultfinder) [12:05:05] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:07:03] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1111636 (owner: 10Slyngshede) [12:09:26] (03CR) 10Ladsgroup: [C:03+2] Bump thumbnail steps to 40% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130962 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [12:09:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130962 (https://phabricator.wikimedia.org/T360589) 
(owner: 10Ladsgroup) [12:10:17] (03Merged) 10jenkins-bot: Bump thumbnail steps to 40% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130962 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [12:10:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2001.codfw.wmnet with OS bookworm [12:10:45] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1130962|Bump thumbnail steps to 40% (T360589)]] [12:10:49] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [12:10:49] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10673732 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm [12:12:07] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:15:03] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS bullseye [12:15:09] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10673751 (10toni.stoev) >>! In T214998#9990688, @Jdforrester-WMF wrote: > Note for those interested that th... [12:15:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10673755 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by... 
[12:15:50] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 5 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5147/" [puppet] - 10https://gerrit.wikimedia.org/r/1130995 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [12:16:06] (03CR) 10Jelto: "looks mostly good, but I think you have to update the cumin aliases as well https://gerrit.wikimedia.org/r/plugins/gitiles/operations/pupp" [puppet] - 10https://gerrit.wikimedia.org/r/1130985 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [12:16:37] I was looking at our Parsoid log board, and I noticed a bunch of " Error fetching URL "http://localhost:6005/v1/events": (curl error: 52) Server returned nothing (no headers, no data)". these are now not emitted on that board anymore, fitting with the timeframe of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127882 (which makes sense, i *think*) [12:16:40] two questions: [12:17:26] 1/ is that log something i (or someone) should be worried about? 2/ if yes, should it be back on my parsoid log board and how would i do that? 
[12:17:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74405 and previous config saved to /var/cache/conftool/dbconfig/20250325-121749-root.json [12:18:05] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1130962|Bump thumbnail steps to 40% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:18:09] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [12:19:12] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [12:20:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10673767 (10phaultfinder) [12:21:28] !log cgoubert@deploy1003 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply [12:22:03] (03CR) 10Elukey: [V:03+1 C:03+2] role::ml_k8s::worker: move ml-serve2006 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1130995 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [12:22:12] ihurbain: can you link me the parsoid log board please? 
[12:22:33] it should not have changed anything as logstash should still be ingesting the logs [12:23:24] !log cgoubert@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: apply [12:23:27] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: try Supermicro BMC passwords automatically [cookbooks] - 10https://gerrit.wikimedia.org/r/1130140 (https://phabricator.wikimedia.org/T386946) (owner: 10Elukey) [12:23:30] !log cgoubert@deploy1003 helmfile [staging-codfw] START helmfile.d/services/mw-debug: apply [12:23:44] !log cgoubert@deploy1003 helmfile [staging-codfw] DONE helmfile.d/services/mw-debug: apply [12:23:45] claime: https://logstash.wikimedia.org/goto/380e1e3389f1a81e3f864cd71036e1f5 (we may have filters in place that change STUFF on that board, mind you) [12:23:52] (03PS3) 10Muehlenhoff: Create insetup role for SRE Collab with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1130985 (https://phabricator.wikimedia.org/T389825) [12:24:26] (03CR) 10Muehlenhoff: "Good catch, updated the patch." [puppet] - 10https://gerrit.wikimedia.org/r/1130985 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [12:26:40] ihurbain: I have a log for that exact error from a minute ago (adding "curl error: 52") to the top bar search [12:26:45] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130962|Bump thumbnail steps to 40% (T360589)]] (duration: 16m 00s) [12:26:49] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [12:27:45] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2006.codfw.wmnet with OS bookworm [12:28:01] ihurbain: Mar 25, 2025 @ 12:25:57.764 [12:28:08] !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host ml-serve2006 [12:28:24] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [12:28:25] claime: indeed. ignore me, i must have done something fishy with my filters. 
(logstash is not me-friendly ^^;) [12:28:37] it's not friendly in general imo lol [12:29:27] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [12:29:42] (03CR) 10Jelto: [C:03+1] "lgtm now, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1130985 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [12:31:39] so. the answer to the first question is probably mmyeeaah that's probably worth a look (i haven't found it in phab either yet but then i don't trust my phab search either :D ) [12:31:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [12:32:51] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2006 - elukey@cumin1002" [12:34:00] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2006 - elukey@cumin1002" [12:34:00] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:34:00] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache ml-serve2006.codfw.wmnet 115.16.192.10.in-addr.arpa 5.1.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:34:04] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-serve2006.codfw.wmnet 115.16.192.10.in-addr.arpa 5.1.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:34:04] !log elukey@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve2006 [12:34:46] (03PS1) 10Klausman: role::ml_k8s::worker: move ml-serve2010 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131001 (https://phabricator.wikimedia.org/T387854) [12:35:42] !log 
elukey@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve2006 [12:35:42] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve2006 [12:36:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext releases routed via main (k8s) 1.091s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:36:55] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1131001 (https://phabricator.wikimedia.org/T387854) (owner: 10Klausman) [12:37:14] (03CR) 10Klausman: [V:03+2 C:03+2] role::ml_k8s::worker: move ml-serve2010 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131001 (https://phabricator.wikimedia.org/T387854) (owner: 10Klausman) [12:37:58] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-ml-serve_31443: Servers ml-serve2006.codfw.wmnet are marked down but pooled: inference_30443: Servers ml-serve2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:38:10] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-ml-serve_31443: Servers ml-serve2006.codfw.wmnet are marked down but pooled: inference_30443: Servers ml-serve2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:39:20] I am reimaging it --^ [12:39:32] it didn't trigger any alert the other times, weird [12:41:10] elukey: you should depool the servers. maybe you did the last time? 
[12:41:13] the backend ones [12:41:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext releases routed via main (k8s) 1.091s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:41:32] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10673820 (10BTullis) 05Open→03Resolved a:03BTullis [12:42:04] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2010.codfw.wmnet with OS bookworm [12:42:26] !log klausman@cumin2002 START - Cookbook sre.hosts.move-vlan for host ml-serve2010 [12:42:26] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve2010 [12:44:11] sukhe: I didn't, they are totally depooled from k8s but not lvs, I'll add the depool flag next time [12:46:42] PROBLEM - BGP status on lsw1-c2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:49:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2001.codfw.wmnet with OS bookworm [12:49:57] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10673869 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm completed: - maps-test2001 (**PASS**)... 
[12:50:28] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:55:17] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2006.codfw.wmnet with reason: host reimage [12:55:27] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2010.codfw.wmnet with reason: host reimage [12:56:16] (03PS1) 10DLynch: Edit check: add editcheck-references-shown to the allowed tags list [extensions/VisualEditor] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131003 (https://phabricator.wikimedia.org/T373949) [12:56:26] (03PS1) 10DLynch: Edit check: don't close the sidebar on context change on desktop [extensions/VisualEditor] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131004 (https://phabricator.wikimedia.org/T389906) [12:56:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/VisualEditor] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131003 (https://phabricator.wikimedia.org/T373949) (owner: 10DLynch) [12:56:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/VisualEditor] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131004 (https://phabricator.wikimedia.org/T389906) (owner: 10DLynch) [12:57:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2006.codfw.wmnet with reason: host reimage [12:58:11] (03CR) 10Jelto: [C:03+1] "looks like a reasonable workaround" [puppet] - 10https://gerrit.wikimedia.org/r/1128467 (https://phabricator.wikimedia.org/T388388) (owner: 10JMeybohm) [13:00:05] 
Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T1300) [13:00:05] phuedx, MatmaRex, Daimona, tgr, anzx, zip, and kemayo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] o/ [13:00:14] o/ [13:00:17] o/ [13:00:40] * TheresNoTime is in a meeting and can't deploy today, sorry! [13:00:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2002.codfw.wmnet with OS bookworm [13:00:44] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2010.codfw.wmnet with reason: host reimage [13:00:58] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10673918 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps-test2002.codfw.wmnet with OS bookworm [13:01:17] hi [13:01:38] hi hi [13:02:33] o/ [13:07:28] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 36, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:07:51] elukey: no worries but yeah, that's what pybal is complaining about [13:07:53] any deployers around? i don't have access myself [13:09:21] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10673952 (10elukey) For record keeping, afaics this module is used by w... [13:09:43] * phuedx wonders if they still have access [13:10:04] Technically I have access, but I have never actually run a deployment window, so that might be a last resort. 
[13:10:28] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:10:32] (03PS6) 10Slyngshede: Upgrade CAS to version 7.1.4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1111636 [13:11:28] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 36, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:13:47] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10673956 (10elukey) @Volans @dcaro from what I can see the repo got a n... [13:14:08] 10ops-codfw, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10673957 (10Marostegui) @Papaul @Jhancock.wm we want to test the new supermicro controller in databases. Could you just pull out a disk and let me know when that has happened. You can do... [13:14:11] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2006.codfw.wmnet with OS bookworm [13:14:51] sounds like maybe it's not happening? [13:15:43] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2010.codfw.wmnet with OS bookworm [13:15:44] RECOVERY - BGP status on lsw1-c2-codfw.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:15:46] I have access. 
I haven't done this in a while but it shouldn't be so bad – especially as we have scap backport [13:16:57] MatmaRex: Yours first [13:17:27] (03CR) 10Arnaudb: [C:03+1] Create insetup role for SRE Collab with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1130985 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [13:17:42] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply updated master config - bking@cumin2002 - T388150 [13:17:47] T388150: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150 [13:17:56] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply updated master config - bking@cumin2002 - T388150 [13:18:10] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply updated master config - bking@cumin2002 - T388150 [13:18:13] phuedx: thanks [13:18:32] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply updated master config - bking@cumin2002 - T388150 [13:18:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130596 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) [13:19:01] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply updated master config - bking@cumin2002 - T388150 [13:19:04] phuedx: you can probably do several patches at once to speed up 
the process [13:19:11] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10673977 (10dcaro) >>! In T354410#10673956, @elukey wrote: > @Volans @d... [13:19:20] (03CR) 10Marostegui: [C:03+1] clone.py, clone_test.py: Check if the target host is known to dbctl [cookbooks] - 10https://gerrit.wikimedia.org/r/1127071 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [13:19:37] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2181.codfw.wmnet onto db2243.codfw.wmnet [13:19:43] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [13:19:58] MatmaRex: Noted. I should have asked if yours could be done together [13:22:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [13:23:17] (03CR) 10Slyngshede: Upgrade CAS to version 7.1.4 (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1111636 (owner: 10Slyngshede) [13:24:22] MatmaRex: OIC +2 a bunch of patches so that we don't have to wait on merging [13:25:11] (03PS1) 10Klausman: role::ml_k8s::worker: move ml-serv2009 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131005 (https://phabricator.wikimedia.org/T387854) [13:26:08] phuedx: that too, but i'm not sure how doing that interacts with scap backport [13:26:13] (03CR) 10Klausman: [V:03+2 C:03+2] role::ml_k8s::worker: move ml-serv2009 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131005 (https://phabricator.wikimedia.org/T387854) (owner: 10Klausman) [13:26:39] MatmaRex: Yeah. Neither am I. I think I'll play it safe since it's been a while [13:26:50] I'll try to get through as many as I can. 
Please bear with me [13:27:47] (03PS21) 10Tiziano Fogli: sre.puppet.sync-netbox-hiera: add data::pdus to exports [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) [13:29:29] (03CR) 10Ssingh: [C:03+1] "Looks good. I know you have the two host overrides but still recommend disabling Puppet in A:cp and trying a few hosts before rolling it o" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [13:30:25] FIRING: [3x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:31:24] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2009.codfw.wmnet with OS bookworm [13:31:26] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1011 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:31:45] !log klausman@cumin2002 START - Cookbook sre.hosts.move-vlan for host ml-serve2009 [13:31:46] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve2009 [13:32:26] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1011 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:33:34] (03Merged) 10jenkins-bot: Restore deprecated aliases for CommentStoreComment and RawMessage [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130596 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) [13:33:57] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1130596|Restore deprecated aliases for 
CommentStoreComment and RawMessage (T388725)]] [13:34:01] T388725: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T388725 [13:34:38] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1011 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:35:15] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1008.eqiad.wmnet with OS bullseye [13:35:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10674035 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bkin... [13:35:52] PROBLEM - BGP status on lsw1-b7-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:35:56] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10674046 (10Volans) This is a great news! With our current pinning in s... 
[13:36:12] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10674050 (10Marostegui) [13:37:04] (03CR) 10Muehlenhoff: [C:03+2] Create insetup role for SRE Collab with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1130985 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [13:38:12] zip, Kemayo: Looking at the amount of time this is taking and the amount of patches in the window, I don't think I'm going to get around to yours. Can you reschedule? [13:39:01] aw [13:39:02] sure [13:39:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2002.codfw.wmnet with OS bookworm [13:40:03] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10674081 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps-test2002.codfw.wmnet with OS bookworm completed: - maps-test2002 (**PASS**)... [13:40:33] !log phuedx@deploy1003 phuedx, matmarex: Backport for [[gerrit:1130596|Restore deprecated aliases for CommentStoreComment and RawMessage (T388725)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:40:37] T388725: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T388725 [13:40:52] there aren't any deployments scheduled for several hours after now, so you could probably just put your own window on the calendar and then run it. 
Kemayo, zip [13:41:00] MatmaRex: Please check on the test servers [13:41:19] phuedx: seems good [13:41:26] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1011 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:41:28] Ack [13:41:28] (03PS3) 10Ayounsi: Add transit/peering in/out port saturation alert - try 2 [alerts] - 10https://gerrit.wikimedia.org/r/1130625 (https://phabricator.wikimedia.org/T384052) [13:41:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10674097 (10bking) 05Resolved→03In progress a:05Jclark-ctr→03bking [13:41:34] (03PS2) 10Esanders: Enable DiscussionTools auto subscriptions for all interfaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076737 (https://phabricator.wikimedia.org/T290778) [13:41:36] !log phuedx@deploy1003 phuedx, matmarex: Continuing with sync [13:42:26] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1011 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:42:42] phuedx, zip, MatmaRex : yeah, if we run out of time I do feel more comfortable getting my own patches later compared to running the whole window. [13:43:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10674109 (10bking) This host is still failing to reimage. I'm going to reopen/grab this... [13:43:12] (03CR) 10Ayounsi: "Thanks Tiziano for the help !" 
[alerts] - 10https://gerrit.wikimedia.org/r/1130625 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [13:44:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74406 and previous config saved to /var/cache/conftool/dbconfig/20250325-134426-root.json [13:44:32] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2009.codfw.wmnet with reason: host reimage [13:44:36] (03CR) 10Fabfur: "actually, with the latest patchsets the only override is for cp4047. Anyway, I'll disable puppet on A:cp and test also hosts that shouldn'" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [13:44:38] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1011 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:45:07] phuedx: do you have the time to do another backport? we'll probably overrun the window, judging by how long this one took, so it's up to you [13:45:19] thanks for stepping up to do the deploys in the first place btw :) [13:45:25] RESOLVED: [3x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:07] MatmaRex: I've got time [13:46:13] Doesn't help that this is the fullest window I've seen in a while. 
:D [13:46:23] :D [13:46:44] !log disable puppet on A:cp to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129223 (T384227) [13:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:49] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227 [13:47:18] (03CR) 10Fabfur: [C:03+2] haproxy: using tmpfs directory for private tls material [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [13:47:31] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2009.codfw.wmnet with reason: host reimage [13:47:34] (03PS1) 10Bking: cloudelastic: switch cloudelastic1008 to EFI [puppet] - 10https://gerrit.wikimedia.org/r/1131010 (https://phabricator.wikimedia.org/T388150) [13:47:54] Almost done. The config changes go faster, right?! D: [13:47:56] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply updated master config - bking@cumin2002 - T388150 [13:48:00] T388150: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150 [13:48:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131010 (https://phabricator.wikimedia.org/T388150) (owner: 10Bking) [13:48:15] phuedx: in that case, can we do both 1130648 and 1130752 (tgr's) in one batch? 
they should not affect each other, and i don't expect that we'll need to revert either after the debug server testing [13:48:34] phuedx: i don't think they go faster any more… not with kubernetes deploys D: [13:48:44] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008.eqiad.wmnet'] [13:48:45] well, i guess the CI job is faster [13:48:52] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2220 gradually with 4 steps - Pooling in after OS upgrade [13:48:55] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db2220 gradually with 4 steps - Pooling in after OS upgrade [13:49:25] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2220 gradually with 4 steps - Pooling in after OS upgrade [13:49:36] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130596|Restore deprecated aliases for CommentStoreComment and RawMessage (T388725)]] (duration: 15m 39s) [13:49:40] T388725: PHP Warning: Class __PHP_Incomplete_Class has no unserializer - https://phabricator.wikimedia.org/T388725 [13:49:44] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10674127 (10elukey) @dcaro have you tried with the current setup.py's c... [13:50:37] (i generally feel like we should be deploying multiple changes at once most of the time… given that the deploys take half an hour each, and we can't do them in parallel, and we have so little time every day scheduled for them) [13:50:51] tgr_: Are you around / happy with MatmaRex's proposal? [13:51:09] i think he's in a meeting or something, but we talked earlier [13:51:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host phab1005.eqiad.wmnet [13:52:25] oh, and you're still waiting to deploy your own change. 
i'm sure that can tag along too [13:52:36] (03PS1) 10Ayounsi: CloudCoreBGPDown: set severity to critical + scope network [alerts] - 10https://gerrit.wikimedia.org/r/1131011 (https://phabricator.wikimedia.org/T388641) [13:52:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [13:52:44] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ... [13:52:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [13:53:13] I'll get yours done. Hopefully tgr_'s meeting load isn't terrible and they'll be around by the time it ends. When they're around, I'll bundle theirs with mine. Cool? [13:53:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 823.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:53:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130648 (https://phabricator.wikimedia.org/T388165) (owner: 10Bartosz Dziewoński) [13:54:09] sure. 
thanks [13:55:17] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudelastic1008.eqiad.wmnet'] [13:57:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2001.codfw.wmnet with OS bookworm [13:57:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:58:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:58:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 847ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:58:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host phab1005.eqiad.wmnet [13:58:37] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674167 (10Jhancock.wm) [13:59:09] 10ops-codfw, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389913#10674170 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm alert already cleared [13:59:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org [13:59:33] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74408 and previous config saved to /var/cache/conftool/dbconfig/20250325-135931-root.json [14:00:39] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10674173 (10dcaro) I'm installing wmcs-cookbooks, that should bring the... [14:00:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10674174 (10Jclark-ctr) [14:01:43] (03PS1) 10Arturo Borrero Gonzalez: openstack: rename lan-flat-cloudinstances2b to VLAN/legacy [puppet] - 10https://gerrit.wikimedia.org/r/1131013 (https://phabricator.wikimedia.org/T389942) [14:02:24] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008.eqiad.wmnet'] [14:02:51] (03CR) 10Gehel: WIP: wdqs: Add alerts for no lag metrics reported (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1130730 (https://phabricator.wikimedia.org/T389859) (owner: 10Bking) [14:02:53] RECOVERY - BGP status on lsw1-b7-codfw.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:03:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10674184 (10Jclark-ctr) @bking these are finished except relforge1010 we are waiting on response from supermicro [14:03:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:03:41] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2009.codfw.wmnet with OS bookworm [14:04:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2003.wikimedia.org [14:05:14] (03CR) 10Bking: [C:03+2] "self-merging: trivial change and time sensitive" [puppet] - 10https://gerrit.wikimedia.org/r/1131010 (https://phabricator.wikimedia.org/T388150) (owner: 10Bking) [14:05:45] (03Merged) 10jenkins-bot: Fully silence TRX profiler after autocreation [core] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1130648 (https://phabricator.wikimedia.org/T388165) (owner: 10Bartosz Dziewoński) [14:06:10] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1130648|Fully silence TRX profiler after autocreation (T388165)]] [14:06:15] T388165: "Expectation not met" warnings during SUL autologin autocreation - https://phabricator.wikimedia.org/T388165 [14:09:01] !log rebooting cp4047 (T384227) [14:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:06] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227 [14:10:09] (03PS3) 10Filippo Giunchedi: Alert when mirrors become out of date [alerts] - 10https://gerrit.wikimedia.org/r/1130964 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:10:34] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, I have renamed the alerts to not include crit/warn in the name as per best practice" [alerts] - 10https://gerrit.wikimedia.org/r/1130964 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:11:53] PROBLEM - Restbase root url on restbase1028 is CRITICAL: connect to address 10.64.0.208 and 
port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [14:12:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008.eqiad.wmnet'] [14:12:22] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008.eqiad.wmnet'] [14:12:47] !log phuedx@deploy1003 phuedx, matmarex: Backport for [[gerrit:1130648|Fully silence TRX profiler after autocreation (T388165)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:12:51] T388165: "Expectation not met" warnings during SUL autologin autocreation - https://phabricator.wikimedia.org/T388165 [14:13:02] MatmaRex: Could you test your change please? [14:13:18] tgr_: Are you still in your meeting? [14:13:39] phuedx: seems okay [14:13:44] !log phuedx@deploy1003 phuedx, matmarex: Continuing with sync [14:13:51] Ack [14:14:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:14:15] (03CR) 10Thcipriani: [C:03+1] "Matches group ownership of MediaWiki deployment directories and other permissions related to Wikimedia deployment." [puppet] - 10https://gerrit.wikimedia.org/r/1130715 (owner: 10Ahmon Dancy) [14:14:17] we'll know for sure once the warnings stop appearing in production logs [14:14:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74410 and previous config saved to /var/cache/conftool/dbconfig/20250325-141437-root.json [14:14:40] Daimona, anzx: I can go a little long and get yours deployed together as they seem very low risk config changes. Can you stick around for a while? 
[14:15:02] (03CR) 10Thcipriani: [C:03+1] data.yaml: Allow journalctl access to spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1130138 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [14:15:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:15:20] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10674232 (10RobH) Ok, now they want me to flash the NVMe SSDs so I'll be doing this later today and we'll see if that eliminates the issue. [14:15:33] I'm in a meeting, so not actively looking here, but I'll be around for the rest of the day [14:15:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2255.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:16:21] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4047 is OK: HTTP OK: HTTP/1.0 200 OK - 35775 bytes in 7.523 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:16:21] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4047 is OK: HTTP OK: HTTP/1.1 200 OK - 47623 bytes in 7.547 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:17:03] RECOVERY - Ensure traffic_server is running for instance backend on cp4047 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:18:07] Daimona: OK. Please reschedule. I'll reschedule mine also and yield to Kemayo. 
I have meetings to get to soon [14:18:18] *fairly soon [14:18:35] Sure [14:18:39] FIRING: [4x] CoreBGPDown: Core BGP session down between asw1-b12-drmrs and cr1-drmrs (185.15.58.142) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:18:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2255.codfw.wmnet with OS bookworm [14:18:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674237 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2255.codfw.wmnet with... [14:18:53] uh? [14:19:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:19:52] phuedx: I'm good to get started, then? [14:20:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008.eqiad.wmnet'] [14:20:14] Kemayo: Almost. Just waiting on the k8s part of the deployment to finish [14:20:37] (03PS1) 10Samtar: InitialiseSettings-labs: wgTemplateDataEnableDiscovery on beta.enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131016 (https://phabricator.wikimedia.org/T377975) [14:20:44] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS bullseye [14:20:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10674243 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reima... 
[14:21:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 975.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:22:33] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130648|Fully silence TRX profiler after autocreation (T388165)]] (duration: 16m 22s) [14:22:37] T388165: "Expectation not met" warnings during SUL autologin autocreation - https://phabricator.wikimedia.org/T388165 [14:22:50] jouncebot: now and next [14:22:50] No deployments scheduled for the next 0 hour(s) and 37 minute(s) [14:23:01] !log UTC afternoon backport window finished [14:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:27] Kemayo: All yours [14:23:34] phuedx: Thanks! 
[14:24:29] thanks phuedx [14:25:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2259.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:25:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 7.555% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:25:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2259.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:25:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2259.codfw.wmnet with OS bookworm [14:26:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674274 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2259.codfw.wmnet with... 
[14:26:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 813.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:26:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2262.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:27:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2262.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:27:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2262.codfw.wmnet with OS bookworm [14:27:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674278 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2262.codfw.wmnet with... [14:28:12] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10674281 (10Jelto) >>! In T378922#10624358, @jcrespo wrote: > > Can we setup a meeting (E.g Jelto, Matthew and I) focusing on *req... 
[14:28:22] (03CR) 10Btullis: Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) (owner: 10Brouberol) [14:29:12] (03PS2) 10Scott French: mw-*: normalize the next and migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129959 (https://phabricator.wikimedia.org/T383845) [14:29:20] !log Impromptu Editing backport window started [14:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74412 and previous config saved to /var/cache/conftool/dbconfig/20250325-142942-root.json [14:30:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2255.codfw.wmnet with reason: host reimage [14:31:11] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:31:23] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131003 (https://phabricator.wikimedia.org/T373949) (owner: 10DLynch) [14:31:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131004 (https://phabricator.wikimedia.org/T389906) (owner: 10DLynch) [14:31:55] (03CR) 10Btullis: [C:03+1] "Cool, thanks." 
[alerts] - 10https://gerrit.wikimedia.org/r/1130952 (https://phabricator.wikimedia.org/T389762) (owner: 10Brouberol) [14:33:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:33:23] (03Merged) 10jenkins-bot: Edit check: add editcheck-references-shown to the allowed tags list [extensions/VisualEditor] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131003 (https://phabricator.wikimedia.org/T373949) (owner: 10DLynch) [14:33:28] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:33:32] looking [14:33:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2255.codfw.wmnet with reason: host reimage [14:33:36] !incidents [14:33:36] 5795 (UNACKED) NELHigh sre (thanos-rule tcp.timed_out) [14:33:37] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti-test2001.codfw.wmnet with OS bookworm [14:33:37] 5790 (RESOLVED) db2221 (paged)/MariaDB Replica Lag: s7 (paged) [14:33:37] 5788 (RESOLVED) db2222 (paged)/MariaDB Replica Lag: s7 (paged) [14:33:37] 5787 (RESOLVED) db2182 (paged)/MariaDB Replica Lag: s7 (paged) [14:33:37] 5789 (RESOLVED) db2150 (paged)/MariaDB Replica Lag: s7 (paged) [14:33:37] 5794 (RESOLVED) db2168 (paged)/MariaDB Replica Lag: s7 (paged) [14:33:38] 5793 (RESOLVED) db2218 (paged)/MariaDB Replica Lag: s7 (paged) [14:33:38] 5792 (RESOLVED) db2208 (paged)/MariaDB Replica Lag: s7 (paged) [14:33:39] 5791 (RESOLVED) db2159 (paged)/MariaDB Replica Lag: s7 (paged) [14:33:39] 5786 (RESOLVED) db2142 (paged)/MariaDB Replica 
IO: ms1 (paged) [14:33:39] FIRING: TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (2001:12f8::221:197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [14:33:39] 5780 (RESOLVED) db2179 (paged)/MariaDB Replica Lag: s4 (paged) [14:33:46] !ack 5790 [14:33:46] Attempt to ack incident 5790 failed. [14:34:00] !ack 5795 [14:34:01] 5795 (ACKED) NELHigh sre (thanos-rule tcp.timed_out) [14:34:14] !incidents [14:34:14] 5795 (ACKED) NELHigh sre (thanos-rule tcp.timed_out) [14:34:14] 5790 (RESOLVED) db2221 (paged)/MariaDB Replica Lag: s7 (paged) [14:34:15] 5788 (RESOLVED) db2222 (paged)/MariaDB Replica Lag: s7 (paged) [14:34:15] 5787 (RESOLVED) db2182 (paged)/MariaDB Replica Lag: s7 (paged) [14:34:15] 5789 (RESOLVED) db2150 (paged)/MariaDB Replica Lag: s7 (paged) [14:34:15] 5794 (RESOLVED) db2168 (paged)/MariaDB Replica Lag: s7 (paged) [14:34:15] 5793 (RESOLVED) db2218 (paged)/MariaDB Replica Lag: s7 (paged) [14:34:16] 5792 (RESOLVED) db2208 (paged)/MariaDB Replica Lag: s7 (paged) [14:34:16] 5791 (RESOLVED) db2159 (paged)/MariaDB Replica Lag: s7 (paged) [14:34:17] 5786 (RESOLVED) db2142 (paged)/MariaDB Replica IO: ms1 (paged) [14:34:17] 5780 (RESOLVED) db2179 (paged)/MariaDB Replica Lag: s4 (paged) [14:35:02] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti-test2001.codfw.wmnet'] [14:35:30] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2220 gradually with 4 steps - Pooling in after OS upgrade [14:35:39] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: move prometheus k8s instances off prometheus2006 [puppet] - 10https://gerrit.wikimedia.org/r/1129173 
(https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:35:52] jelto, sukhe, tappof, topranks: IX.BR issue [14:36:02] thanks [14:36:06] sorry meeting [14:36:17] (03PS1) 10Hubaishan: Allow arwikisource bureaucrat to manage "import" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131018 (https://phabricator.wikimedia.org/T389952) [14:36:34] some peers there (including HE) are down, but not all. The alerts were during the BGP convergence, alerts should clear up [14:36:40] peers are still down, but stable [14:36:53] okay, so the alert FIRING: TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (2001:12f8::221:197 is related [14:36:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10674346 (10bking) I've updated iDRAC, NIC, and BIOS to the latest firmware version, but... [14:37:00] (03CR) 10Clément Goubert: [C:03+1] mw-*: normalize the next and migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129959 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [14:37:27] traffic levels gone insane [14:37:29] jelto: yeah brand new alert :) [14:37:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2259.codfw.wmnet with reason: host reimage [14:38:16] problem on the fabric [14:38:18] ? 
[14:38:57] topranks: broadcast storm: https://librenms.wikimedia.org/graphs/to=1742913300/id=31635/type=port_nupkts/from=1742891700/ [14:39:22] !log move k8s instances from prometheus2006 to prometheus2008 - T383232 [14:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:26] T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232 [14:39:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2262.codfw.wmnet with reason: host reimage [14:39:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1111636 (owner: 10Slyngshede) [14:39:36] thanks, yeah that's exactly what I was wondering [14:39:50] seems to have levelled off now [14:40:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2259.codfw.wmnet with reason: host reimage [14:40:12] (03CR) 10Brouberol: [C:03+2] Add monitoring over the mediawiki dumps legacy CephFS PVC available space [alerts] - 10https://gerrit.wikimedia.org/r/1130952 (https://phabricator.wikimedia.org/T389762) (owner: 10Brouberol) [14:40:17] !log enable puppet on A:cp (T384227) [14:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:22] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227 [14:41:18] FIRING: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from BR) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [14:41:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130659 (https://phabricator.wikimedia.org/T386428) (owner: 10Daimona Eaytoy) [14:43:20] (03Merged) 10jenkins-bot: Edit check: don't close the sidebar on context change on desktop [extensions/VisualEditor] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131004 (https://phabricator.wikimedia.org/T389906) (owner: 10DLynch) [14:43:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2262.codfw.wmnet with reason: host reimage [14:43:46] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1131003|Edit check: add editcheck-references-shown to the allowed tags list (T373949)]], [[gerrit:1131004|Edit check: don't close the sidebar on context change on desktop (T389906)]] [14:43:51] T373949: Clarify the meaning of the editcheck-references-activated tag - https://phabricator.wikimedia.org/T373949 [14:43:51] T389906: [Regression] Broken workflow after closing the "Add a citation" dialog prompted from a check and then trying to save the edit - https://phabricator.wikimedia.org/T389906 [14:44:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74414 and previous config saved to /var/cache/conftool/dbconfig/20250325-144447-root.json [14:45:04] XioNoX, topranks should we do anything else besides waiting until BGP catches up in magru? NELs still look a bit high in logstash as far as I can tell [14:46:18] RESOLVED: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from BR) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [14:46:27] jelto: the spike seems to be over. If it still hasn't cleared in 5/10min we can.... 
well here it is ^ [14:47:18] ack :) then let's also wait for the NELHigh pag.e which hopefully resolves as well [14:47:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [14:47:44] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ... [14:47:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:48:43] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:49:28] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:49:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:49:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2255.codfw.wmnet with OS bookworm [14:49:52] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674397 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2255.codfw.wmnet with OS... 
[14:50:18] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1131003|Edit check: add editcheck-references-shown to the allowed tags list (T373949)]], [[gerrit:1131004|Edit check: don't close the sidebar on context change on desktop (T389906)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:50:23] T373949: Clarify the meaning of the editcheck-references-activated tag - https://phabricator.wikimedia.org/T373949 [14:50:24] T389906: [Regression] Broken workflow after closing the "Add a citation" dialog prompted from a check and then trying to save the edit - https://phabricator.wikimedia.org/T389906 [14:52:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2272.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:53:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:53:22] !log kemayo@deploy1003 kemayo: Continuing with sync [14:53:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2272.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:53:25] XioNoX: I was adding the packet-type stats to grafana there cos it was only showing the overall [14:53:34] seems it was a mixture of broadcast and unicast [14:53:34] https://grafana.wikimedia.org/goto/WrVCfvTNR [14:53:39] RESOLVED: TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (2001:12f8::221:197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [14:53:56] 
nice [14:54:00] !incidents [14:54:01] 5795 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [14:54:01] 5790 (RESOLVED) db2221 (paged)/MariaDB Replica Lag: s7 (paged) [14:54:01] 5788 (RESOLVED) db2222 (paged)/MariaDB Replica Lag: s7 (paged) [14:54:02] 5787 (RESOLVED) db2182 (paged)/MariaDB Replica Lag: s7 (paged) [14:54:02] 5789 (RESOLVED) db2150 (paged)/MariaDB Replica Lag: s7 (paged) [14:54:02] 5794 (RESOLVED) db2168 (paged)/MariaDB Replica Lag: s7 (paged) [14:54:02] 5793 (RESOLVED) db2218 (paged)/MariaDB Replica Lag: s7 (paged) [14:54:03] 5792 (RESOLVED) db2208 (paged)/MariaDB Replica Lag: s7 (paged) [14:54:03] 5791 (RESOLVED) db2159 (paged)/MariaDB Replica Lag: s7 (paged) [14:54:03] 5786 (RESOLVED) db2142 (paged)/MariaDB Replica IO: ms1 (paged) [14:54:04] 5780 (RESOLVED) db2179 (paged)/MariaDB Replica Lag: s4 (paged) [14:54:05] kind of strange tbh, I guess maybe the broadcasts knocked everyone's BGP sessions off causing a flood of unicast reconnections or something [14:54:15] yeah I'd guess [14:54:35] (03CR) 10Muehlenhoff: [C:03+2] Apply maps master role to maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1130983 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:54:52] the bigger limits we have for ARPs now on that port probably don't help, I see we missed some snmp polls [14:54:58] not sure what else we can do there though [14:55:27] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:55:56] jelto: just fyi with this type of thing the NELs should settle down in a few mins anyway - even if our BGP sessions at IX.BR stayed down [14:56:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:56:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host wikikube-worker2259.codfw.wmnet with OS bookworm [14:56:04] (03CR) 10Brouberol: Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) (owner: 10Brouberol) [14:56:15] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2259.codfw.wmnet with OS... [14:56:36] NEL's are unavoidable when the traffic path takes a hit, but should settle down once it decides on another one (even if the original links remain down etc) [14:56:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti-test2001.codfw.wmnet'] [14:57:00] (03CR) 10Brouberol: Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) (owner: 10Brouberol) [14:57:40] okay thanks for the additional context , makes sense to me [14:57:48] (03PS4) 10Brouberol: Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) [14:58:00] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host atlas5001.wikimedia.org [14:58:02] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [14:58:16] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:58:25] (03CR) 10Btullis: [C:03+1] Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) (owner: 10Brouberol) [14:58:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:58:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2262.codfw.wmnet with OS bookworm [14:58:42] .37 [14:58:43] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674450 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2262.codfw.wmnet with OS... [14:58:46] god dammit finger [14:59:29] (03CR) 10Brennen Bearnes: [C:03+1] Phabricator: Remove unused fixed_settings.yaml stuff; update README [puppet] - 10https://gerrit.wikimedia.org/r/1130323 (https://phabricator.wikimedia.org/T239355) (owner: 10Aklapper) [14:59:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129270 (owner: 10Phuedx) [14:59:49] (03CR) 10Brouberol: [C:03+2] Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130642 (https://phabricator.wikimedia.org/T389762) (owner: 10Brouberol) [15:00:04] jelto, arnoldokoth, and mutante: I, the Bot under the Fountain, call upon thee, The Deployer, to do SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T1500). 
[15:00:07] (03PS1) 10Elukey: role::ml_k8s::worker: move ml-serve2007 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131022 (https://phabricator.wikimedia.org/T387854) [15:00:24] !log finished moving k8s instances to prometheus2008 - T383232 [15:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:28] T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232 [15:01:09] 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Bring relforge100[89] into production - https://phabricator.wikimedia.org/T389957 (10bking) 03NEW [15:01:16] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131003|Edit check: add editcheck-references-shown to the allowed tags list (T373949)]], [[gerrit:1131004|Edit check: don't close the sidebar on context change on desktop (T389906)]] (duration: 17m 30s) [15:01:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:01:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2264.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:01:22] T373949: Clarify the meaning of the editcheck-references-activated tag - https://phabricator.wikimedia.org/T373949 [15:01:22] T389906: [Regression] Broken workflow after closing the "Add a citation" dialog prompted from a check and then trying to save the edit - https://phabricator.wikimedia.org/T389906 [15:01:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2265.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:01:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2264.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:01:30] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for acooper - 
https://phabricator.wikimedia.org/T389924#10674483 (10Scott_French) 05Open→03Resolved a:03Scott_French Followed up with @acooper out of band: In short, the guidance in the self-service access request flow in Bitu prompts th... [15:01:40] (03PS1) 10Bking: relforge: bring new hosts online [puppet] - 10https://gerrit.wikimedia.org/r/1131023 (https://phabricator.wikimedia.org/T389957) [15:01:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2264.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:01:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:01:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2263.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:01:57] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [15:02:00] !log Impromptu Editing backport window finished [15:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:30] (03PS1) 10Giuseppe Lavagetto: external_clouds_vendors: add hetzner and netcup [puppet] - 10https://gerrit.wikimedia.org/r/1131024 [15:03:34] (03CR) 10Elukey: [C:03+2] role::ml_k8s::worker: move ml-serve2007 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131022 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [15:03:48] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas5001.wikimedia.org - ayounsi@cumin1002" [15:03:53] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas5001.wikimedia.org - ayounsi@cumin1002" [15:03:53] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:03:53] !log 
ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache atlas5001.wikimedia.org on all recursors [15:03:57] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas5001.wikimedia.org on all recursors [15:04:01] (03PS1) 10Clément Goubert: alertmanager: Add mediawiki-platform-task [puppet] - 10https://gerrit.wikimedia.org/r/1131025 (https://phabricator.wikimedia.org/T385709) [15:04:22] (03PS1) 10Ahmon Dancy: P:idp Limit groups sent from CAS to Spiderpig [puppet] - 10https://gerrit.wikimedia.org/r/1131026 (https://phabricator.wikimedia.org/T389869) [15:04:23] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas5001.wikimedia.org - ayounsi@cumin1002" [15:04:29] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas5001.wikimedia.org - ayounsi@cumin1002" [15:04:29] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host atlas5001.wikimedia.org [15:04:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131023 (https://phabricator.wikimedia.org/T389957) (owner: 10Bking) [15:04:47] !log jmm@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:05:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 803ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:05:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131016 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:47] (03Merged) 10jenkins-bot: InitialiseSettings-labs: wgTemplateDataEnableDiscovery on beta.enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131016 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [15:07:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:07:20] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1131024 (owner: 10Giuseppe Lavagetto) [15:09:03] (03CR) 10Giuseppe Lavagetto: [C:03+2] external_clouds_vendors: add hetzner and netcup [puppet] - 10https://gerrit.wikimedia.org/r/1131024 (owner: 10Giuseppe Lavagetto) [15:09:28] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2007.codfw.wmnet with OS bookworm [15:09:53] !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host ml-serve2007 [15:10:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 803.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:11:08] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [15:11:50] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host wikikube-worker2264.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:12:03] (03CR) 10Bking: [C:03+2] relforge: bring new hosts online [puppet] - 10https://gerrit.wikimedia.org/r/1131023 (https://phabricator.wikimedia.org/T389957) (owner: 10Bking) [15:12:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:12:50] (03CR) 10Bking: [C:03+2] "self-merging. The PCC failure seems related to the newness of the host, which was brought online less than 24 hours and thus probably does" [puppet] - 10https://gerrit.wikimedia.org/r/1131023 (https://phabricator.wikimedia.org/T389957) (owner: 10Bking) [15:12:59] <_joe_> uh that doesn't look good (the mw errors being high, latency too) [15:13:23] _joe_: saturated [15:13:36] we're at 85% workers utilization [15:14:01] <_joe_> claime: which is a consequence of the latency [15:14:05] yeah [15:15:41] (03PS1) 10Bking: Revert "relforge: bring new hosts online" [puppet] - 10https://gerrit.wikimedia.org/r/1131027 [15:15:51] Erm that's been going on for a while [15:16:01] (03CR) 10Bking: [V:03+2 C:03+2] Revert "relforge: bring new hosts online" [puppet] - 10https://gerrit.wikimedia.org/r/1131027 (owner: 10Bking) [15:16:03] latency has been up since 0930/0945 this mornign [15:16:06] morning* [15:16:22] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2007 - elukey@cumin1002" [15:16:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2007 - elukey@cumin1002" [15:16:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:16:28] !log
elukey@cumin1002 START - Cookbook sre.dns.wipe-cache ml-serve2007.codfw.wmnet 78.32.192.10.in-addr.arpa 8.7.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:16:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-serve2007.codfw.wmnet 78.32.192.10.in-addr.arpa 8.7.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:16:32] !log elukey@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve2007 [15:16:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2265.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:16:53] !log elukey@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve2007 [15:16:53] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve2007 [15:17:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:17:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2263.codfw.wmnet with OS bookworm [15:17:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2264.codfw.wmnet with OS bookworm [15:17:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2265.codfw.wmnet with OS bookworm [15:17:28] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674556 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2263.codfw.wmnet with... 
[15:17:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674557 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2264.codfw.wmnet with... [15:17:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674558 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2265.codfw.wmnet with... [15:17:54] 06SRE, 06Infrastructure-Foundations, 10netops: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958 (10cmooney) 03NEW p:05Triage→03Medium [15:18:40] MySQL open connections are way up compared to yesterday, starting at around that time as well [15:20:36] 06SRE, 06Infrastructure-Foundations, 10netops: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10674586 (10cmooney) @aborrero as discussed we can possibly arrange a window for Thurs Mar 27th to carry out the remaining steps? Unlike the previous attempt I will lea... [15:21:03] 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Bring relforge100[89] into production - https://phabricator.wikimedia.org/T389957#10674592 (10bking) After adding `relforge1009` to its production roles, I see a Puppet error ` Error: Could not retrieve catalog from remote... 
[15:21:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:21:23] (03CR) 10Muehlenhoff: [C:03+2] Apply maps/replica role to maps-test2002 [puppet] - 10https://gerrit.wikimedia.org/r/1130984 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:23:08] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 124.68 ms [15:25:42] 06SRE, 10MediaWiki-Core-AuthManager, 10MediaWiki-Debug-Logger, 06MediaWiki-Platform-Team: Throttler IP logging uses internal IPs - https://phabricator.wikimedia.org/T389887#10674631 (10Scott_French) Per netbox, these are indeed Wikimedia Cloud IPs. > Do we need to add them to the list of IPs that are trus... [15:26:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:26:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:26:45] (03PS1) 10Filippo Giunchedi: karma: strip sre-irc receiver if duplicated [puppet] - 10https://gerrit.wikimedia.org/r/1131028 (https://phabricator.wikimedia.org/T353457) [15:28:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2001.codfw.wmnet with OS bookworm [15:28:31] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:28:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime 
for 2:00:00 on wikikube-worker2263.codfw.wmnet with reason: host reimage [15:29:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2264.codfw.wmnet with reason: host reimage [15:29:17] !log cmooney@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 396032 [15:29:18] (03CR) 10Filippo Giunchedi: [C:03+1] Add transit/peering in/out port saturation alert - try 2 [alerts] - 10https://gerrit.wikimedia.org/r/1130625 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [15:29:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2265.codfw.wmnet with reason: host reimage [15:29:47] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 396032 [15:30:37] 06SRE, 10MediaWiki-Core-AuthManager, 10MediaWiki-Debug-Logger, 06MediaWiki-Platform-Team: Throttler IP logging uses internal IPs - https://phabricator.wikimedia.org/T389887#10674642 (10kostajh) > Are you looking to resolve these IPs to specific VPS instances / bots and in order to get confirmation from th... 
[15:30:58] !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=ml-serve2002.codfw.wmnet [15:31:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2263.codfw.wmnet with reason: host reimage [15:32:24] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updare frack node to use new mgmt subnet 10.195.1.1/25 - pt1979@cumin2002" [15:32:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updare frack node to use new mgmt subnet 10.195.1.1/25 - pt1979@cumin2002" [15:32:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:34:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2265.codfw.wmnet with reason: host reimage [15:35:50] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti-test2001.codfw.wmnet with OS bookworm [15:36:09] !log pt1979@cumin2002 START - Cookbook sre.dns.wipe-cache frbast2002.mgmt.frack.codfw.wmnet on all recursors [15:36:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) frbast2002.mgmt.frack.codfw.wmnet on all recursors [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2264.codfw.wmnet with reason: host reimage [15:38:21] (03CR) 10Filippo Giunchedi: "See inline, other than that LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 
(https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [15:39:11] 10ops-codfw, 06SRE, 06DC-Ops: Renumber frack server mgmt IPs in codfw - https://phabricator.wikimedia.org/T371468#10674710 (10Papaul) frbast2002 complete [15:40:56] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1008.eqiad.wmnet with OS bullseye [15:41:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10674712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cu... [15:41:26] (03CR) 10Filippo Giunchedi: "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [15:43:22] (03CR) 10Ahmon Dancy: add hiera keys needed since spiderpig includes envoy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1130633 (owner: 10Dzahn) [15:45:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:46:22] (03PS1) 10Volans: interactive: add NullHandler to the notify logger [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1131032 [15:46:57] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:48:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:48:21] 
!log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2263.codfw.wmnet with OS bookworm [15:48:27] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: name=ml-serve2003.codfw.wmnet,dc=codfw,cluster=maps,service=inference [15:48:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674728 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2263.codfw.wmnet with OS... [15:49:34] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:49:39] (03CR) 10Jobo: [C:03+1] data.yaml: Allow journalctl access to spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1130138 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [15:49:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:49:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2265.codfw.wmnet with OS bookworm [15:50:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674729 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2265.codfw.wmnet with OS... 
[15:50:13] (03PS1) 10Elukey: role::ml_k8s::worker: move ml-serve2008 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131033 (https://phabricator.wikimedia.org/T387854) [15:50:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:50:53] (03CR) 10Klausman: [C:03+1] role::ml_k8s::worker: move ml-serve2008 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131033 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [15:51:04] 06SRE, 06Infrastructure-Foundations, 10netops: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10674734 (10cmooney) Config to be applied in first step - P74416 [15:51:26] (03CR) 10Dzahn: [C:03+2] add hiera keys needed since spiderpig includes envoy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1130633 (owner: 10Dzahn) [15:52:45] (03CR) 10Muehlenhoff: [C:03+2] data.yaml: Allow journalctl access to spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1130138 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [15:53:06] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:53:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:53:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2264.codfw.wmnet with OS bookworm [15:53:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: 
Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674744 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2264.codfw.wmnet with OS... [15:54:01] (03PS4) 10Muehlenhoff: Enable maps-test2003 to maps-test2006 as additional maps bookworm replicas [puppet] - 10https://gerrit.wikimedia.org/r/1115863 (https://phabricator.wikimedia.org/T381565) [15:54:13] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5150/" [puppet] - 10https://gerrit.wikimedia.org/r/1131033 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [15:54:49] (03CR) 10Elukey: [V:03+1 C:03+2] role::ml_k8s::worker: move ml-serve2008 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131033 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [15:57:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2266.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:57:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2267.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:57:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2268.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:57:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2267.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:57:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2267.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:59:49] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve2007.codfw.wmnet with OS bookworm [16:00:05] jhathaway and rzl: #bothumor When your hammer is PHP, everything starts 
looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:21] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2007.codfw.wmnet with OS bookworm [16:02:08] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-serve2008.codfw.wmnet with OS bookworm [16:02:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2266.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:02:37] !log klausman@cumin2002 START - Cookbook sre.hosts.move-vlan for host ml-serve2008 [16:02:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2268.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:03:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2267.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:03:01] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [16:03:06] (03CR) 10Elukey: [C:03+1] interactive: add NullHandler to the notify logger [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1131032 (owner: 10Volans) [16:04:53] (03CR) 10Elukey: [C:03+1] "LGTM, checked also the IPs." 
[puppet] - 10https://gerrit.wikimedia.org/r/1115863 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:05:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130981 (https://phabricator.wikimedia.org/T388955) (owner: 10Anzx) [16:05:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2266.codfw.wmnet with OS bookworm [16:05:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2267.codfw.wmnet with OS bookworm [16:06:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2268.codfw.wmnet with OS bookworm [16:06:04] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674813 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2266.codfw.wmnet with... [16:06:10] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674814 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2267.codfw.wmnet with... [16:06:12] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10674815 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2268.codfw.wmnet with... 
[16:09:16] (03CR) 10Kamila Součková: [C:03+2] benthos-mw-accesslog-metrics: create deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010 (owner: 10Kamila Součková) [16:13:33] (03PS1) 10Btullis: [airflow] - Increase the limit on the maximum number of mapped tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131036 (https://phabricator.wikimedia.org/T389773) [16:13:55] (03PS1) 10Robertsky: T389729 - updating wikimaniawiki namespace configurations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131038 [16:14:57] (03Merged) 10jenkins-bot: benthos-mw-accesslog-metrics: create deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010 (owner: 10Kamila Součková) [16:15:58] !log kamila@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/benthos-mw-accesslog-metrics: apply [16:15:59] !log kamila@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/benthos-mw-accesslog-metrics: apply [16:17:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2267.codfw.wmnet with reason: host reimage [16:17:28] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10674890 (10BCornwall) That seems likely, yes. Their investigation isn't helped by the fact that I set all the fan offsets, lowering the temperatures. Yeah, the inlet temps are fine but the issue is the CPU... 
[16:17:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2266.codfw.wmnet with reason: host reimage [16:17:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2268.codfw.wmnet with reason: host reimage [16:18:24] !log updating ssd firmware on cp4047 via T387238 [16:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:28] T387238: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238 [16:18:34] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2007.codfw.wmnet with reason: host reimage [16:18:43] !log kamila@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/benthos-mw-accesslog-metrics: apply [16:18:44] !log kamila@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/benthos-mw-accesslog-metrics: apply [16:19:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2267.codfw.wmnet with reason: host reimage [16:20:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:20:15] !log kamila@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[16:21:04] PROBLEM - Host ml-serve2007 is DOWN: PING CRITICAL - Packet loss = 100% [16:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.99% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:22:23] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2007.codfw.wmnet with reason: host reimage [16:23:22] (03PS1) 10Cathal Mooney: Remove include statement for 10.195.0.96/27 from 10.in-addr.arpa zone [dns] - 10https://gerrit.wikimedia.org/r/1131040 (https://phabricator.wikimedia.org/T371468) [16:25:06] (03CR) 10Ssingh: [C:03+1] Remove include statement for 10.195.0.96/27 from 10.in-addr.arpa zone [dns] - 10https://gerrit.wikimedia.org/r/1131040 (https://phabricator.wikimedia.org/T371468) (owner: 10Cathal Mooney) [16:25:17] (03CR) 10Cathal Mooney: [C:03+2] Remove include statement for 10.195.0.96/27 from 10.in-addr.arpa zone [dns] - 10https://gerrit.wikimedia.org/r/1131040 (https://phabricator.wikimedia.org/T371468) (owner: 10Cathal Mooney) [16:25:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2266.codfw.wmnet with reason: host reimage [16:25:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10674953 (10phaultfinder) [16:25:44] !log cmooney@dns2005 START - running authdns-update [16:25:47] !log kamila@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:26:07] RECOVERY - Host ml-serve2007 is UP: PING OK - Packet loss = 0%, RTA = 30.30 ms [16:26:23] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Renumber frack server mgmt IPs in codfw - https://phabricator.wikimedia.org/T371468#10674955 (10cmooney) 05Open→03Resolved a:03cmooney >>! In T371468#10674710, @Papaul wrote: > frbast2002 complete Thanks for the work on this!!! [16:26:56] (03CR) 10BCornwall: [C:03+2] upgrade cp5017 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130731 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:27:05] (03CR) 10BCornwall: [C:03+2] upgrade cp5031 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130745 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:27:16] (03PS2) 10BCornwall: upgrade cp5031 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130745 (https://phabricator.wikimedia.org/T378737) [16:27:23] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp5031 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130745 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:27:36] !log cmooney@dns2005 END - running authdns-update [16:27:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:28:28] 
(03CR) 10Filippo Giunchedi: pdu_config_netbox: add new module to grab PDUs from netbox (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [16:28:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2268.codfw.wmnet with reason: host reimage [16:28:54] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5017.eqsin.wmnet} and A:cp [16:28:58] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5031.eqsin.wmnet} and A:cp [16:29:08] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10674965 (10RobH) PCIE SSD updated to 2.5.0 @BCornwall: Can we return this to tenative service post update and see if it throws errors again? I have no errors in the idrac log from the last round of... [16:30:06] !log klausman@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2008 - klausman@cumin2002" [16:30:11] !log klausman@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2008 - klausman@cumin2002" [16:30:11] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:30:12] !log klausman@cumin2002 START - Cookbook sre.dns.wipe-cache ml-serve2008.codfw.wmnet 175.48.192.10.in-addr.arpa 5.7.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:30:15] !log klausman@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-serve2008.codfw.wmnet 175.48.192.10.in-addr.arpa 5.7.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:30:16] !log klausman@cumin2002 START - 
Cookbook sre.network.configure-switch-interfaces for host ml-serve2008 [16:30:38] !log klausman@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve2008 [16:30:38] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve2008 [16:30:46] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10674973 (10Fabfur) Chiming in to say that currently we're using this to deploy TLS certificates "the new way" (see T384227), no issues for me to repool it anyway [16:32:04] (03PS1) 10Kamila Součková: Revert "benthos-mw-accesslog-metrics: create deployment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131044 [16:32:30] !log jelto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deploy [16:33:04] !log jelto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator deploy [16:33:16] !log brennen@deploy1003 Started deploy [phabricator/deployment@f01e475]: deploy phab2002 for T389953 [16:33:21] T389953: Deploy Phabricator/Phorge 2025-03-25 - https://phabricator.wikimedia.org/T389953 [16:33:55] !log brennen@deploy1003 Finished deploy [phabricator/deployment@f01e475]: deploy phab2002 for T389953 (duration: 00m 39s) [16:34:05] PROBLEM - BGP status on lsw1-d5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:34:15] !log brennen@deploy1003 Started deploy [phabricator/deployment@f01e475]: deploy phab1004 for T389953 [16:34:52] !log brennen@deploy1003 Finished deploy [phabricator/deployment@f01e475]: deploy phab1004 for T389953 (duration: 00m 36s) [16:35:28] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera 
data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:35:43] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5031.eqsin.wmnet} and A:cp [16:36:01] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5017.eqsin.wmnet} and A:cp [16:36:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:36:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2267.codfw.wmnet with OS bookworm [16:36:35] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10675004 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2267.codfw.wmnet with OS... 
[16:37:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 20.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:38:44] (03CR) 10Kamila Součková: [C:03+2] Revert "benthos-mw-accesslog-metrics: create deployment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131044 (owner: 10Kamila Součková) [16:39:09] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2007.codfw.wmnet with OS bookworm [16:40:13] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:43:45] (03Merged) 10jenkins-bot: Revert "benthos-mw-accesslog-metrics: create deployment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131044 (owner: 10Kamila Součková) [16:45:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:45:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2266.codfw.wmnet with OS bookworm [16:45:39] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:45:47] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10675039 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2266.codfw.wmnet with OS... 
[16:47:05] (03CR) 10Superpes15: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131018 (https://phabricator.wikimedia.org/T389952) (owner: 10Hubaishan) [16:48:22] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2008.codfw.wmnet with reason: host reimage [16:48:53] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10675061 (10RobH) >>! In T387238#10674897, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/a0oZzpUB8tZ8Ohr0EoF1} [2025-03-25T16:18:24Z... [16:49:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:49:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2268.codfw.wmnet with OS bookworm [16:49:28] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10675063 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2268.codfw.wmnet with OS... [16:50:32] (03CR) 10Superpes15: [C:03+1] Allow arwikisource bureaucrat to manage "import" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131018 (https://phabricator.wikimedia.org/T389952) (owner: 10Hubaishan) [16:50:55] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10675065 (10fnegri) a:05Andrew→03fnegri I'm back from holidays and I'm re-assigning this task to myself. @aborrero is quite confident that upda... 
[16:51:47] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2008.codfw.wmnet with reason: host reimage [16:53:50] (03PS1) 10Nik Gkountas: Add all language codes to SectionTranslationTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131050 (https://phabricator.wikimedia.org/T387821) [16:54:19] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:58:04] jouncebot: nowandnext [16:58:04] For the next 0 hour(s) and 1 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T1600) [16:58:04] In 0 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T1700) [16:58:11] jouncebot: refresh [16:58:12] I refreshed my knowledge about deployments. [16:59:48] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10675101 (10BCornwall) Linux is still not happy: ` Mar 25 16:28:02 cp4047 kernel: blk_update_request: critical target error, dev nvme0c0n1, sector 0 op 0x0:(READ) flags 0x2000000 phys_seg 1 prio class 0... [17:00:05] swfrench-wmf: Your horoscope predicts another MediaWiki infrastructure (UTC late) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T1700). [17:00:52] (03PS1) 10FNegri: Failover all dumps traffic to clouddumps1002 [puppet] - 10https://gerrit.wikimedia.org/r/1131051 (https://phabricator.wikimedia.org/T383723) [17:01:06] RECOVERY - BGP status on lsw1-d5-codfw.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:01:37] o/ [17:02:03] (03PS1) 10Fabfur: haproxy: use volatile storage for 2 hosts on magru [puppet] - 10https://gerrit.wikimedia.org/r/1131052 (https://phabricator.wikimedia.org/T384227) [17:02:07] (03CR) 10Scott French: "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1129959 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:02:09] (03CR) 10Scott French: [C:03+2] mw-*: normalize the next and migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129959 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:02:15] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131052 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [17:02:47] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2270 to codfw - jhancock@cumin2002" [17:02:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2270 to codfw - jhancock@cumin2002" [17:02:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:03:01] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2269 [17:03:02] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2270 [17:03:03] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2271 [17:03:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2269 [17:03:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker2271 [17:03:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2270 [17:03:23] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2271 [17:03:25] (03CR) 10BCornwall: [C:03+2] upgrade cp5018 to 
Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130732 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:03:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2271 [17:03:33] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp5030 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130744 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:03:38] (03PS2) 10Clément Goubert: mw::periodic_job: Migrate blameStartupRegistry.php [puppet] - 10https://gerrit.wikimedia.org/r/1131037 (https://phabricator.wikimedia.org/T388540) [17:03:42] (03Merged) 10jenkins-bot: mw-*: normalize the next and migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129959 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:03:46] (03PS2) 10BCornwall: upgrade cp5030 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130744 (https://phabricator.wikimedia.org/T378737) [17:03:52] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp5030 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130744 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:04:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2269.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:04:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2270.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:04:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2271.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:04:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2271.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:04:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2271.mgmt.codfw.wmnet with chassis set 
policy FORCE_RESTART [17:05:06] PROBLEM - BGP status on lsw1-d5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:05:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131050 (https://phabricator.wikimedia.org/T387821) (owner: 10Nik Gkountas) [17:05:40] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5030.eqsin.wmnet} and A:cp [17:05:42] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5018.eqsin.wmnet} and A:cp [17:06:06] RECOVERY - BGP status on lsw1-d5-codfw.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:06:15] !log swfrench@deploy1003 Started scap sync-world: Helmfile-only deployment for next and migration release cleanups - T383845 [17:06:19] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:07:14] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2008.codfw.wmnet with OS bookworm [17:07:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10675138 (10phaultfinder) [17:07:47] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1130631 (owner: 10Gehel) [17:08:07] !log swfrench@deploy1003 Finished scap sync-world: Helmfile-only deployment for next and migration release cleanups - T383845 (duration: 02m 45s) [17:08:30] alright, I'm all done with the infra window today [17:09:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2269.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:11:21] !log sudo systemctl restart pybal on lvs2014 [17:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:12:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:12:19] farmer in the dell [17:12:26] {◕ ◡ ◕} [17:12:28] * sukhe whistles [17:12:34] haha [17:12:36] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5030.eqsin.wmnet} and A:cp [17:12:37] brett: cp5030 and 5018 [17:12:40] puppet failure [17:13:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10675179 (10phaultfinder) [17:13:41] sukhe: <3 (pybal restarts) [17:13:57] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5018.eqsin.wmnet} and A:cp [17:14:23] elukey: <3 [17:14:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2270.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:14:50] !log restart pybal on lvs2013 [17:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:58] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2271.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:15:18] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:16:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2271.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:16:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2270.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:16:49] !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=ml-serve2007.codfw.wmnet [17:17:24] (03PS1) 10Giuseppe Lavagetto: admin: Add simple function to read mw access logs to my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/1131054 [17:18:36] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2271.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:19:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2271.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:20:18] !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=ml-serve2007.codfw.wmnet [17:23:18] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10675211 (10RobH) >>! In T387238#10675101, @BCornwall wrote: > Linux is still not happy: Duly noted and coped over to the support ticket, I'll try to get them to send the replacement SSD without more tr... 
[17:24:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2270.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:24:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2271.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:26:04] !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=ml-serve2009.codfw.wmnet [17:26:09] !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=ml-serve2010.codfw.wmnet [17:26:14] !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=ml-serve2011.codfw.wmnet [17:27:09] !log klausman@cumin2002 conftool action : set/weight=1; selector: name=ml-serve2011.codfw.wmnet [17:27:25] !log klausman@cumin2002 conftool action : set/weight=1; selector: name=ml-serve2010.codfw.wmnet [17:27:31] !log klausman@cumin2002 conftool action : set/weight=1; selector: name=ml-serve2009.codfw.wmnet [17:28:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2269.codfw.wmnet with OS bookworm [17:28:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2270.codfw.wmnet with OS bookworm [17:28:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2271.codfw.wmnet with OS bookworm [17:28:14] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10675223 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2269.codfw.wmnet with... 
[17:28:15] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10675224 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2270.codfw.wmnet with... [17:28:16] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10675225 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2271.codfw.wmnet with... [17:28:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10675227 (10phaultfinder) [17:34:30] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:39:16] 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Bring relforge100[89] into production - https://phabricator.wikimedia.org/T389957#10675276 (10bking) p:05Triage→03Low a:05bking→03None [17:39:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2270.codfw.wmnet with reason: host reimage [17:39:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2271.codfw.wmnet with reason: host reimage [17:39:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2269.codfw.wmnet with reason: host reimage [17:40:18] 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Bring relforge100[89] into production - https://phabricator.wikimedia.org/T389957#10675284 (10bking) This is low priority, so moving back to the operations backlog... 
[17:41:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:42:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2270.codfw.wmnet with reason: host reimage [17:43:03] (03CR) 10Jgreen: [C:03+1] community_crm: Add trusted_host_patterns to settings template [puppet] - 10https://gerrit.wikimedia.org/r/1123711 (https://phabricator.wikimedia.org/T386267) (owner: 10Dwisehaupt) [17:43:45] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T389973 (10phaultfinder) 03NEW [17:45:15] (03CR) 10Jgreen: [C:03+1] community_civicrm: Add profile::community_civicrm::mail [puppet] - 10https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [17:45:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2271.codfw.wmnet with reason: host reimage [17:46:20] (03CR) 10BCornwall: [C:03+2] upgrade cp5019 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130733 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:46:33] (03PS2) 10BCornwall: upgrade cp5029 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130743 (https://phabricator.wikimedia.org/T378737) [17:46:37] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp5029 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130743 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:47:46] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5019.eqsin.wmnet} and A:cp [17:47:47] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5029.eqsin.wmnet} and A:cp [17:48:42] 10ops-codfw, 06DC-Ops: 
PowerSupplyFailure - https://phabricator.wikimedia.org/T389973#10675330 (10phaultfinder) [17:48:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2269.codfw.wmnet with reason: host reimage [17:49:29] (03CR) 10Jgreen: [C:03+1] community_civicrm: add stub for dovecot_passwd [labs/private] - 10https://gerrit.wikimedia.org/r/1124204 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [17:49:30] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:52:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10675355 (10phaultfinder) [17:54:35] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5019.eqsin.wmnet} and A:cp [17:54:37] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5029.eqsin.wmnet} and A:cp [17:54:41] (03CR) 10Jforrester: T389729 - updating wikimaniawiki namespace configurations (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131038 (owner: 10Robertsky) [17:57:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:57:48] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:58:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:58:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2270.codfw.wmnet with OS bookworm [17:58:16] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10675412 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2270.codfw.wmnet with OS... [18:00:02] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:04:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:04:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2271.codfw.wmnet with OS bookworm [18:04:48] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:04:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10675448 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2271.codfw.wmnet with OS... 
[18:05:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:05:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2269.codfw.wmnet with OS bookworm [18:05:42] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10675453 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2269.codfw.wmnet with OS... [18:05:49] (03CR) 10David Caro: "Got a question, LGTM otherwise 👍" [puppet] - 10https://gerrit.wikimedia.org/r/1131051 (https://phabricator.wikimedia.org/T383723) (owner: 10FNegri) [18:06:03] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10675454 (10Jhancock.wm) [18:06:59] (03PS1) 10DCausse: Add opensearch-knn [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1131068 (https://phabricator.wikimedia.org/T389812) [18:07:38] (03CR) 10David Caro: Failover all dumps traffic to clouddumps1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131051 (https://phabricator.wikimedia.org/T383723) (owner: 10FNegri) [18:09:03] (03CR) 10David Caro: Failover all dumps traffic to clouddumps1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131051 (https://phabricator.wikimedia.org/T383723) (owner: 10FNegri) [18:09:08] (03CR) 10David Caro: [C:03+1] Failover all dumps traffic to clouddumps1002 [puppet] - 10https://gerrit.wikimedia.org/r/1131051 (https://phabricator.wikimedia.org/T383723) (owner: 10FNegri) [18:17:04] 10ops-ulsfo, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10675490 (10phaultfinder) [18:20:03] FIRING: 
KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [18:25:38] !log Depooling cp3066 for varnishkafka testing (T389978) [18:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:43] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp3066.esams.wmnet [18:25:43] T389978: varnishkafka 1.1.0-5 exits on SIGHUP - https://phabricator.wikimedia.org/T389978 [18:29:16] PROBLEM - statsv Varnishkafka log producer on cp3066 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [18:29:51] !log dancy@deploy1003 Installing scap version "4.144.0" for 2 host(s) [18:31:37] !log dancy@deploy1003 Installation of scap version "4.144.0" completed for 2 hosts [18:33:11] (03CR) 10BCornwall: [C:03+2] upgrade cp5020 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130734 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:33:25] (03PS2) 10BCornwall: upgrade cp5028 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130742 (https://phabricator.wikimedia.org/T378737) [18:33:33] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp5028 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130742 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:33:39] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131072 [18:35:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [18:35:22] (03CR) 10Ebernhardson: [C:03+1] Add opensearch-knn [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1131068 (https://phabricator.wikimedia.org/T389812) (owner: 10DCausse) [18:39:16] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp1100.eqiad.wmnet [reason: testing varnish stuff] [18:42:03] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1100.eqiad.wmnet [reason: testing varnish stuff] [18:47:16] RECOVERY - statsv Varnishkafka log producer on cp3066 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [18:47:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2272.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:47:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:47:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2274.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:47:26] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:47:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:48:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2272.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:48:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:48:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2274.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:53:25] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5020.eqsin.wmnet} and A:cp [18:53:26] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5028.eqsin.wmnet} and A:cp [18:57:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:57:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128921 (https://phabricator.wikimedia.org/T384372) (owner: 10DLynch) [18:59:32] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5028.eqsin.wmnet} and A:cp [19:00:33] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5020.eqsin.wmnet} and A:cp [19:05:57] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:10:56] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2277 to codfw - jhancock@cumin2002" [19:11:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2277 to codfw - jhancock@cumin2002" [19:11:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [19:11:52] !log dancy@deploy1003 Installing scap version "4.144.1" for 2 host(s) [19:12:18] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:13:38] !log dancy@deploy1003 Installation of scap version "4.144.1" completed for 2 hosts [19:15:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:16:33] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2278-80 to codfw - jhancock@cumin2002" [19:16:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2278-80 to codfw - jhancock@cumin2002" [19:16:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:16:55] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2277 [19:16:56] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2278 [19:16:57] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2279 [19:16:59] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2280 [19:17:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2277 [19:17:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2278 [19:17:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
wikikube-worker2279 [19:17:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2280 [19:18:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2277.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:18:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2278.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:18:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2279.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:18:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2280.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:20:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:22:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:24:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:25:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10675856 (10phaultfinder) [19:25:54] 10ops-codfw, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389990 (10phaultfinder) 03NEW [19:25:55] 10ops-codfw, 
06DC-Ops: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T389991 (10phaultfinder) 03NEW [19:26:57] 10ops-eqiad, 06DC-Ops: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T389992 (10phaultfinder) 03NEW [19:29:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2277.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:29:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2280.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:29:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2279.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:29:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2278.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:29:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:31:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2277.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:31:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2278.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:31:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2279.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:31:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2280.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:33:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at 
eqiad: 8.845% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:37:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2277.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:38:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2279.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:38:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2280.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:42:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2278.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:42:57] (03CR) 10BCornwall: [C:03+2] upgrade cp5021 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130735 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:43:08] (03PS2) 10BCornwall: upgrade cp5027 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130741 (https://phabricator.wikimedia.org/T378737) [19:43:13] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp5027 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130741 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:45:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2272.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:45:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:45:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2275.mgmt.codfw.wmnet with chassis set policy 
FORCE_RESTART [19:45:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2274.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:45:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:45:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2274.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:45:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2274.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:46:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2275.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:46:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:46:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2274.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:46:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2272.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:46:51] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5021.eqsin.wmnet} and A:cp [19:46:52] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5027.eqsin.wmnet} and A:cp [19:47:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2273.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:47:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2276.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:47:56] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2276.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:50:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [19:53:41] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5027.eqsin.wmnet} and A:cp [19:53:59] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5021.eqsin.wmnet} and A:cp [19:55:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [19:55:54] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:57:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:58:49] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1194 - https://phabricator.wikimedia.org/T389751#10675984 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr duplicate ticket T389065 [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T2000). 
nyaa~ [20:00:05] ryankemper, HouseOfM, tgr, anzx, stephanebisson, and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] o/ [20:00:13] o/ [20:00:15] o/ [20:00:17] o/ [20:01:22] o/ [20:02:59] o/ [20:04:56] I can deploy in ten minutes if no one comes sooner [20:05:36] 10ops-codfw, 06SRE, 06DC-Ops: Renumber frack server mgmt IPs in codfw - https://phabricator.wikimedia.org/T371468#10675996 (10Papaul) I remove the last bit on the reth0 interface ` [edit interfaces reth0 unit 2140 family inet address 10.195.1.1/25] - /* T371468 */ - preferred; [edit interface... [20:07:10] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2281 to codfw - jhancock@cumin2002" [20:07:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2281 to codfw - jhancock@cumin2002" [20:07:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:09:23] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2283 [20:09:25] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2284 [20:09:26] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2281 [20:09:27] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2282 [20:09:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2283 [20:09:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
wikikube-worker2281 [20:09:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2282 [20:09:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2284 [20:10:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2281.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:10:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2282.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:10:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2283.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:10:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2284.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:15:28] (03CR) 10BryanDavis: [C:03+1] "I honestly think this exclusion should be the default. Triggering authorization on membership in a specific Toolforge tool is a thing that" [puppet] - 10https://gerrit.wikimedia.org/r/1131026 (https://phabricator.wikimedia.org/T389869) (owner: 10Ahmon Dancy) [20:16:22] No deployers around? 
[20:20:20] (03PS1) 10Bking: elastic/cirrussearch: begin production migration [puppet] - 10https://gerrit.wikimedia.org/r/1131087 (https://phabricator.wikimedia.org/T388610) [20:21:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2281.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:21:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2282.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:21:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2283.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:21:44] Guess I'll self-deploy our change [20:21:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2284.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:22:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2281.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:22:37] tgr_: not available after all? 
[20:22:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2282.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:22:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2283.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:22:56] in a sec [20:23:02] but feel free to start if you can [20:23:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2284.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:23:07] (03PS2) 10Bking: elastic/cirrussearch: begin production migration [puppet] - 10https://gerrit.wikimedia.org/r/1131087 (https://phabricator.wikimedia.org/T388610) [20:23:25] Starting mine [20:24:02] (03CR) 10Ryan Kemper: [C:03+2] wdqs-categories: remove extraneous wgCirrusSearchCategoryEndpoint value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122151 (https://phabricator.wikimedia.org/T375520) (owner: 10Bking) [20:24:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ryankemper@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124535 (https://phabricator.wikimedia.org/T375520) (owner: 10Ryan Kemper) [20:25:39] (03Merged) 10jenkins-bot: wdqs categories: switch to internal-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124535 (https://phabricator.wikimedia.org/T375520) (owner: 10Ryan Kemper) [20:26:26] !log ryankemper@deploy1003 Started scap sync-world: Backport for [[gerrit:1124535|wdqs categories: switch to internal-main (T375520 T385896 T337013)]] [20:26:34] T375520: EPIC: WDQS categories migration - https://phabricator.wikimedia.org/T375520 [20:26:35] T385896: Deploy wdqs-categories on wdqs-main/wdqs-internal-main hosts - https://phabricator.wikimedia.org/T385896 [20:26:35] T337013: [Epic] Splitting the graph in WDQS - https://phabricator.wikimedia.org/T337013 [20:27:18] All this automation is neat, last time I deployed my own backport it was all manual scap commands :D 
[20:28:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:28:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2284.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:28:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2281.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:28:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.05% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:28:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2283.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:29:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2282.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:31:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131087 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:32:41] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. 
subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10676078 (10bd808) @toni.stoev Please read https://www.mediawiki.org/wiki/Special:MyLanguage/Code_of_Conduc... [20:33:11] !log ryankemper@deploy1003 ryankemper: Backport for [[gerrit:1124535|wdqs categories: switch to internal-main (T375520 T385896 T337013)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:33:18] T375520: EPIC: WDQS categories migration - https://phabricator.wikimedia.org/T375520 [20:33:18] T385896: Deploy wdqs-categories on wdqs-main/wdqs-internal-main hosts - https://phabricator.wikimedia.org/T385896 [20:33:18] T337013: [Epic] Splitting the graph in WDQS - https://phabricator.wikimedia.org/T337013 [20:33:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:34:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:35:17] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:35:30] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:35:36] Sadly, I need to leave. 
[20:38:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:39:29] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2285-88 to codfw - jhancock@cumin2002" [20:39:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2285-88 to codfw - jhancock@cumin2002" [20:39:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:40:04] !log running sendBulkEmail.php as per T389064#10676087 [20:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:08] T389064: Notify WebAuthn users about SUL3 changes - https://phabricator.wikimedia.org/T389064 [20:40:21] !log T385896 Got successful deepcat search of `deepcat:"musicals"` on `en.wikipedia.org` with `X-Wikimedia-Debug:backend=k8s-mwdebug`; rolling out change fully now [20:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:26] T385896: Deploy wdqs-categories on wdqs-main/wdqs-internal-main hosts - https://phabricator.wikimedia.org/T385896 [20:40:27] !log ryankemper@deploy1003 ryankemper: Continuing with sync [20:41:49] !log gmodena@deploy1003 Started deploy [airflow-dags/search@af7e28f]: Deploying mjolnir 2.6.0 [20:42:34] !log gmodena@deploy1003 Finished deploy [airflow-dags/search@af7e28f]: Deploying mjolnir 2.6.0 (duration: 01m 00s) [20:43:04] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host 
wikikube-worker2285 [20:43:06] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2286 [20:43:07] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2287 [20:43:08] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2288 [20:43:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2285 [20:43:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:43:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2286 [20:43:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2287 [20:43:25] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk in slot 10 for an-worker1194 - https://phabricator.wikimedia.org/T389065#10676096 (10Jclark-ctr) a:03Jclark-ctr [20:43:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2288 [20:44:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2285.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:44:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2286.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:45:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2287.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART 
[20:45:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2288.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:48:06] !log ryankemper@deploy1003 Finished scap sync-world: Backport for [[gerrit:1124535|wdqs categories: switch to internal-main (T375520 T385896 T337013)]] (duration: 21m 40s) [20:48:12] T375520: EPIC: WDQS categories migration - https://phabricator.wikimedia.org/T375520 [20:48:13] T385896: Deploy wdqs-categories on wdqs-main/wdqs-internal-main hosts - https://phabricator.wikimedia.org/T385896 [20:48:13] T337013: [Epic] Splitting the graph in WDQS - https://phabricator.wikimedia.org/T337013 [20:49:30] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:50:17] ryankemper: If you're done, I could get mine next? [20:50:42] Kemayo: yup I'm all finished up, feel free to go ahead (cc tgr) [20:51:00] cc tgr_ * [20:51:30] Kemayo: are you deploying? [20:51:43] tgr_: I can get mine, at least. [20:51:57] ok [20:52:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128921 (https://phabricator.wikimedia.org/T384372) (owner: 10DLynch) [20:52:10] I can do the rest if the next window is free [20:52:24] I've never actually run a deployment window, so I'm being cautious there. 
[20:52:42] I'm back now so can help if needed [20:52:54] (03Merged) 10jenkins-bot: Enable VisualEditor EditCheck multi-check a/b test on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128921 (https://phabricator.wikimedia.org/T384372) (owner: 10DLynch) [20:53:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10676148 (10VRiley-WMF) Replaced drives in an-worker1180, 1181, 1182 [20:53:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:53:18] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1128921|Enable VisualEditor EditCheck multi-check a/b test on remaining wikis (T384372)]] [20:53:20] though since scap-backport was introduced it's very straightforward [20:53:24] T384372: Deploy config change to start the Multi-Reference Check A/B Test - https://phabricator.wikimedia.org/T384372 [20:53:38] * ryankemper was in the process of typing "FWIW this was my first time with the new automated process and it was super straightforward" [20:53:42] xD [20:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:55:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
wikikube-worker2285.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:55:29] Yeah, my worries are entirely around "what if something actually goes *wrong*" :D [20:55:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2286.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:55:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2287.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:55:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2288.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:57:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2285.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:57:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2286.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:57:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2287.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:58:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2288.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:59:23] (03CR) 10BCornwall: [C:03+2] upgrade cp5022 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130736 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:59:33] (03PS2) 10BCornwall: upgrade cp5026 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130740 (https://phabricator.wikimedia.org/T378737) [20:59:36] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp5026 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130740 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:59:40] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1128921|Enable VisualEditor EditCheck multi-check a/b test on remaining wikis 
(T384372)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:59:45] T384372: Deploy config change to start the Multi-Reference Check A/B Test - https://phabricator.wikimedia.org/T384372 [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250325T2100) [21:01:13] !log kemayo@deploy1003 kemayo: Continuing with sync [21:02:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2285.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:03:24] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:03:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:03:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2286.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:03:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2287.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:04:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2288.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:08:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2289 to codfw - jhancock@cumin2002" [21:08:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2289 to codfw - 
jhancock@cumin2002" [21:08:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:08:18] (03PS1) 10Andrew Bogott: Update codfw1dev horizon version [puppet] - 10https://gerrit.wikimedia.org/r/1131096 (https://phabricator.wikimedia.org/T389965) [21:08:23] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2289 [21:08:29] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2290 [21:08:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2289 [21:08:36] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2292 [21:08:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2290 [21:08:41] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1128921|Enable VisualEditor EditCheck multi-check a/b test on remaining wikis (T384372)]] (duration: 15m 23s) [21:08:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2292 [21:08:46] T384372: Deploy config change to start the Multi-Reference Check A/B Test - https://phabricator.wikimedia.org/T384372 [21:08:50] tgr_: Okay, done with mine. [21:08:54] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2289 [21:09:01] thx [21:09:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2289 [21:09:07] who else is still around? [21:09:09] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2293 [21:09:15] anzx? 
[21:09:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2293 [21:09:22] tgr_: o/ [21:09:50] I'll do HouseOfM's patch too, it looks trivial [21:10:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2289.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:10:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2290.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:10:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:10:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2292.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:10:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2293.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:11:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130659 (https://phabricator.wikimedia.org/T386428) (owner: 10Daimona Eaytoy) [21:11:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130752 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [21:11:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130981 (https://phabricator.wikimedia.org/T388955) (owner: 10Anzx) [21:11:46] (03CR) 10Andrew Bogott: [C:03+2] Update codfw1dev horizon version 
[puppet] - 10https://gerrit.wikimedia.org/r/1131096 (https://phabricator.wikimedia.org/T389965) (owner: 10Andrew Bogott) [21:11:56] (03Merged) 10jenkins-bot: Drop unused $wgCampaignEventsSeparateOngoingEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130659 (https://phabricator.wikimedia.org/T386428) (owner: 10Daimona Eaytoy) [21:11:58] (03Merged) 10jenkins-bot: Enable SUL3 login for 10% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130752 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [21:12:00] (03Merged) 10jenkins-bot: knwikisource, tcywikisource: add translate namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130981 (https://phabricator.wikimedia.org/T388955) (owner: 10Anzx) [21:12:25] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1130659|Drop unused $wgCampaignEventsSeparateOngoingEvents (T386428)]], [[gerrit:1130752|Enable SUL3 login for 10% of group 2 users (T384219)]], [[gerrit:1130981|knwikisource, tcywikisource: add translate namespace (T388955)]] [21:12:33] T386428: Drop feature flag for Special:AllEvents section UI - https://phabricator.wikimedia.org/T386428 [21:12:33] T384219: SUL3 Phase 4: Staged rollout for all existing users - https://phabricator.wikimedia.org/T384219 [21:12:33] T388955: Add Translate namespace in Kannada and Tulu wikisource - https://phabricator.wikimedia.org/T388955 [21:13:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:15:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.82% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:15:37] brett: are those eqsin Puppet failures expected from T378737? [21:15:37] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [21:16:11] (e.g. https://puppetboard.wikimedia.org/report/cp5022.eqsin.wmnet/c664c6a327a82204cf72b7680650a4de377850c8) [21:18:41] !log tgr@deploy1003 daimona, anzx, tgr: Backport for [[gerrit:1130659|Drop unused $wgCampaignEventsSeparateOngoingEvents (T386428)]], [[gerrit:1130752|Enable SUL3 login for 10% of group 2 users (T384219)]], [[gerrit:1130981|knwikisource, tcywikisource: add translate namespace (T388955)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:18:48] T386428: Drop feature flag for Special:AllEvents section UI - https://phabricator.wikimedia.org/T386428 [21:18:48] T384219: SUL3 Phase 4: Staged rollout for all existing users - https://phabricator.wikimedia.org/T384219 [21:18:49] T388955: Add Translate namespace in Kannada and Tulu wikisource - https://phabricator.wikimedia.org/T388955 [21:18:59] checking [21:19:48] tgr_: namespace change looks good [21:20:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2289.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:20:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2290.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:20:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2292.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:21:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2293.mgmt.codfw.wmnet with chassis set 
policy FORCE_RESTART [21:21:57] !log tgr@deploy1003 daimona, anzx, tgr: Continuing with sync [21:22:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2289.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:22:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2290.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:22:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2292.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:22:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2293.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:24:03] anzx: this is a normal namespace, right? Not the Translate extension? [21:24:23] so it will need namespaceDupes.php? [21:24:32] rzl: Yes [21:24:50] Just popped back in, thanks tgr_! [21:25:18] tgr_: yes normal namespace, need to run namespacedupes [21:25:42] puppet is expected to fail the first go-around on the upgrade, as the new VCL is not backwards-compatible. 
Once it upgrades varnish and restarts it's alright again [21:26:02] 10ops-drmrs: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389848#10676294 (10phaultfinder) [21:26:05] (03PS1) 10Bking: cirrussearch: create a puppet plan for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1131098 (https://phabricator.wikimedia.org/T389971) [21:26:28] (03CR) 10CI reject: [V:04-1] cirrussearch: create a puppet plan for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1131098 (https://phabricator.wikimedia.org/T389971) (owner: 10Bking) [21:26:32] brett: okay, thanks [21:27:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2292.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:29:00] (03PS2) 10Bking: cirrussearch: create a puppet plan for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1131098 (https://phabricator.wikimedia.org/T389971) [21:29:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2289.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:29:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2293.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:29:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2290.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:29:23] (03CR) 10CI reject: [V:04-1] cirrussearch: create a puppet plan for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1131098 (https://phabricator.wikimedia.org/T389971) (owner: 10Bking) [21:29:29] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130659|Drop unused $wgCampaignEventsSeparateOngoingEvents (T386428)]], [[gerrit:1130752|Enable SUL3 login for 10% of group 2 users (T384219)]], [[gerrit:1130981|knwikisource, tcywikisource: add translate namespace (T388955)]] (duration: 17m 04s) 
[21:29:36] T386428: Drop feature flag for Special:AllEvents section UI - https://phabricator.wikimedia.org/T386428 [21:29:36] T384219: SUL3 Phase 4: Staged rollout for all existing users - https://phabricator.wikimedia.org/T384219 [21:29:37] T388955: Add Translate namespace in Kannada and Tulu wikisource - https://phabricator.wikimedia.org/T388955 [21:30:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.11% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:30:25] (03CR) 10BCornwall: [C:03+2] upgrade cp5023 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130737 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [21:30:36] (03PS1) 10RLazarus: httpbb: Test /view/fr/Z1 case-insensitively [puppet] - 10https://gerrit.wikimedia.org/r/1131101 (https://phabricator.wikimedia.org/T383032) [21:30:38] (03PS2) 10BCornwall: upgrade cp5025 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130739 (https://phabricator.wikimedia.org/T378737) [21:30:41] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp5025 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130739 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [21:31:54] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:31:59] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5022.eqsin.wmnet} and A:cp [21:32:10] !log UTC late deploys done [21:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:20] !log 
brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5026.eqsin.wmnet} and A:cp [21:33:54] tgr_: thanks for deploying, was namespacedupes run on both wikis [21:34:03] yeah, no dupes [21:34:30] thanks [21:34:40] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:35:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:38:13] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5022.eqsin.wmnet} and A:cp [21:40:06] (03PS3) 10Bking: cirrussearch: create a puppet plan for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1131098 (https://phabricator.wikimedia.org/T389971) [21:40:23] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5026.eqsin.wmnet} and A:cp [21:40:29] (03CR) 10CI reject: [V:04-1] cirrussearch: create a puppet plan for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1131098 (https://phabricator.wikimedia.org/T389971) (owner: 10Bking) [21:43:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:44:25] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling 
upgrade of Varnish on P{cp5023.eqsin.wmnet} and A:cp [21:44:27] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5025.eqsin.wmnet} and A:cp [21:44:53] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:51:04] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5025.eqsin.wmnet} and A:cp [21:51:20] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5023.eqsin.wmnet} and A:cp [21:51:53] 10ops-codfw, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389990#10676388 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm expected interruption from "Emergency Planned Work PWIC261816 Notification" from Arelion [21:52:09] 10ops-codfw, 06SRE, 06DC-Ops: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T389991#10676392 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm expected interruption from "Emergency Planned Work PWIC261816 Notification" from Arelion [21:54:19] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2294 to codfw - jhancock@cumin2002" [21:54:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2294 to codfw - jhancock@cumin2002" [21:54:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:56:36] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2294 [21:56:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2294 [21:56:47] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2295 [21:56:55] 
!log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2295 [21:56:59] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2296 [21:57:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2296 [21:57:13] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2297 [21:57:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2297 [21:57:25] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2298 [21:57:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2298 [21:57:45] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2299 [21:57:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2299 [22:01:37] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:04:06] (03CR) 10Stoyofuku-wmf: "Looks good overall - I had a couple of questions about the test, and a note about the config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) (owner: 10Jdlrobson) [22:05:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2300 to codfw - jhancock@cumin2002" [22:05:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2300 to codfw - jhancock@cumin2002" [22:06:00] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [22:06:08] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2300 [22:06:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2300 [22:06:19] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2301 [22:06:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2301 [22:06:35] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2302 [22:06:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2302 [22:07:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2294.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:07:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2295.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:07:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2296.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:07:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2297.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:08:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2298.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:11:32] (03PS3) 10Cwhite: prometheus: add recording rules for use by histogram_quantile [puppet] - 10https://gerrit.wikimedia.org/r/1130689 (https://phabricator.wikimedia.org/T383963) [22:11:52] (03CR) 10Cwhite: prometheus: add recording rules for use by histogram_quantile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1130689 (https://phabricator.wikimedia.org/T383963) (owner: 10Cwhite) 
[22:18:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2295.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:18:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2294.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:18:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2296.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:18:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2297.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:18:47] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3066.esams.wmnet [22:19:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2298.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:19:38] (03CR) 10BCornwall: [C:03+2] upgrade cp5024 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130738 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [22:19:54] 10ops-codfw, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390008 (10phaultfinder) 03NEW [22:20:49] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5024.eqsin.wmnet} and A:cp [22:22:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2294.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:22:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2295.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:23:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2296.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:23:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host 
wikikube-worker2297.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:23:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2298.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:27:36] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5024.eqsin.wmnet} and A:cp [22:30:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2294.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:30:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2297.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:30:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2295.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:30:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2296.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:31:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:32:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2299.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:32:29] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:32:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2300.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:32:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision 
for host wikikube-worker2301.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:32:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2302.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:33:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2298.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:42:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2299.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:43:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2300.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:43:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2301.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:43:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2302.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:44:22] PROBLEM - Ensure traffic_manager is running for instance backend on cp7007 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:44:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2299.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:44:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2300.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:45:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2301.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:45:22] RECOVERY - Ensure traffic_manager is running for instance backend on cp7007 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:45:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2302.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:50:06] In one hour there will probably be a few alerts for varnishkafka processes missing. Those can be ignored as a fix is incoming [22:50:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10676523 (10phaultfinder) [22:55:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2299.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:55:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2300.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:55:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2301.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:55:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2302.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:58:27] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387829#10676527 (10phaultfinder) [23:02:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2272.codfw.wmnet with OS bookworm [23:02:32] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2272.codfw.wmnet with... 
[23:02:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2273.codfw.wmnet with OS bookworm [23:02:42] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676535 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2273.codfw.wmnet with... [23:02:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2274.codfw.wmnet with OS bookworm [23:02:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676536 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2274.codfw.wmnet with... [23:02:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2275.codfw.wmnet with OS bookworm [23:02:59] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2275.codfw.wmnet with... [23:03:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2276.codfw.wmnet with OS bookworm [23:03:08] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2276.codfw.wmnet with... 
[23:05:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10676539 (10phaultfinder) [23:10:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10676543 (10phaultfinder) [23:11:12] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676544 (10Jhancock.wm) [23:13:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2273.codfw.wmnet with reason: host reimage [23:14:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2272.codfw.wmnet with reason: host reimage [23:14:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2274.codfw.wmnet with reason: host reimage [23:14:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2275.codfw.wmnet with reason: host reimage [23:14:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2276.codfw.wmnet with reason: host reimage [23:15:28] 06SRE, 10WMF-General-or-Unknown, 07Performance Issue, 07Wikimedia-production-error: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors - https://phabricator.wikimedia.org/T389734#10676549 (10Scott_French) Following up, I spent a bit more time in the code earlier today, and... 
[23:16:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2273.codfw.wmnet with reason: host reimage [23:19:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2274.codfw.wmnet with reason: host reimage [23:21:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2275.codfw.wmnet with reason: host reimage [23:25:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2272.codfw.wmnet with reason: host reimage [23:30:52] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:31:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:31:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2273.codfw.wmnet with OS bookworm [23:31:26] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676572 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2273.codfw.wmnet with OS... 
[23:31:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2276.codfw.wmnet with reason: host reimage [23:34:29] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:35:02] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:35:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:35:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2274.codfw.wmnet with OS bookworm [23:35:28] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2274.codfw.wmnet with OS... 
[23:36:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:37:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:37:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2275.codfw.wmnet with OS bookworm [23:37:47] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676581 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2275.codfw.wmnet with OS... [23:40:43] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:41:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:41:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2272.codfw.wmnet with OS bookworm [23:41:10] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676582 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2272.codfw.wmnet with OS... 
[23:46:29] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:47:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:47:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2276.codfw.wmnet with OS bookworm [23:47:23] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676588 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2276.codfw.wmnet with OS... [23:48:01] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10676590 (102003problems) >>! In T214998#10676078, @bd808 wrote: > @toni.stoev Please read https://www.medi... [23:48:29] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:51:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2277.codfw.wmnet with OS bookworm [23:51:35] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676593 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2277.codfw.wmnet with... 
[23:51:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2278.codfw.wmnet with OS bookworm [23:51:42] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676594 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2278.codfw.wmnet with... [23:51:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2284.codfw.wmnet with OS bookworm [23:51:59] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676595 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2284.codfw.wmnet with... [23:52:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2285.codfw.wmnet with OS bookworm [23:52:11] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676596 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2285.codfw.wmnet with... [23:52:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2292.codfw.wmnet with OS bookworm [23:52:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676597 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2292.codfw.wmnet with...