[00:11:14] PROBLEM - SSH on an-druid1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:11:50] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [00:14:56] vriley@cumin1003 reimage (PID 3337403) is awaiting input [00:16:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:50:09] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [00:50:10] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1273.eqiad.wmnet with OS bookworm [00:50:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915460 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1273.eqiad.wmnet with OS bookworm completed: - db1273 (**PASS**) -... 
[00:52:35] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [00:56:25] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1274] - vriley@cumin1003" [00:56:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1274] - vriley@cumin1003" [00:56:31] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:57:38] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1274 [00:58:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1274 [00:59:39] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1274.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:04:00] vriley@cumin1003 provision (PID 3352153) is awaiting input [01:08:16] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [01:09:58] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1286531 [01:09:58] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1286531 (owner: 10TrainBranchBot) [01:12:01] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1275] - vriley@cumin1003" [01:12:08] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1275] - vriley@cumin1003" [01:12:08] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:12:30] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1275 [01:12:59] !log vriley@cumin1003 END 
(PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1274.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:14:19] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1275 [01:18:26] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1275.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:20:18] jouncebot: nowandnext [01:20:18] No deployments scheduled for the next 4 hour(s) and 39 minute(s) [01:20:18] In 4 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T0600) [01:21:33] (03PS1) 10Zabe: Start reading from new tables everywhere except commons (2nd try) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286532 (https://phabricator.wikimedia.org/T416548) [01:22:47] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1286531 (owner: 10TrainBranchBot) [01:22:54] (03CR) 10Zabe: [C:03+2] Start reading from new tables everywhere except commons (2nd try) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286532 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [01:23:49] (03Merged) 10jenkins-bot: Start reading from new tables everywhere except commons (2nd try) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286532 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [01:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -9d 11h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [01:25:59] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1286532|Start reading from new tables everywhere except commons (2nd try) (T416548)]] [01:26:02] T416548: Start reading from 
file table on wmf production - https://phabricator.wikimedia.org/T416548 [01:27:14] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1274.eqiad.wmnet with OS bookworm [01:27:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915489 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1274.eqiad.wmnet with OS bookworm [01:27:56] !log zabe@deploy1003 zabe: Backport for [[gerrit:1286532|Start reading from new tables everywhere except commons (2nd try) (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [01:28:25] !log zabe@deploy1003 zabe: Continuing with deployment [01:32:34] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286532|Start reading from new tables everywhere except commons (2nd try) (T416548)]] (duration: 06m 35s) [01:32:38] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [01:37:09] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1275.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [01:41:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915532 (10VRiley-WMF) 05Open→03Resolved [01:41:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915535 (10VRiley-WMF) 05Resolved→03Open [01:42:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915538 (10VRiley-WMF) [01:43:41] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:44:28] (03CR) 10Dragoniez: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) (owner: 10Nvdtn19) [01:58:03] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1275.eqiad.wmnet with OS bookworm [01:58:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915540 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1275.eqiad.wmnet with OS bookworm [02:00:47] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:02] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [02:07:31] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 44s) [02:09:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:48] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1276] - vriley@cumin1003" [02:10:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1276] - vriley@cumin1003" [02:10:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:11:43] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1276 [02:13:26] FIRING: [50x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on 
netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:13:45] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1275.eqiad.wmnet with reason: host reimage [02:15:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1276 [02:15:04] RECOVERY - SSH on an-druid1007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:15:22] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [02:16:05] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1276.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:16:22] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [02:18:33] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1275.eqiad.wmnet with reason: host reimage [02:19:33] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915556 (10VRiley-WMF) [02:19:41] (03CR) 10Dragoniez: viwikivoyage: enable relatedarticle and pop-up Bug: T405724 Change-Id: I93fb76ed14880bd5b7a7fe25bd64fe5d86ed063d (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) (owner: 10Nvdtn19) [02:19:46] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1274.eqiad.wmnet with OS bookworm [02:19:54] 10ops-eqiad, 06SRE, 
06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915557 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1274.eqiad.wmnet with OS bookworm executed with errors: - db1274 (**F... [02:21:24] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [02:25:24] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1277] - vriley@cumin1003" [02:25:30] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1277] - vriley@cumin1003" [02:25:30] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:26:01] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:26:18] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1277 [02:27:22] 06SRE, 10corto, 10Incident Tooling: Increase trusted volunteer's visibility into production incidents - https://phabricator.wikimedia.org/T426137#11915560 (10Peachey88) [02:28:00] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1277 [02:28:26] FIRING: [50x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[02:28:51] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1277.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:31:31] 06SRE, 10corto, 10Incident Tooling: Increase trusted volunteer's visibility into production incidents - https://phabricator.wikimedia.org/T426137#11915565 (10Soda) [02:33:26] FIRING: [50x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:34:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:25] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1276.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:35:40] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [02:37:36] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [02:37:38] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1275.eqiad.wmnet with OS bookworm [02:37:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915566 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1275.eqiad.wmnet with OS bookworm completed: - db1275 (**PASS**) -... 
[02:49:02] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [02:49:06] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1276.eqiad.wmnet with OS bookworm [02:49:18] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1276.eqiad.wmnet with OS bookworm [02:50:12] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1277.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [02:50:17] (03PS1) 10Ryan Kemper: archiva: block scraper UAs at nginx [puppet] - 10https://gerrit.wikimedia.org/r/1286536 (https://phabricator.wikimedia.org/T426114) [02:51:33] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286536 (https://phabricator.wikimedia.org/T426114) (owner: 10Ryan Kemper) [02:59:16] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 117.17 ms [02:59:22] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [03:02:11] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1277.eqiad.wmnet with OS bookworm [03:02:22] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [03:02:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915576 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1277.eqiad.wmnet with OS bookworm [03:03:54] !log vriley@cumin1003 START - 
Cookbook sre.dns.netbox [03:04:45] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1276.eqiad.wmnet with reason: host reimage [03:07:48] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1278] - vriley@cumin1003" [03:07:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [db1278] - vriley@cumin1003" [03:07:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [03:08:08] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host db1278 [03:09:03] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1276.eqiad.wmnet with reason: host reimage [03:09:25] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1278 [03:10:22] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host db1278.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [03:11:40] (03PS6) 10Nvdtn19: viwikivoyage: enable relatedarticle and pop-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) [03:17:53] (03CR) 10Nvdtn19: viwikivoyage: enable relatedarticle and pop-up (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) (owner: 10Nvdtn19) [03:17:55] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1277.eqiad.wmnet with reason: host reimage [03:24:09] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1277.eqiad.wmnet with reason: host reimage [03:25:23] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [03:25:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [03:25:49] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1276.eqiad.wmnet with OS bookworm [03:26:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915614 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1276.eqiad.wmnet with OS bookworm completed: - db1276 (**PASS**) -... [03:28:20] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1278.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [03:33:26] (03CR) 10Dragoniez: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) (owner: 10Nvdtn19) [03:41:33] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host db1278.eqiad.wmnet with OS bookworm [03:41:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915619 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host db1278.eqiad.wmnet with OS bookworm [03:42:20] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [03:42:40] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [03:42:41] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1277.eqiad.wmnet with OS bookworm 
[03:42:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915620 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1277.eqiad.wmnet with OS bookworm completed: - db1277 (**PASS**) -... [03:51:26] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:57:33] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1278.eqiad.wmnet with reason: host reimage [04:00:21] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:01:03] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:01:21] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:02:03] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:03:19] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1278.eqiad.wmnet with reason: host reimage [04:05:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1216721 (https://phabricator.wikimedia.org/T405724) (owner: 10Nvdtn19) [04:06:17] 
PROBLEM - SSH on an-druid1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:20:31] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [04:20:53] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [04:20:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1278.eqiad.wmnet with OS bookworm [04:21:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915629 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host db1278.eqiad.wmnet with OS bookworm completed: - db1278 (**PASS**) -... [04:21:53] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1265-db1290 - https://phabricator.wikimedia.org/T418909#11915630 (10VRiley-WMF) [04:45:03] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:45:19] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:46:03] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:46:19] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:47:27] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with 
command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [04:48:09] RECOVERY - SSH on an-druid1007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:49:27] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [04:50:19] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:52:21] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:10:33] (03CR) 10Marostegui: [C:03+2] data.yaml: Add catherinekelsey to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1286291 (https://phabricator.wikimedia.org/T425565) (owner: 10Marostegui) [05:12:42] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11915644 (10Marostegui) 05In progress→03Resolved a:03Marostegui The change has been deployed.... 
[05:12:53] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics_privatedata_users & Kerberos & SQL Lab for catherinekelsey - https://phabricator.wikimedia.org/T425565#11915647 (10Marostegui) [05:13:59] (03CR) 10Marostegui: [C:03+2] admin: update SSH key for Kartik [puppet] - 10https://gerrit.wikimedia.org/r/1286461 (https://phabricator.wikimedia.org/T425853) (owner: 10Dzahn) [05:15:02] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update ssh key for kartik - https://phabricator.wikimedia.org/T425853#11915654 (10Marostegui) 05In progress→03Resolved a:03Dzahn I've merged the change Daniel pushed. @KartikMistry please allow 20-30 minutes for the key to get spread across product... [05:15:05] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:16:03] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:16:19] PROBLEM - SSH on an-druid1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:19:20] (03CR) 10Marostegui: [C:03+2] data.yaml: Adding cwilliams to users [puppet] - 10https://gerrit.wikimedia.org/r/1285368 (https://phabricator.wikimedia.org/T425930) (owner: 10CWilliams) [05:19:27] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [05:20:11] RECOVERY - SSH on an-druid1007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:21:16] (03CR) 10Marostegui: [C:03+2] data.yaml: Adding cwilliams to ops [puppet] - 10https://gerrit.wikimedia.org/r/1285369 
(https://phabricator.wikimedia.org/T425930) (owner: 10CWilliams) [05:25:23] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -9d 15h 30m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [05:26:34] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11915661 (10Marostegui) 05In progress→03Resolved a:05CWilliams-WMF→03Marostegui ssh access has been deployed. You can try to connect to cumin1003.eqiad.wmnet, which is a... [05:28:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284900 (https://phabricator.wikimedia.org/T316393) (owner: 10Codename Noreste) [05:31:27] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [05:34:25] (03PS1) 10Marostegui: db1253,db2218: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1286735 (https://phabricator.wikimedia.org/T425388) [05:35:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2218.codfw.wmnet with reason: Reimage to Trixie [05:35:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2218: Reimage to Trixie [05:35:33] (03CR) 10Marostegui: [C:03+2] db1253,db2218: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1286735 (https://phabricator.wikimedia.org/T425388) (owner: 10Marostegui) [05:35:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1253.eqiad.wmnet with reason: Reimage to Trixie [05:35:45] !log marostegui@cumin1003 START - 
Cookbook sre.mysql.depool depool db1253: Reimage to Trixie [05:35:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2218: Reimage to Trixie [05:36:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1253: Reimage to Trixie [05:37:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2218.codfw.wmnet with OS trixie [05:37:59] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1253.eqiad.wmnet with OS trixie [05:43:29] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 49400720 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:44:29] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 200368 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:49:54] 06SRE, 10SRE-Access-Requests: Update ssh key for kartik - https://phabricator.wikimedia.org/T425853#11915689 (10KartikMistry) Thanks a lot @Marostegui and @Dzahn [05:54:44] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1253.eqiad.wmnet with reason: host reimage [05:57:33] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2218.codfw.wmnet with reason: host reimage [05:59:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1253.eqiad.wmnet with reason: host reimage [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T0600) [06:03:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2218.codfw.wmnet with reason: host reimage [06:05:20] PROBLEM - SSH on an-druid1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring 
[06:07:28] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [06:08:10] RECOVERY - SSH on an-druid1007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:11:01] (03PS1) 10Marostegui: Revert "db1253,db2218: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1286742 [06:14:58] (03CR) 10Marostegui: [C:03+2] Revert "db1253,db2218: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1286742 (owner: 10Marostegui) [06:19:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286277 (owner: 10DCausse) [06:22:35] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1253.eqiad.wmnet with OS trixie [06:26:01] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and asw1-23-ulsfo (198.35.26.149) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Switch&var-bgp_neighbor=asw1-23-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:26:04] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db1253: after reimage to trixie [06:27:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2218.codfw.wmnet with OS trixie [06:30:34] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2218: after reimage to trixie [06:30:57] jouncebot: now [06:30:58] For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T0600) [06:31:10] oh grand this is our window [06:32:26] (03CR) 10Slyngshede: [C:03+1] "Looks good. Minor nit." [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [06:32:28] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [06:33:41] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:33:45] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter_wancache: add mc1056-mc1059 to production [puppet] - 10https://gerrit.wikimedia.org/r/1286392 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [06:36:26] (03CR) 10Slyngshede: [C:03+1] hiera: using haproxy-awslc on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1286526 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [06:39:53] !log installing Exim security updates on the hosts where Exim is used as a local mail relay [06:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:13] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1286743 (https://phabricator.wikimedia.org/T426142) [06:52:44] (03CR) 10Awight: [C:03+1] testwiki: Disable sub-ref's synthetic list defined refs on test wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286400 (https://phabricator.wikimedia.org/T425967) (owner: 10WMDE-Fisch) [06:56:33] (03PS1) 10Effie Mouzeli: regex.yaml: bump memcached size to ~243GB for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1286744 
(https://phabricator.wikimedia.org/T418263) [06:59:28] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [07:00:05] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T0700). [07:00:05] atsukoito, WMDE-Fisch, and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:13] o/ [07:00:18] \o [07:00:19] I can deploy [07:00:32] (03PS2) 10Effie Mouzeli: regex.yaml: bump memcached size to ~243GB for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1286744 (https://phabricator.wikimedia.org/T418263) [07:00:45] WMDE-Fisch: dcausse I will need a little bit [07:00:46] dcausse: Fine for me :-) [07:00:48] from your window [07:00:50] hi-hi! 
[07:00:52] I am not done with mine [07:00:54] effie: sure np [07:01:01] thank you folks, it should be quick [07:01:20] (03CR) 10JMeybohm: regex.yaml: bump memcached size to ~243GB for eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286744 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [07:01:21] atsukoito: do you mind if I deploy WMDE-Fisch patch and mine first so we have the rest of the window for testing [07:01:21] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286744 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [07:02:01] WMDE-Fisch: do you mind if I bundle your patch with mine (mine is low risk) [07:02:02] sure [07:02:28] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [07:04:11] (03CR) 10Effie Mouzeli: regex.yaml: bump memcached size to ~243GB for eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286744 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [07:04:46] dcausse: nope [07:04:53] cool [07:04:59] :-) [07:05:05] (03CR) 10Effie Mouzeli: regex.yaml: bump memcached size to ~243GB for eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286744 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [07:07:30] (03CR) 10Effie Mouzeli: [C:03+2] regex.yaml: bump memcached size to ~243GB for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1286744 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [07:11:12] sorry folks, puppet running [07:11:23] np! 
[07:11:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1253: after reimage to trixie [07:12:47] (03CR) 10Trueg: [C:03+1] "lgtm" [dumps] - 10https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) (owner: 10Lerickson) [07:15:48] (03CR) 10Muehlenhoff: [C:03+2] thumbor-plugins: Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1285784 (owner: 10Muehlenhoff) [07:15:59] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2218: after reimage to trixie [07:17:47] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [07:17:55] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [07:18:37] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [07:18:46] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [07:19:38] (03PS1) 10Ryan Kemper: global_config: add ldap-sync external services [puppet] - 10https://gerrit.wikimedia.org/r/1286748 (https://phabricator.wikimedia.org/T420691) [07:20:28] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [07:21:29] dcausse: WMDE-Fisch atsukoito you may proceed [07:21:35] effie: thanks! [07:21:40] sorry for taking a piece off your window [07:21:43] WMDE-Fisch: still around? :) [07:21:48] np! 
[07:21:50] dcausse: Sure :-) [07:23:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286400 (https://phabricator.wikimedia.org/T425967) (owner: 10WMDE-Fisch) [07:23:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286277 (owner: 10DCausse) [07:24:11] (03Merged) 10jenkins-bot: testwiki: Disable sub-ref's synthetic list defined refs on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286400 (https://phabricator.wikimedia.org/T425967) (owner: 10WMDE-Fisch) [07:24:15] (03Merged) 10jenkins-bot: Revert^2 "cirrus: use a keywork tokenizer for the plain field for autocomplete" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286277 (owner: 10DCausse) [07:25:00] (03CR) 10Gkyziridis: [C:03+2] changeprop: Configure all wikis for revertrisk-multilingual events. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1283758 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [07:25:04] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1286400|testwiki: Disable sub-ref's synthetic list defined refs on test wikis (T425967)]], [[gerrit:1286277|Revert^2 "cirrus: use a keywork tokenizer for the plain field for autocomplete"]] [07:25:08] T425967: Stop creating synthetic main refs on test.wikipedia - https://phabricator.wikimedia.org/T425967 [07:25:08] (03PS1) 10Ryan Kemper: airflow-test-k8s: add ldap-sync task-pod egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286750 (https://phabricator.wikimedia.org/T420691) [07:27:05] (03Merged) 10jenkins-bot: changeprop: Configure all wikis for revertrisk-multilingual events. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1283758 (https://phabricator.wikimedia.org/T415892) (owner: 10Gkyziridis) [07:27:16] !log dcausse@deploy1003 dcausse, wmde-fisch: Backport for [[gerrit:1286400|testwiki: Disable sub-ref's synthetic list defined refs on test wikis (T425967)]], [[gerrit:1286277|Revert^2 "cirrus: use a keywork tokenizer for the plain field for autocomplete"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:27:42] WMDE-Fisch: should be ready for testing [07:28:08] *testing* [07:28:11] (03PS2) 10Fabfur: hiera: using haproxy-awslc on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1286526 (https://phabricator.wikimedia.org/T419825) [07:28:35] dcausse: this might be the tag, as well 2026-05-13-072539-publish-83 [07:28:40] (03CR) 10Fabfur: haproxy,aptrepo: start testing haproxy-awslc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [07:29:48] dcausse: Looks good :-) [07:30:19] WMDE-Fisch: ok shipping [07:30:25] !log dcausse@deploy1003 dcausse, wmde-fisch: Continuing with deployment [07:32:28] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [07:34:36] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286400|testwiki: Disable sub-ref's synthetic list defined refs on test wikis (T425967)]], [[gerrit:1286277|Revert^2 "cirrus: use a keywork tokenizer for the plain field for autocomplete"]] (duration: 09m 32s) [07:34:40] T425967: Stop creating synthetic main refs on test.wikipedia - https://phabricator.wikimedia.org/T425967 [07:34:46] WMDE-Fisch: should be live [07:35:59] dcausse: It is! :-) Thx! [07:36:03] thanks! 
[07:36:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286371 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [07:36:41] Hi I will deploy changes on changeprop [07:36:48] (03PS2) 10Ryan Kemper: global_config: add ldap-sync external services [puppet] - 10https://gerrit.wikimedia.org/r/1286748 (https://phabricator.wikimedia.org/T420691) [07:37:16] !log gkyziridis@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync [07:37:42] (03Merged) 10jenkins-bot: translate: add opensearch-ttmserver-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286371 (https://phabricator.wikimedia.org/T425377) (owner: 10Atsuko) [07:37:43] !log gkyziridis@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [07:38:05] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1286371|translate: add opensearch-ttmserver-test (T425377)]] [07:38:08] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [07:39:11] !log gkyziridis@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: sync [07:39:34] !log gkyziridis@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [07:40:05] !log dcausse@deploy1003 atsuko, dcausse: Backport for [[gerrit:1286371|translate: add opensearch-ttmserver-test (T425377)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:42:02] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286748 (https://phabricator.wikimedia.org/T420691) (owner: 10Ryan Kemper) [07:43:09] !log dcausse@deploy1003 atsuko, dcausse: Continuing with deployment [07:45:06] (03PS1) 10Effie Mouzeli: site.pp add mc106[0-9]* memecached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286758 (https://phabricator.wikimedia.org/T418263) [07:47:14] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286371|translate: add opensearch-ttmserver-test (T425377)]] (duration: 09m 09s) [07:47:17] T425377: Migrate Ttmserver (Translatewiki application) indices from production OpenSearch to OpenSearch on k8s - https://phabricator.wikimedia.org/T425377 [07:49:32] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286526 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [07:49:39] (03PS1) 10Effie Mouzeli: mcrouter_wancache: add mc1060-mc1063 to production [puppet] - 10https://gerrit.wikimedia.org/r/1286759 (https://phabricator.wikimedia.org/T418263) [07:51:35] (03PS1) 10Effie Mouzeli: mcrouter_wancache: add mc1064-mc1067 to production [puppet] - 10https://gerrit.wikimedia.org/r/1286775 (https://phabricator.wikimedia.org/T418263) [07:51:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:52:39] (03PS1) 10Effie Mouzeli: mcrouter_wancache: add mc1068-mc1069 to production [puppet] - 10https://gerrit.wikimedia.org/r/1286793 (https://phabricator.wikimedia.org/T418263) [07:56:47] !log imported dnsmasq 2.92-1~wmf12u2 to bookworm-wikimedia/main (backport of latest dnsmasq security fixes to our internal build) [07:56:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] andre and brennen: MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T0800). Please do the needful. [08:00:16] jouncebot: nowandnext [08:00:16] For the next 1 hour(s) and 59 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T0800) [08:00:16] In 1 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T1000) [08:00:24] andre: can I sync a config patch? [08:00:26] (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286803 [08:00:28] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:01:26] (03CR) 10Kosta Harlan: [C:03+1] opensearch on k8s: Enable service mesh for clusters [puppet] - 10https://gerrit.wikimedia.org/r/1286504 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [08:02:28] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:03:02] (03PS1) 10Kosta Harlan: IPReputation: Route opensearch_ipoid through envoy service mesh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286804 (https://phabricator.wikimedia.org/T421293) [08:04:00] kostajh: yes, go ahead, I'll wait [08:04:07] Thanks [08:05:12] (03PS1) 10Kosta Harlan: WikimediaEvents: Enable Special:UserLogin instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286805 (https://phabricator.wikimedia.org/T425631) [08:08:43] !log reconfigure link from cr4-ulsfo to asw1-22-ulsfo as 802.1q tagged T424611 [08:08:45] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:46] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [08:08:50] andre: never mind, there’s an issue with the test kitchen instrument my patch depends on, so I’ll go after the train, if there’s time [08:08:59] kostajh, okay [08:11:04] !log imported dnsmasq 2.92-1~wmf13u2 to trixie-wikimedia/main (backport of latest dnsmasq security fixes to our internal build) [08:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:23] (03CR) 10Fabfur: [C:03+2] haproxy,aptrepo: start testing haproxy-awslc [puppet] - 10https://gerrit.wikimedia.org/r/1286521 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [08:11:33] (03CR) 10Atsuko: [C:03+1] global_config: add ldap-sync external services [puppet] - 10https://gerrit.wikimedia.org/r/1286748 (https://phabricator.wikimedia.org/T420691) (owner: 10Ryan Kemper) [08:12:04] andre: I fixed the testkitchen issue, so I can go ahead. But I can also wait. Up to you [08:12:17] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11915999 (10Aklapper) [08:12:25] (03CR) 10Atsuko: [C:03+1] airflow-test-k8s: add ldap-sync task-pod egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286750 (https://phabricator.wikimedia.org/T420691) (owner: 10Ryan Kemper) [08:12:31] kostajh: you please go ahead then [08:12:44] (03CR) 10JMeybohm: [C:03+1] site.pp add mc106[0-9]* memecached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286758 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [08:12:48] ok [08:12:56] kostajh, please give me heads-up once you're done - thanks! 
[08:13:12] Will do [08:13:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286805 (https://phabricator.wikimedia.org/T425631) (owner: 10Kosta Harlan) [08:13:31] (03CR) 10Effie Mouzeli: [C:03+2] site.pp add mc106[0-9]* memecached servers [puppet] - 10https://gerrit.wikimedia.org/r/1286758 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [08:14:26] (03Merged) 10jenkins-bot: WikimediaEvents: Enable Special:UserLogin instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286805 (https://phabricator.wikimedia.org/T425631) (owner: 10Kosta Harlan) [08:14:49] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1286805|WikimediaEvents: Enable Special:UserLogin instrumentation (T425631)]] [08:14:54] T425631: Instrument Special:UserLogin to detect non-JS form submissions - https://phabricator.wikimedia.org/T425631 [08:16:23] (03PS1) 10JMeybohm: Enable media rate limiting on ms-fe1010 [puppet] - 10https://gerrit.wikimedia.org/r/1286808 (https://phabricator.wikimedia.org/T414440) [08:16:44] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1286805|WikimediaEvents: Enable Special:UserLogin instrumentation (T425631)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:16:54] (03CR) 10CI reject: [V:04-1] Enable media rate limiting on ms-fe1010 [puppet] - 10https://gerrit.wikimedia.org/r/1286808 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [08:19:28] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:20:00] !log kharlan@deploy1003 kharlan: Continuing with deployment [08:21:50] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286803 (owner: 10Muehlenhoff) [08:24:07] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286805|WikimediaEvents: Enable Special:UserLogin instrumentation (T425631)]] (duration: 09m 18s) [08:24:11] T425631: Instrument Special:UserLogin to detect non-JS form submissions - https://phabricator.wikimedia.org/T425631 [08:25:12] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [08:25:31] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [08:26:11] (03PS6) 10WAN233: change logo at zh-classical wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) [08:27:31] (03PS1) 10Cathal Mooney: Reverse PTR include: add statement for 2620:0:863:fe0a::/64 [dns] - 10https://gerrit.wikimedia.org/r/1286809 (https://phabricator.wikimedia.org/T408892) [08:27:52] andre: all done [08:27:58] thanks! 
[08:28:04] I will now start promoting group1 wikis to 1.47.0-wmf.2 [08:28:14] (03CR) 10CI reject: [V:04-1] Reverse PTR include: add statement for 2620:0:863:fe0a::/64 [dns] - 10https://gerrit.wikimedia.org/r/1286809 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [08:28:35] (03PS1) 10TrainBranchBot: group1 to 1.47.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286810 (https://phabricator.wikimedia.org/T423911) [08:28:38] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286810 (https://phabricator.wikimedia.org/T423911) (owner: 10TrainBranchBot) [08:28:54] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [08:29:33] (03Merged) 10jenkins-bot: group1 to 1.47.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286810 (https://phabricator.wikimedia.org/T423911) (owner: 10TrainBranchBot) [08:30:14] (03PS1) 10Tiziano Fogli: thanos/query: set thanos-query alert.query-url [puppet] - 10https://gerrit.wikimedia.org/r/1286811 (https://phabricator.wikimedia.org/T425400) [08:30:45] (03CR) 10CI reject: [V:04-1] thanos/query: set thanos-query alert.query-url [puppet] - 10https://gerrit.wikimedia.org/r/1286811 (https://phabricator.wikimedia.org/T425400) (owner: 10Tiziano Fogli) [08:31:33] (03PS2) 10Tiziano Fogli: thanos/query: set thanos-query alert.query-url [puppet] - 10https://gerrit.wikimedia.org/r/1286811 (https://phabricator.wikimedia.org/T425400) [08:32:02] (03CR) 10CI reject: [V:04-1] thanos/query: set thanos-query alert.query-url [puppet] - 10https://gerrit.wikimedia.org/r/1286811 (https://phabricator.wikimedia.org/T425400) (owner: 10Tiziano Fogli) [08:32:27] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [08:32:28] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid 
[08:32:31] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add include for 2620:0:863:fe0a::/64 - cmooney@cumin1003" [08:32:39] (03PS2) 10Cathal Mooney: Reverse PTR include: add statement for 2620:0:863:fe0a::/64 [dns] - 10https://gerrit.wikimedia.org/r/1286809 (https://phabricator.wikimedia.org/T408892) [08:33:07] (03PS1) 10Ayounsi: Add more profile::server_depool policies to DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/1286812 (https://phabricator.wikimedia.org/T425334) [08:33:15] (03PS3) 10Tiziano Fogli: thanos/query: set thanos-query alert.query-url [puppet] - 10https://gerrit.wikimedia.org/r/1286811 (https://phabricator.wikimedia.org/T425400) [08:34:34] (03CR) 10Ayounsi: [C:03+1] Reverse PTR include: add statement for 2620:0:863:fe0a::/64 [dns] - 10https://gerrit.wikimedia.org/r/1286809 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [08:35:36] cmooney@cumin1003 netbox (PID 3414908) is awaiting input [08:35:48] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.47.0-wmf.2 refs T423911 [08:35:52] T423911: 1.47.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T423911 [08:36:01] (03CR) 10Cathal Mooney: [C:03+2] Reverse PTR include: add statement for 2620:0:863:fe0a::/64 [dns] - 10https://gerrit.wikimedia.org/r/1286809 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [08:36:02] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [08:36:54] (03PS1) 10Fabfur: aptrepo: add haproxy gpg key [puppet] - 10https://gerrit.wikimedia.org/r/1286815 (https://phabricator.wikimedia.org/T419825) [08:37:09] !log cmooney@dns2005 START - running authdns-update [08:38:24] !log cmooney@dns2005 END - running authdns-update [08:38:32] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [08:38:49] !log cmooney@cumin1003 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add include for 2620:0:863:fe0a::/64 - cmooney@cumin1003" [08:38:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:40:53] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [08:43:25] (03PS1) 10Ayounsi: Add profile::server_depool policy for kafka hosts [puppet] - 10https://gerrit.wikimedia.org/r/1286820 (https://phabricator.wikimedia.org/T327300) [08:44:03] (03PS2) 10Ayounsi: Add profile::server_depool policy for kafka hosts [puppet] - 10https://gerrit.wikimedia.org/r/1286820 (https://phabricator.wikimedia.org/T327300) [08:45:04] (03CR) 10Slyngshede: [C:03+1] aptrepo: add haproxy gpg key [puppet] - 10https://gerrit.wikimedia.org/r/1286815 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [08:45:08] FIRING: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -9d 18h 47m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [08:45:58] !log installing dnsmasq security updates [08:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:39] (03CR) 10Elukey: [C:03+1] Add profile::server_depool policy for kafka hosts [puppet] - 10https://gerrit.wikimedia.org/r/1286820 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [08:49:32] (03CR) 10Ayounsi: [C:03+2] Add profile::server_depool policy for kafka hosts [puppet] - 10https://gerrit.wikimedia.org/r/1286820 (https://phabricator.wikimedia.org/T327300) (owner: 10Ayounsi) [08:50:08] RESOLVED: [2x] PKICertificateExpiry: Intermediate certificate in the trust chain for discovery expires in -9d 18h 48m 34s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry [08:55:28] PROBLEM - Druid historical on an-druid1007 
is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:58:45] (03PS1) 10Cathal Mooney: Nokia: add description to vlan sub-interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/1286829 (https://phabricator.wikimedia.org/T371088) [09:00:55] (03CR) 10Tiziano Fogli: [C:03+2] Remove deprecated /etc/icinga/objects/nsca_frack.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1285870 (https://phabricator.wikimedia.org/T425424) (owner: 10Jgreen) [09:02:28] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:03:26] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:57] (03CR) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [09:07:41] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [09:08:26] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:10:08] 10SRE-swift-storage, 10Cloud-VPS (Quota-requests): Quota increase request for project swift - https://phabricator.wikimedia.org/T425975#11916231 (10fnegri) +1 [09:10:25] (03CR) 10Ayounsi: [C:03+1] Nokia: add description to vlan sub-interfaces [homer/public] - 
10https://gerrit.wikimedia.org/r/1286829 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [09:10:52] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add records for 2620:0:863:fe09::/64 - cmooney@cumin1003" [09:11:03] (03PS1) 10Cathal Mooney: Reverse PTR Include: add for 2620:0:863:fe09::/64 [dns] - 10https://gerrit.wikimedia.org/r/1286831 (https://phabricator.wikimedia.org/T408892) [09:13:06] (03CR) 10Fabfur: [C:03+2] aptrepo: add haproxy gpg key [puppet] - 10https://gerrit.wikimedia.org/r/1286815 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [09:13:26] FIRING: [49x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:58] cmooney@cumin1003 netbox (PID 3419111) is awaiting input [09:14:14] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add records for 2620:0:863:fe09::/64 - cmooney@cumin1003" [09:14:14] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:14:31] jouncebot: nowandnext [09:14:31] For the next 0 hour(s) and 45 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T0800) [09:14:31] In 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T1000) [09:14:48] andre: Can I deploy a security patch? [09:14:50] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=pki,name=codfw [09:15:02] Dreamy_Jazz: yes, train is stable and done for today [09:15:09] Thanks [09:15:31] Deploying... 
[09:17:16] 06SRE, 06Content-Transform-Team, 06ServiceOps new, 06Wikipedia-Android-App-Backlog (Android Release - FY2025-26): Investigate Code 411 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11916288 (10MLechvien-WMF) a:03Raine Thanks for surfacin... [09:17:30] !log root@cumin1003 START - Cookbook sre.hosts.reimage for host mc1060.eqiad.wmnet with OS bullseye [09:17:39] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1061.eqiad.wmnet with OS bullseye [09:17:41] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1062.eqiad.wmnet with OS bullseye [09:17:43] !log jiji@cumin1003 START - Cookbook sre.hosts.reimage for host mc1064.eqiad.wmnet with OS bullseye [09:17:55] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host pki2002.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:18:26] FIRING: [7x] SystemdUnitFailed: cfssl-ocsprefresh-dse_front_proxy.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:37] (03CR) 10Ayounsi: [C:03+1] Reverse PTR Include: add for 2620:0:863:fe09::/64 [dns] - 10https://gerrit.wikimedia.org/r/1286831 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [09:20:28] PROBLEM - Host pki2002 is DOWN: PING CRITICAL - Packet loss = 100% [09:21:24] !log dreamyjazz Deployed security patch for T423840 [09:21:24] PROBLEM - SSH on an-druid1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:21:33] (03PS1) 10Elukey: installserver: move pki2002 to uefi [puppet] - 10https://gerrit.wikimedia.org/r/1286833 [09:21:37] FIRING: [21x] ProbeDown: Service pki2002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#pki2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:22:51] !log elukey@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki2002.codfw.wmnet with reason: reimage [09:23:26] FIRING: [7x] SystemdUnitFailed: cfssl-ocsprefresh-cassandra.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:23:47] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, but then let's also upgrade firmware? pki2002 is from 2022" [puppet] - 10https://gerrit.wikimedia.org/r/1286833 (owner: 10Elukey) [09:24:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pki2002.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:25:06] !log elukey@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts pki2002.codfw.wmnet [09:25:29] (03Restored) 10Kosta Harlan: EventStreamConfig: Register special_user_login event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284633 (https://phabricator.wikimedia.org/T425631) (owner: 10Kosta Harlan) [09:25:53] Dreamy_Jazz: I have a config patch when you’re done [09:26:06] (03CR) 10Elukey: [C:03+2] installserver: move pki2002 to uefi [puppet] - 10https://gerrit.wikimedia.org/r/1286833 (owner: 10Elukey) [09:26:12] (03CR) 10Cathal Mooney: [C:03+2] Reverse PTR Include: add for 2620:0:863:fe09::/64 [dns] - 10https://gerrit.wikimedia.org/r/1286831 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [09:27:15] (03PS1) 10Fabfur: aptrepo: missing libssl-awslc package in updates file [puppet] - 10https://gerrit.wikimedia.org/r/1286834 (https://phabricator.wikimedia.org/T419825) 
[09:27:21] !log dreamyjazz Deployed security patch for T423840 [09:27:24] !log cmooney@dns2005 START - running authdns-update [09:27:57] kostajh: Done [09:28:11] Thanks, I’ll start in a few minutes [09:28:38] !log cmooney@dns2005 END - running authdns-update [09:29:23] FIRING: JobUnavailable: Reduced availability for job cfssl in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:29:48] !log root@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1060.eqiad.wmnet with reason: host reimage [09:29:57] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1061.eqiad.wmnet with reason: host reimage [09:30:03] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1062.eqiad.wmnet with reason: host reimage [09:30:04] !log jiji@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1064.eqiad.wmnet with reason: host reimage [09:30:05] (03PS3) 10Federico Ceratto: sre.mysql.major-upgrade: Support reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1285355 (https://phabricator.wikimedia.org/T425417) [09:30:32] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 40278472 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:31:32] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3588288 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:31:42] (03PS2) 10Kosta Harlan: EventStreamConfig: Register special_user_login event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284633 (https://phabricator.wikimedia.org/T425631) [09:32:48] (03CR) 10Btullis: archiva: block scraper UAs at nginx (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286536 
(https://phabricator.wikimedia.org/T426114) (owner: 10Ryan Kemper) [09:34:22] (03PS1) 10JMeybohm: ratelimit: Separate cache key prefix and key by underscore [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286835 (https://phabricator.wikimedia.org/T414440) [09:34:25] (03PS1) 10JMeybohm: ratelimit-media: Fix ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286836 (https://phabricator.wikimedia.org/T414440) [09:34:33] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1060.eqiad.wmnet with reason: host reimage [09:34:51] (03PS3) 10Kosta Harlan: EventStreamConfig: Register special_user_login event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284633 (https://phabricator.wikimedia.org/T425631) [09:35:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284633 (https://phabricator.wikimedia.org/T425631) (owner: 10Kosta Harlan) [09:35:28] (03CR) 10Marostegui: [C:03+1] "Go for it, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [09:35:49] (03CR) 10Marostegui: "I'd do dbctl and then mariadb" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [09:36:04] (03Merged) 10jenkins-bot: EventStreamConfig: Register special_user_login event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284633 (https://phabricator.wikimedia.org/T425631) (owner: 10Kosta Harlan) [09:36:32] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1284633|EventStreamConfig: Register special_user_login event stream (T425631)]] [09:36:35] T425631: Instrument Special:UserLogin to detect non-JS form submissions - https://phabricator.wikimedia.org/T425631 [09:38:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286518 (https://phabricator.wikimedia.org/T423658) (owner: 10Jdlrobson) [09:38:28] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1284633|EventStreamConfig: Register special_user_login event stream (T425631)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:38:30] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1061.eqiad.wmnet with reason: host reimage [09:38:58] (03PS2) 10JMeybohm: Enable media rate limiting on ms-fe1010 [puppet] - 10https://gerrit.wikimedia.org/r/1286808 (https://phabricator.wikimedia.org/T414440) [09:38:58] (03PS1) 10JMeybohm: tlsproxy::envoy: Fix order of ratelimit actions [puppet] - 10https://gerrit.wikimedia.org/r/1286837 (https://phabricator.wikimedia.org/T414440) [09:39:36] (03CR) 10CI reject: [V:04-1] Enable media rate limiting on ms-fe1010 [puppet] - 10https://gerrit.wikimedia.org/r/1286808 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [09:39:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/2 (Core: asw1-23-ulsfo:ethernet-1/55 {#change_me10}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:40:30] ^^ this is fine, the cable is not installed, alert was supposed to have been ack'd [09:40:38] (03PS1) 10Marco Fossati: Add robust color fallbacks for QuoteCard average-color styling [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286838 (https://phabricator.wikimedia.org/T425358) [09:40:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286838 (https://phabricator.wikimedia.org/T425358) (owner: 10Marco Fossati) [09:41:13] (03CR) 10Slyngshede: [C:03+1] aptrepo: missing libssl-awslc package in updates file [puppet] - 10https://gerrit.wikimedia.org/r/1286834 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [09:41:25] !log 
kharlan@deploy1003 kharlan: Continuing with deployment [09:41:27] (03CR) 10CI reject: [V:04-1] tlsproxy::envoy: Fix order of ratelimit actions [puppet] - 10https://gerrit.wikimedia.org/r/1286837 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [09:42:10] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1062.eqiad.wmnet with reason: host reimage [09:45:33] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1284633|EventStreamConfig: Register special_user_login event stream (T425631)]] (duration: 09m 01s) [09:45:37] T425631: Instrument Special:UserLogin to detect non-JS form submissions - https://phabricator.wikimedia.org/T425631 [09:49:34] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286840 [09:50:05] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1064.eqiad.wmnet with reason: host reimage [09:50:11] !log root@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1060.eqiad.wmnet with OS bullseye [09:51:08] !log installing ca-certificates update from Bookworm point release [09:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2218 with weight 0 T426142', diff saved to https://phabricator.wikimedia.org/P92498 and previous config saved to /var/cache/conftool/dbconfig/20260513-095337-marostegui.json [09:53:45] T426142: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T426142 [09:53:52] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1061.eqiad.wmnet with OS bullseye [09:53:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T426142 [09:54:02] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2218 to s7 master [puppet] - 
10https://gerrit.wikimedia.org/r/1286743 (https://phabricator.wikimedia.org/T426142) (owner: 10Gerrit maintenance bot) [09:54:35] !log Starting s7 codfw failover from db2220 to db2218 - T426142 [09:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:36] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11916476 (10MoritzMuehlenhoff) [09:56:28] !log installing distro-info-data updates from Bookworm point release [09:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:06] (03PS2) 10Marco Fossati: Fixed card width [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286839 (https://phabricator.wikimedia.org/T425710) [09:57:34] elukey@cumin1003 upgrade-firmware (PID 3422490) is awaiting input [09:57:50] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1062.eqiad.wmnet with OS bullseye [09:58:10] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [09:58:14] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [09:58:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2218 to s7 primary T426142', diff saved to https://phabricator.wikimedia.org/P92499 and previous config saved to /var/cache/conftool/dbconfig/20260513-095814-marostegui.json [09:59:03] (03PS1) 10Marco Fossati: Adjust image size to match fixed width [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286844 (https://phabricator.wikimedia.org/T425710) [09:59:16] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11916484 (10MoritzMuehlenhoff) [09:59:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2220 T426142', diff saved to 
https://phabricator.wikimedia.org/P92500 and previous config saved to /var/cache/conftool/dbconfig/20260513-095934-marostegui.json [09:59:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286839 (https://phabricator.wikimedia.org/T425710) (owner: 10Marco Fossati) [09:59:39] T426142: Switchover s7 master (db2220 -> db2218) - https://phabricator.wikimedia.org/T426142 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T1000) [10:00:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286844 (https://phabricator.wikimedia.org/T425710) (owner: 10Marco Fossati) [10:00:48] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286840 (owner: 10Muehlenhoff) [10:01:12] PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator [10:01:20] (03PS2) 10JMeybohm: tlsproxy::envoy: Fix order of ratelimit actions [puppet] - 10https://gerrit.wikimedia.org/r/1286837 (https://phabricator.wikimedia.org/T414440) [10:01:20] (03PS3) 10JMeybohm: Enable media rate limiting on ms-fe1010 [puppet] - 10https://gerrit.wikimedia.org/r/1286808 (https://phabricator.wikimedia.org/T414440) [10:01:32] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [10:01:52] (03CR) 10CWilliams: "Thanks for the confirmation" [cookbooks] - 10https://gerrit.wikimedia.org/r/1285355 
(https://phabricator.wikimedia.org/T425417) (owner: 10Federico Ceratto) [10:02:14] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [10:02:14] PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:02:20] (03CR) 10CI reject: [V:04-1] Enable media rate limiting on ms-fe1010 [puppet] - 10https://gerrit.wikimedia.org/r/1286808 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [10:02:28] (03PS1) 10Marostegui: db2220: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1286845 (https://phabricator.wikimedia.org/T425388) [10:02:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2220.codfw.wmnet with reason: Reimage to Trixie [10:02:54] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool db2220: Reimage to Trixie [10:02:56] (03PS1) 10Marco Fossati: ShareHighlight: exclude browsers that don't support CSS has [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286846 (https://phabricator.wikimedia.org/T424873) [10:03:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2220: Reimage to Trixie [10:03:10] (03CR) 10JMeybohm: [C:03+2] ratelimit-media: Fix ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286836 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [10:03:10] (03CR) 10Marostegui: [C:03+2] db2220: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1286845 (https://phabricator.wikimedia.org/T425388) (owner: 10Marostegui) [10:03:14] (03CR) 10JMeybohm: [C:03+2] ratelimit: Separate cache key prefix and key by underscore [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286835 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [10:03:15] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286846 (https://phabricator.wikimedia.org/T424873) (owner: 10Marco Fossati) [10:03:38] (03CR) 10CI reject: [V:04-1] tlsproxy::envoy: Fix order of ratelimit actions [puppet] - 10https://gerrit.wikimedia.org/r/1286837 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [10:04:11] (03PS1) 10Marco Fossati: Also skip instrumentation for unsupported browsers [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286847 (https://phabricator.wikimedia.org/T424873) [10:04:23] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [10:04:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286847 (https://phabricator.wikimedia.org/T424873) (owner: 10Marco Fossati) [10:05:13] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1064.eqiad.wmnet with OS bullseye [10:05:26] (03Merged) 10jenkins-bot: ratelimit: Separate cache key prefix and key by underscore [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286835 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [10:05:32] (03Merged) 10jenkins-bot: ratelimit-media: Fix ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286836 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [10:05:36] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [10:06:02] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2220.codfw.wmnet with OS trixie [10:06:03] (03PS5) 10A-pizzata: changes to 
accelerate sqoop landing for mediawiki_history_incremental_v1 [puppet] - 10https://gerrit.wikimedia.org/r/1285335 [10:09:05] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [10:09:08] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [10:09:14] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:09:31] * atsukoito re-deploying opensearch-ttmserver-test to opensearch2 [10:10:43] !log installing Apache security updates on Bullseye [10:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:03] (03PS4) 10JMeybohm: Enable media rate limiting on ms-fe1010 [puppet] - 10https://gerrit.wikimedia.org/r/1286808 (https://phabricator.wikimedia.org/T414440) [10:11:11] (03PS3) 10JMeybohm: tlsproxy::envoy: Fix order of ratelimit actions [puppet] - 10https://gerrit.wikimedia.org/r/1286837 (https://phabricator.wikimedia.org/T414440) [10:12:21] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply [10:12:36] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [10:14:35] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ratelimit: apply [10:14:46] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [10:15:15] (03CR) 10Fabfur: [C:03+2] aptrepo: missing libssl-awslc package in updates file [puppet] - 10https://gerrit.wikimedia.org/r/1286834 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [10:15:28] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply [10:15:31] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [10:15:35] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 
processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:16:12] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [10:16:19] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [10:16:21] RECOVERY - SSH on an-druid1007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:16:29] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286837 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [10:16:45] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [10:17:00] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [10:18:35] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:21:16] !log elukey@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts pki2002.codfw.wmnet [10:22:03] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host pki2002.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:23:33] (03CR) 10Muehlenhoff: [C:03+2] Switch pki2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1283640 (https://phabricator.wikimedia.org/T416664) (owner: 10Muehlenhoff) [10:24:00] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pki2002.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:24:50] 06SRE, 10SRE-Access-Requests: Adding cwilliams to users and ops - https://phabricator.wikimedia.org/T425930#11916661 (10Marostegui) ` When adding a user here, you should also add in the private puppet hiera data: # - An 
api token for requestctl, under profile::conftool::hiddenparma::api_tokens # -... [10:24:52] (03CR) 10Muehlenhoff: [C:03+2] Switch install6003 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1284600 (owner: 10Muehlenhoff) [10:25:16] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host pki2002.codfw.wmnet with OS trixie [10:25:32] (03PS6) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1. This is done by changing the previous refinery-sqoop-whole-mediawiki.sh from one big sequential set of sqoops to a parallel structure: - refinery-sqoop-mediawiki-centralauth-production.sh to run the refinery-sqoop-centralauth-production to sqoop the centralauth production tables. - refinery-sqoop-mediawiki-clouddb.sh to replace refi [10:25:32] sequence refinery-sqoop-mediawiki-history and refinery-sqoop-mediawiki-not-history to sqoop the cloudb tables. - refinery-sqoop-mediawiki-production.sh to run in sequence refinery-sqoop-mediawiki-production-history and refinery-sqoop-mediawiki-production-not-history to sqoop production replicas tables. [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) [10:26:45] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2220.codfw.wmnet with reason: host reimage [10:27:00] (03PS7) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1. This is done by changing the previous refinery-sqoop-whole-mediawiki.sh from one big sequential set of sqoops to a parallel structure: - refinery-sqoop-mediawiki-centralauth-production.sh to sqoop the centralauth production tables. - refinery-sqoop-mediawiki-clouddb.sh to sqoop the cloudb tables. 
- refinery-sqoop-mediawiki-productio [10:27:00] [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) [10:27:15] (03CR) 10Blake: [C:03+1] mcrouter_wancache: add mc1060-mc1063 to production [puppet] - 10https://gerrit.wikimedia.org/r/1286759 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [10:27:31] (03CR) 10A-pizzata: changes to accelerate sqoop landing for mediawiki_history_incremental_v1. This is done by changing the previous refinery-sqoop-whole-mediawi (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [10:27:46] (03CR) 10Cathal Mooney: [C:03+2] Nokia: add description to vlan sub-interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/1286829 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [10:28:00] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter_wancache: add mc1060-mc1063 to production [puppet] - 10https://gerrit.wikimedia.org/r/1286759 (https://phabricator.wikimedia.org/T418263) (owner: 10Effie Mouzeli) [10:28:26] FIRING: [18x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:29:01] (03CR) 10CI reject: [V:04-1] changes to accelerate sqoop landing for mediawiki_history_incremental_v1. This is done by changing the previous refinery-sqoop-whole-mediawiki.sh from one big sequential set of sqoops to a parallel structure: - refinery-sqoop-mediawiki-centralauth-production.sh to sqoop the centralauth production tables. - refinery-sqoop-mediawiki-clouddb.sh to sqoop the cloudb tables. - refinery-sqoop-mediawiki [10:29:01] tables. 
[puppet] - 10https://gerrit.wikimedia.org/r/1285335 (https://phabricator.wikimedia.org/T424355) (owner: 10A-pizzata) [10:29:11] (03Merged) 10jenkins-bot: Nokia: add description to vlan sub-interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/1286829 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [10:30:56] (03PS1) 10Marostegui: Revert "db2220: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1286853 [10:31:07] 10SRE-swift-storage, 10Cloud-VPS (Quota-requests): Quota increase request for project swift - https://phabricator.wikimedia.org/T425975#11916688 (10KOfori) Preemptively adding my approval here as the Data Persistence manager in case that's needed for this quota increase [10:33:16] !log switch eqsin core router ibgp path to route via switches T424611 [10:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:19] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [10:34:35] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:35:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2220.codfw.wmnet with reason: host reimage [10:39:29] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [10:39:38] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [10:39:50] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [10:40:07] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [10:42:05] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pki2002.codfw.wmnet with reason: host reimage [10:45:17] (03CR) 10Cathal Mooney: [C:03+2] gnmic: add subscriptions to openconfig subinterface path [puppet] - 
10https://gerrit.wikimedia.org/r/1278682 (https://phabricator.wikimedia.org/T424683) (owner: 10Cathal Mooney) [10:45:57] (03PS1) 10Fabfur: aptrepo: use package field and not source for haproxy-awslc packages [puppet] - 10https://gerrit.wikimedia.org/r/1286854 (https://phabricator.wikimedia.org/T419825) [10:47:03] (03CR) 10Slyngshede: [C:03+1] aptrepo: use package field and not source for haproxy-awslc packages [puppet] - 10https://gerrit.wikimedia.org/r/1286854 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [10:47:22] (03CR) 10Marostegui: [C:03+2] Revert "db2220: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1286853 (owner: 10Marostegui) [10:48:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install6003.wikimedia.org [10:48:41] (03CR) 10Fabfur: [C:03+2] aptrepo: use package field and not source for haproxy-awslc packages [puppet] - 10https://gerrit.wikimedia.org/r/1286854 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [10:49:31] (03PS1) 10Muehlenhoff: Switch install3004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1286857 [10:49:38] (03PS5) 10Daniel Kinzler: Move Makefiles to standard location [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) [10:49:40] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki2002.codfw.wmnet with reason: host reimage [10:52:07] (03PS2) 10Effie Mouzeli: ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285343 (https://phabricator.wikimedia.org/T419976) [10:52:14] (03CR) 10Effie Mouzeli: [C:03+2] ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285343 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [10:52:19] !log installing Linux 5.10.251-4 on all Bullseye hosts [10:52:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:18] (03Merged) 10jenkins-bot: ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285343 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [10:55:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6003.wikimedia.org [10:55:24] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply [10:55:51] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [10:58:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2220.codfw.wmnet with OS trixie [11:00:04] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T1100). [11:00:45] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqiad and (185.15.58.139) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_drmrs&var-bgp_neighbor= - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:01:34] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:03:10] !log fceratto@cumin1003 START - Cookbook sre.mysql.major-upgrade [11:03:32] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db1236: Upgrading db1236.eqiad.wmnet [11:04:01] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1236: Upgrading db1236.eqiad.wmnet [11:04:23] RESOLVED: JobUnavailable: Reduced availability for job cfssl in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:05:01] (03PS1) 10Cathal Mooney: gnmic: move description-to-tag processor after the subint rewrite [puppet] - 10https://gerrit.wikimedia.org/r/1286860 (https://phabricator.wikimedia.org/T424683) [11:06:34] (03PS1) 10Muehlenhoff: pki:multirootca: Enable nftables on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1286861 [11:06:42] !log fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host db1236.eqiad.wmnet with OS trixie [11:07:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286861 (owner: 10Muehlenhoff) [11:10:04] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool db2220: after reimage to trixie [11:11:34] PROBLEM - SSH on an-druid1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:12:41] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pki2002.codfw.wmnet with OS trixie [11:14:01] (03PS1) 10Gkyziridis: wgRestSandboxSpecs: Add LiftWing API OpenAPI specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286862 (https://phabricator.wikimedia.org/T426081) [11:17:05] (03PS1) 10Muehlenhoff: ircstream: Mark the IRC port as intentionally open to the world [puppet] - 10https://gerrit.wikimedia.org/r/1286863 (https://phabricator.wikimedia.org/T149804) [11:17:11] (03PS1) 10Jelto: miscweb: mount secrets only in data-sync container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286864 (https://phabricator.wikimedia.org/T414405) [11:18:26] moritzm: pki2002 up and running, all done! Looks good to me, I'll wait for your confirmation before repooling (afk for a bit now) [11:19:26] nice! 
I'll have a look in ~ 5 mins [11:19:45] !log installing Linux 6.1.170-3 on all Bookworm hosts [11:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:20] (03PS1) 10Effie Mouzeli: Revert "ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286865 [11:20:48] (03PS2) 10Gkyziridis: wgRestSandboxSpecs: Add LiftWing API OpenAPI specs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286862 (https://phabricator.wikimedia.org/T426081) [11:21:01] (03CR) 10Blake: [C:03+1] Revert "ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286865 (owner: 10Effie Mouzeli) [11:21:13] (03CR) 10Effie Mouzeli: [C:03+2] Revert "ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286865 (owner: 10Effie Mouzeli) [11:21:44] !log fceratto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1236.eqiad.wmnet with reason: host reimage [11:23:20] (03Merged) 10jenkins-bot: Revert "ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286865 (owner: 10Effie Mouzeli) [11:24:12] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/ratelimit: apply [11:24:24] (03CR) 10Jelto: [C:03+2] miscweb: mount secrets only in data-sync container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286864 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [11:24:37] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [11:25:44] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1236.eqiad.wmnet with reason: host reimage [11:26:46] (03Merged) 10jenkins-bot: miscweb: mount secrets only in data-sync container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286864 (https://phabricator.wikimedia.org/T414405) (owner: 
10Jelto) [11:27:32] !log add ibgp peering between cr1-drms and cr2-drmrs over loopback IPs T424611 [11:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:36] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [11:30:22] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [11:30:39] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [11:31:32] (03CR) 10Gmodena: Add max-batches option to cap the size of a wikibase RDF dump. (033 comments) [dumps] - 10https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) (owner: 10Lerickson) [11:33:31] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [11:33:54] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [11:34:35] FIRING: DiskSpace: Disk space config-master1001:9100:/ 3.014% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=config-master1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [11:40:28] !log delete old direct ibgp peering between cr1-drms and cr2-drmrs T424611 [11:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:32] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [11:42:49] (03PS1) 10Kosta Harlan: hcaptcha: Include client Origin in subdomain proxy cache keys [puppet] - 10https://gerrit.wikimedia.org/r/1286872 (https://phabricator.wikimedia.org/T426178) [11:42:51] (03PS1) 10Kosta Harlan: hcaptcha: Include Origin in proxy cache key on hcaptcha.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1286873 (https://phabricator.wikimedia.org/T426178) [11:42:53] (03PS1) 10Kosta Harlan: hcaptcha: Remove ineffective http-level CORS add_headers [puppet] - 10https://gerrit.wikimedia.org/r/1286874 
(https://phabricator.wikimedia.org/T426178) [11:43:05] PROBLEM - SSH on build2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:43:19] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1236.eqiad.wmnet with OS trixie [11:43:54] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'. [11:44:01] RECOVERY - SSH on build2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u9 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:44:27] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'. [11:48:26] FIRING: [20x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:48:53] (03PS1) 10Effie Mouzeli: ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286875 (https://phabricator.wikimedia.org/T419976) [11:51:06] (03CR) 10Blake: [C:03+1] ratelimit: codfw: replace rdb2007 with rdb2011 (Redis 8) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286875 (https://phabricator.wikimedia.org/T419976) (owner: 10Effie Mouzeli) [11:51:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:51:51] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1236: Migration of db1236.eqiad.wmnet completed [11:51:59] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [11:53:37] (03CR) 10Federico Ceratto: "Tested on a Trixie update:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1285355 
(https://phabricator.wikimedia.org/T425417) (owner: 10Federico Ceratto) [11:54:00] (03CR) 10Marostegui: "And all good?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1285355 (https://phabricator.wikimedia.org/T425417) (owner: 10Federico Ceratto) [11:55:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2220: after reimage to trixie [11:57:22] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update records for drmrs ibgp link - cmooney@cumin1003" [11:57:28] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update records for drmrs ibgp link - cmooney@cumin1003" [11:57:28] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:58:38] (03PS3) 10Fabfur: hiera: using haproxy-awslc on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1286526 (https://phabricator.wikimedia.org/T419825) [11:59:35] RESOLVED: DiskSpace: Disk space config-master1001:9100:/ 2.921% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=config-master1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:02:36] !log add ibgp peering between cr1-esams and cr2-esams over loopback IPs T424611 [12:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:39] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [12:03:39] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286526 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [12:09:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet [12:09:53] (03CR) 10Ayounsi: [C:03+1] gnmic: move description-to-tag processor after the subint rewrite 
[puppet] - 10https://gerrit.wikimedia.org/r/1286860 (https://phabricator.wikimedia.org/T424683) (owner: 10Cathal Mooney) [12:10:32] (03PS1) 10Jelto: miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286881 (https://phabricator.wikimedia.org/T414405) [12:11:57] (03CR) 10Cathal Mooney: [C:03+2] gnmic: move description-to-tag processor after the subint rewrite [puppet] - 10https://gerrit.wikimedia.org/r/1286860 (https://phabricator.wikimedia.org/T424683) (owner: 10Cathal Mooney) [12:13:26] FIRING: [20x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:13:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet [12:14:19] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11917083 (10MoritzMuehlenhoff) [12:18:48] FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:18:59] (03CR) 10Slyngshede: [C:03+1] hiera: using haproxy-awslc on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1286526 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [12:20:21] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286808 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [12:21:28] (03PS1) 10Mszwarc: Fix TypeError on saving userrights interwiki [extensions/EventBus] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286884 (https://phabricator.wikimedia.org/T426185) [12:21:29] (03PS2) 10Jelto: miscweb: remove wmf-navigator public and private 
config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286881 (https://phabricator.wikimedia.org/T414405) [12:22:32] jouncebot: nowandnext [12:22:32] No deployments scheduled for the next 0 hour(s) and 37 minute(s) [12:22:33] In 0 hour(s) and 37 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T1300) [12:23:17] (03CR) 10CI reject: [V:04-1] miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286881 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [12:23:52] (03PS3) 10Jelto: miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286881 (https://phabricator.wikimedia.org/T414405) [12:25:42] Would anyone mind if I deploy a fix for T426185 in approx. 5 minutes (once CI gets the patch merged to master)? [12:25:43] T426185: TypeError: MediaWiki\Extension\EventBus\HookHandlers\MediaWiki\UserChangeHooks::calculateUserEffectiveGroups(): Argument #1 ($user) must be of type MediaWiki\User\User, MediaWiki\User\UserIdentityValue given, called in /srv/med - https://phabricator.wikimedia.org/T426185 [12:26:01] Even stashbot is sad about that :D [12:27:52] (03CR) 10JMeybohm: [C:03+2] Enable media rate limiting on ms-fe1010 [puppet] - 10https://gerrit.wikimedia.org/r/1286808 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [12:28:02] (03CR) 10JMeybohm: [C:03+2] tlsproxy::envoy: Fix order of ratelimit actions [puppet] - 10https://gerrit.wikimedia.org/r/1286837 (https://phabricator.wikimedia.org/T414440) (owner: 10JMeybohm) [12:29:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy1003 using scap backport" [extensions/EventBus] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286884 (https://phabricator.wikimedia.org/T426185) (owner: 10Mszwarc) [12:30:30] (03CR) 10Ladsgroup: "Yeah" [cookbooks] - 
10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [12:31:21] (03PS4) 10Jelto: miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286881 (https://phabricator.wikimedia.org/T414405) [12:33:23] (03PS1) 10Dragoniez: ApiQueryGlobalUsers: Fix parsing logic for legacy log_params entries [extensions/CentralAuth] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286890 (https://phabricator.wikimedia.org/T426033) [12:34:57] (03CR) 10Federico Ceratto: "And for going back to RW mode? I was suggesting first MariaDB then dbctl so that MW does not fail to write during the 2 steps." [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [12:35:37] (03PS1) 10Bartosz Dziewoński: Add 'Promise-Non-Write-API-Action' to $wgAllowedCorsHeaders [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286891 (https://phabricator.wikimedia.org/T425972) [12:35:48] (03PS1) 10Bartosz Dziewoński: Add 'Promise-Non-Write-API-Action' to $wgAllowedCorsHeaders [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286892 (https://phabricator.wikimedia.org/T425972) [12:37:21] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1236: Migration of db1236.eqiad.wmnet completed [12:37:22] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.major-upgrade (exit_code=0) [12:37:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CentralAuth] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286890 (https://phabricator.wikimedia.org/T426033) (owner: 10Dragoniez) [12:38:40] !log add ibgp peering between cr1-magru and cr2-magru over loopback IPs T424611 [12:38:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:44] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [12:40:08] (03CR) 10Ayounsi: [C:03+1] Switch install3004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1286857 (owner: 10Muehlenhoff) [12:40:49] !log depool cp7001 to test haproxy-awslc (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1286526) (T419825) [12:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:55] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [12:41:14] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp7001.* [12:43:01] (03CR) 10Fabfur: [C:03+2] hiera: using haproxy-awslc on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1286526 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [12:43:26] (03Merged) 10jenkins-bot: Fix TypeError on saving userrights interwiki [extensions/EventBus] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286884 (https://phabricator.wikimedia.org/T426185) (owner: 10Mszwarc) [12:43:33] (03PS1) 10Bartosz Dziewoński: ApiQueryGlobalUsers: Fix parsing logic for legacy log_params entries [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286897 (https://phabricator.wikimedia.org/T426033) [12:43:54] !log mszwarc@deploy1003 Started scap sync-world: Backport for [[gerrit:1286884|Fix TypeError on saving userrights interwiki (T426185)]] [12:43:57] T426185: TypeError: MediaWiki\Extension\EventBus\HookHandlers\MediaWiki\UserChangeHooks::calculateUserEffectiveGroups(): Argument #1 ($user) must be of type MediaWiki\User\User, MediaWiki\User\UserIdentityValue given, called in /srv/med - https://phabricator.wikimedia.org/T426185 [12:44:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CentralAuth] 
(wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286897 (https://phabricator.wikimedia.org/T426033) (owner: 10Bartosz Dziewoński) [12:45:23] (03CR) 10Lucas Werkmeister (WMDE): change logo at zh-classical wikipedia (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1276284 (https://phabricator.wikimedia.org/T424128) (owner: 10WAN233) [12:45:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286891 (https://phabricator.wikimedia.org/T425972) (owner: 10Bartosz Dziewoński) [12:45:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286892 (https://phabricator.wikimedia.org/T425972) (owner: 10Bartosz Dziewoński) [12:45:50] !log mszwarc@deploy1003 mszwarc: Backport for [[gerrit:1286884|Fix TypeError on saving userrights interwiki (T426185)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:46:24] !log mszwarc@deploy1003 mszwarc: Continuing with deployment [12:50:36] !log mszwarc@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286884|Fix TypeError on saving userrights interwiki (T426185)]] (duration: 06m 42s) [12:50:40] T426185: TypeError: MediaWiki\Extension\EventBus\HookHandlers\MediaWiki\UserChangeHooks::calculateUserEffectiveGroups(): Argument #1 ($user) must be of type MediaWiki\User\User, MediaWiki\User\UserIdentityValue given, called in /srv/med - https://phabricator.wikimedia.org/T426185 [12:50:40] My deployment is done [12:52:23] jouncebot next [12:52:23] In 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T1300) [12:53:08] We can probably get started with the backport window if there's no objections. [12:53:17] (03CR) 10KartikMistry: [C:03+2] cxserver: Update cxserver to 2026-04-23-114216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277294 (https://phabricator.wikimedia.org/T423002) (owner: 10KartikMistry) [12:53:33] IMHO the backport window should start when it starts [12:53:48] (and also people should stop scheduling so many changes that there’s no hope to get through them all in one hour) [12:54:07] It's quite packed [12:54:17] but I’d be fine with kicking off the gate-and-submit build for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ArticleGuidance/+/1286359 a few minutes earlier [12:54:21] * Lucas_WMDE looks how long that usually takes [12:54:50] hm, not a lot else to go on at https://gerrit.wikimedia.org/r/q/project:mediawiki/extensions/ArticleGuidance+-branch:master [12:54:54] (new extension?) 
[12:55:13] but looks like on master gate-and-submit just takes a few minutes https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ArticleGuidance/+/1285928 [12:55:28] (03Merged) 10jenkins-bot: cxserver: Update cxserver to 2026-04-23-114216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1277294 (https://phabricator.wikimedia.org/T423002) (owner: 10KartikMistry) [12:56:31] (03PS1) 10Cathal Mooney: Add new IBGP sub-interfaces to OSPF on core routers at POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1286913 (https://phabricator.wikimedia.org/T424611) [12:56:40] hi there [12:58:06] I have several patches, but can backport in one shot [12:58:11] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Goes together with I5b6bd4fedd (IMHO it’s slightly bad style to deploy a config variable before the corresponding code – I didn’t think to" [extensions/ArticleGuidance] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286359 (https://phabricator.wikimedia.org/T425389) (owner: 10Sbisson) [12:58:20] (03CR) 10Xcollazo: archiva: block scraper UAs at nginx (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286536 (https://phabricator.wikimedia.org/T426114) (owner: 10Ryan Kemper) [12:58:24] (03CR) 10Muehlenhoff: [C:03+1] "Looks like a good basis for further tests!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1282286 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [12:58:29] sbassett: IMHO you can +2 your change now (and then start spiderpig when the window starts) [12:58:32] ooops [12:58:35] (03CR) 10Elukey: [C:03+1] pki:multirootca: Enable nftables on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1286861 (owner: 10Muehlenhoff) [12:58:38] that was for stephanebisson sorry [12:59:31] (03PS1) 10Fabfur: cache::haproxy: include haproxy32-awslc in checks for haproxy32 [puppet] - 10https://gerrit.wikimedia.org/r/1286914 (https://phabricator.wikimedia.org/T419825) [12:59:51] (03CR) 10Cathal Mooney: [C:03+1] GraphQL: replace termination_z upstream_speed with commit_rate [software/homer] - 10https://gerrit.wikimedia.org/r/1286310 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T1300). [13:00:05] stephanebisson, codenamenoreste, mfossati, Dragoniez, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:09] o/ [13:00:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286863 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [13:00:17] RECOVERY - orchestrator resolve cache non-FQDNs on dborch1002 is OK: OK: all orchestrator resolve cache entries are FQDNs https://wikitech.wikimedia.org/wiki/Orchestrator [13:00:18] I can deploy [13:00:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/ArticleGuidance] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286359 (https://phabricator.wikimedia.org/T425389) (owner: 10Sbisson) [13:00:37] (03CR) 10Slyngshede: [C:03+1] cache::haproxy: include haproxy32-awslc in checks for haproxy32 [puppet] - 10https://gerrit.wikimedia.org/r/1286914 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [13:01:04] hi [13:01:12] o/ [13:01:20] (03CR) 10Nikerabbit: "Why is that bad style? I've seen people do it to preserve wanted behavior that is different from the extension defaults." [extensions/ArticleGuidance] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286359 (https://phabricator.wikimedia.org/T425389) (owner: 10Sbisson) [13:01:22] Lucas_WMDE: as you wish :-) [13:01:30] Lucas_WMDE the config change was deployed and tested on testwiki yesterday since it was already on wmf.2. Based on positive results, we decided to backport the code to wmf.1 to be able to do more testing on simplewiki. [13:01:42] ah ok, that’s fair [13:01:58] (03CR) 10Cathal Mooney: [C:03+1] "Overall lgtm, one nit below." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi) [13:02:52] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286914 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [13:03:24] (03Merged) 10jenkins-bot: Add configurable user-agent and sparql endpoint url [extensions/ArticleGuidance] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286359 (https://phabricator.wikimedia.org/T425389) (owner: 10Sbisson) [13:03:53] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1286359|Add configurable user-agent and sparql endpoint url (T425389)]] [13:03:57] T425389: Display the outline name that applies when listing Wikidata items in Article guidance - https://phabricator.wikimedia.org/T425389 [13:04:31] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=pki,name=codfw [13:05:08] 06SRE, 06Traffic, 13Patch-For-Review: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11917265 (10SLyngshede-WMF) For testing I suggest rolling out to e.g. MAGRU first. We can then test that cookies get set with ` $ curl -I -4 connect-to en.wikipedia.org:4... [13:05:44] (03PS1) 10Kosta Harlan: WikiEditor: Populate user_groups in EditAttemptStep events [extensions/WikiEditor] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286917 (https://phabricator.wikimedia.org/T424010) [13:05:52] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1286359|Add configurable user-agent and sparql endpoint url (T425389)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:06:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikiEditor] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286917 (https://phabricator.wikimedia.org/T424010) (owner: 10Kosta Harlan) [13:06:18] (03PS2) 10Jforrester: mathoid: Upgrade image to 2026-05-12-175031 with Node 24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286448 (https://phabricator.wikimedia.org/T364779) [13:06:18] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-05-05-223640 to 2026-05-12-211330 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286919 (https://phabricator.wikimedia.org/T423369) [13:06:20] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-05-06-154732 to 2026-05-12-210548 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286920 [13:06:34] (03CR) 10Fabfur: [C:03+2] cache::haproxy: include haproxy32-awslc in checks for haproxy32 [puppet] - 10https://gerrit.wikimedia.org/r/1286914 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [13:07:59] !log sbisson@deploy1003 sbisson: Continuing with deployment [13:08:04] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "I think there’s two parts:" [extensions/ArticleGuidance] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286359 (https://phabricator.wikimedia.org/T425389) (owner: 10Sbisson) [13:08:46] mfossati: do you want to deploy your changes yourself (once stephanebisson is done) or do you need a deployer? 
[13:09:03] (codenamenoreste would be next in line but doesn’t seem to be here yet) [13:09:04] Lucas_WMDE: I can self-deploy [13:09:11] ok [13:11:22] (03CR) 10Ayounsi: [C:03+1] Add new IBGP sub-interfaces to OSPF on core routers at POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1286913 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [13:12:11] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286359|Add configurable user-agent and sparql endpoint url (T425389)]] (duration: 08m 18s) [13:12:15] T425389: Display the outline name that applies when listing Wikidata items in Article guidance - https://phabricator.wikimedia.org/T425389 [13:12:55] Lucas_WMDE: I'm ready [13:13:08] mfossati: go ahead [13:13:16] (unless stephanebisson still has something to deploy) [13:13:27] I'm done, thanks [13:13:41] ok thx [13:14:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286518 (https://phabricator.wikimedia.org/T423658) (owner: 10Jdlrobson) [13:14:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286838 (https://phabricator.wikimedia.org/T425358) (owner: 10Marco Fossati) [13:14:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286839 (https://phabricator.wikimedia.org/T425710) (owner: 10Marco Fossati) [13:14:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286844 (https://phabricator.wikimedia.org/T425710) (owner: 10Marco Fossati) [13:14:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap 
backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286846 (https://phabricator.wikimedia.org/T424873) (owner: 10Marco Fossati) [13:14:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286847 (https://phabricator.wikimedia.org/T424873) (owner: 10Marco Fossati) [13:15:08] Dragoniez, MatmaRex: is there a reason why the wmf.1 and wmf.2 backports of “ApiQueryGlobalUsers: Fix parsing logic for legacy log_params entries” are scheduled by different users? ^^ [13:15:30] should they be deployed separately or together? (and separately or together with the Promise-Non-Write-API-Action change?) [13:15:58] Lucas_WMDE, Dragoniez, MatmaRex: I'll just need a few minutes to test all patches, please bear with me [13:16:08] yup, yup, I’m just looking ahead for the next deploys [13:16:09] Not really, MatmaRex just kindly followed up on me [13:16:12] scap is all yours [13:16:16] Dragoniez: ok :) [13:16:21] (03PS2) 10Ayounsi: Replace upstream_speed with commit_rate [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) [13:16:27] \o/ [13:16:28] i think they can all go out together [13:16:28] I was too blind to realize wmf.1 is still alive [13:17:28] (03CR) 10Ayounsi: Replace upstream_speed with commit_rate (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1286311 (https://phabricator.wikimedia.org/T424839) (owner: 10Ayounsi) [13:17:33] (03PS1) 10Elukey: installserver: move pki-root's config to UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1286921 [13:17:39] (03Merged) 10jenkins-bot: [Share Highlight] Exclude section edit links, footnotes from selection [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286518 (https://phabricator.wikimedia.org/T423658) (owner: 10Jdlrobson) [13:17:41] (03Merged) 
10jenkins-bot: Add robust color fallbacks for QuoteCard average-color styling [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286838 (https://phabricator.wikimedia.org/T425358) (owner: 10Marco Fossati) [13:17:42] (03CR) 10CI reject: [V:04-1] Fixed card width [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286839 (https://phabricator.wikimedia.org/T425710) (owner: 10Marco Fossati) [13:17:51] (03Merged) 10jenkins-bot: Adjust image size to match fixed width [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286844 (https://phabricator.wikimedia.org/T425710) (owner: 10Marco Fossati) [13:17:53] (03Merged) 10jenkins-bot: ShareHighlight: exclude browsers that don't support CSS has [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286846 (https://phabricator.wikimedia.org/T424873) (owner: 10Marco Fossati) [13:17:55] (03Merged) 10jenkins-bot: Also skip instrumentation for unsupported browsers [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286847 (https://phabricator.wikimedia.org/T424873) (owner: 10Marco Fossati) [13:18:09] FYI mfossati it seems like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReaderExperiments/+/1286839 may have a merge conflict with another +2ed patch [13:18:20] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host pki-root1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [13:18:31] RECOVERY - SSH on an-druid1007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:18:32] ouch [13:18:33] …https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReaderExperiments/+/1286839 needs a rebase? 
o_O [13:18:54] on it [13:18:57] it seems like they were originally proposed on `master` as stacked patched (FWICS), but then may have been cherry-picked as unstacked patches [13:18:57] ok [13:19:20] (03PS2) 10Cathal Mooney: Add new IBGP sub-interfaces to OSPF on core routers at POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1286913 (https://phabricator.wikimedia.org/T424611) [13:19:52] (03PS3) 10Cathal Mooney: Add new IBGP sub-interfaces to OSPF on core routers at POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1286913 (https://phabricator.wikimedia.org/T424611) [13:20:11] (03PS3) 10Slyngshede: P:idp webauthn, with database backend [puppet] - 10https://gerrit.wikimedia.org/r/1282286 (https://phabricator.wikimedia.org/T372892) [13:20:52] (03CR) 10Slyngshede: P:idp webauthn, with database backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1282286 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [13:21:21] (03PS3) 10Marco Fossati: Fixed card width [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286839 (https://phabricator.wikimedia.org/T425710) [13:22:54] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiEditor/+/1286917 could be bundled with something else, and doesn’t need verification [13:23:00] (03CR) 10Herron: [C:03+1] thanos/query: set thanos-query alert.query-url [puppet] - 10https://gerrit.wikimedia.org/r/1286811 (https://phabricator.wikimedia.org/T425400) (owner: 10Tiziano Fogli) [13:23:07] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282286 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [13:23:52] Lucas_WMDE: I've rebased https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReaderExperiments/+/1286839, shall I just hit "retry job" on SpiderPig? 
[13:24:01] yes, I think that’s the right fix [13:24:38] cool [13:25:01] !log installing openjdk-11 security updates [13:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286839 (https://phabricator.wikimedia.org/T425710) (owner: 10Marco Fossati) [13:25:37] meanwhile I found this absolute gem in logsapm-watch: T426195 [13:25:38] T426195: InvalidArgumentException: Invalid language code "' + variant + '" - https://phabricator.wikimedia.org/T426195 [13:25:49] which judging by the error message must be some *hilarious* quoting mixup [13:25:55] (though I haven’t found it yet with a quick codesearch) [13:26:38] (03Merged) 10jenkins-bot: Fixed card width [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286839 (https://phabricator.wikimedia.org/T425710) (owner: 10Marco Fossati) [13:26:50] * A_smart_kitten enters the race to find the quoting mixup /hj [13:27:13] !log mfossati@deploy1003 Started scap sync-world: Backport for [[gerrit:1286518|[Share Highlight] Exclude section edit links, footnotes from selection (T423658)]], [[gerrit:1286838|Add robust color fallbacks for QuoteCard average-color styling (T425358)]], [[gerrit:1286839|Fixed card width (T425710)]], [[gerrit:1286844|Adjust image size to match fixed width (T425710)]], [[gerrit:1286846|ShareHighlight: exclude browsers th [13:27:13] at don't support CSS has (T424873)]], [[gerrit:1286847|Also skip instrumentation for unsupported browsers (T424873)]] [13:27:19] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pki-root1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [13:27:20] T423658: shareHighlight - if headers were highlighlited, Share card shows "edit" word - 
https://phabricator.wikimedia.org/T423658 [13:27:20] T425358: Share Highlight: dark background and invisible text on Firefox - https://phabricator.wikimedia.org/T425358 [13:27:21] T425710: Make cards fixed width - https://phabricator.wikimedia.org/T425710 [13:27:21] T424873: [Share Highlights] Unsupported CSS "has" feature - https://phabricator.wikimedia.org/T424873 [13:28:50] A_smart_kitten: I found it, it’s in the content :3 [13:28:56] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: Java security update - jmm@cumin2002 [13:29:09] !log mfossati@deploy1003 jdlrobson, mfossati: Backport for [[gerrit:1286518|[Share Highlight] Exclude section edit links, footnotes from selection (T423658)]], [[gerrit:1286838|Add robust color fallbacks for QuoteCard average-color styling (T425358)]], [[gerrit:1286839|Fixed card width (T425710)]], [[gerrit:1286844|Adjust image size to match fixed width (T425710)]], [[gerrit:1286846|ShareHighlight: exclude browsers that d [13:29:09] on't support CSS has (T424873)]], [[gerrit:1286847|Also skip instrumentation for unsupported browsers (T424873)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:29:25] 10ops-codfw, 06SRE, 06DC-Ops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197 (10ayounsi) 03NEW p:05Triage→03Medium [13:29:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11917426 (10ayounsi) [13:29:44] tesitng, hold on [13:29:45] Lucas_WMDE: ahh, good idea to check the page content :3 [13:30:12] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11917430 (10ayounsi) [13:30:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1286921 (owner: 10Elukey) [13:32:01] (03CR) 10Elukey: [C:03+2] installserver: move pki-root's config to UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1286921 (owner: 10Elukey) [13:36:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: pod AB switches upgrade (2026) - https://phabricator.wikimedia.org/T426197#11917492 (10ayounsi) [13:36:17] !log mfossati@deploy1003 jdlrobson, mfossati: Continuing with deployment [13:36:52] (03PS3) 10Kgraessle: Enable AutoModerator on Italian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192921 (https://phabricator.wikimedia.org/T405152) [13:36:59] (03CR) 10Jelto: [C:03+2] miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286881 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [13:37:06] (03CR) 10Ayounsi: [C:03+1] Add new IBGP sub-interfaces to OSPF on core routers at POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1286913 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [13:37:43] (03PS1) 10Elukey: team-sre: modify pki's alert to notify users earlier [alerts] - 10https://gerrit.wikimedia.org/r/1286923 [13:38:37] (03CR) 10Ayounsi: [C:03+2] "self merging as 
it's a noop and similar to the patch sent yesterday" [puppet] - 10https://gerrit.wikimedia.org/r/1286812 (https://phabricator.wikimedia.org/T425334) (owner: 10Ayounsi) [13:39:28] (03Merged) 10jenkins-bot: miscweb: remove wmf-navigator public and private config from web container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286881 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [13:39:35] I have patch 1284900 [13:39:53] 10ops-eqiad, 06SRE, 06DC-Ops: Q3 :rack/setup/install cloudvirt refresh - https://phabricator.wikimedia.org/T425088#11917503 (10elukey) 05Open→03Stalled We got confirmation from Supermicro that the root user is reserved from now on, so we need to solve T426180 before proceeding. [13:39:59] ok [13:40:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11917508 (10elukey) 05Open→03Stalled We got confirmation from Supermicro that the root user is reserved from now on, so we need to solve T426180 b... 
[13:40:29] !log mfossati@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286518|[Share Highlight] Exclude section edit links, footnotes from selection (T423658)]], [[gerrit:1286838|Add robust color fallbacks for QuoteCard average-color styling (T425358)]], [[gerrit:1286839|Fixed card width (T425710)]], [[gerrit:1286844|Adjust image size to match fixed width (T425710)]], [[gerrit:1286846|ShareHighlight: exclude browsers t [13:40:29] hat don't support CSS has (T424873)]], [[gerrit:1286847|Also skip instrumentation for unsupported browsers (T424873)]] (duration: 13m 16s) [13:40:40] T423658: shareHighlight - if headers were highlighlited, Share card shows "edit" word - https://phabricator.wikimedia.org/T423658 [13:40:41] T425358: Share Highlight: dark background and invisible text on Firefox - https://phabricator.wikimedia.org/T425358 [13:40:41] T425710: Make cards fixed width - https://phabricator.wikimedia.org/T425710 [13:40:41] T424873: [Share Highlights] Unsupported CSS "has" feature - https://phabricator.wikimedia.org/T424873 [13:40:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284900 (https://phabricator.wikimedia.org/T316393) (owner: 10Codename Noreste) [13:41:00] Lucas_WMDE: done, thank you [13:41:05] thanks! 
[13:41:13] deploying the config change for codenamenoreste now before the backports for Dragoniez and MatmaRex [13:41:40] it's about removing some user rights from user groups for FlaggedRevs on German Wikipedia [13:42:06] (03Merged) 10jenkins-bot: Completely disable MediaWiki page patrolling functions on German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1284900 (https://phabricator.wikimedia.org/T316393) (owner: 10Codename Noreste) [13:42:32] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1284900|Completely disable MediaWiki page patrolling functions on German Wikipedia (T316393)]] [13:42:33] FlaggedRevs, everyone’s favorite code [13:42:36] T316393: Disable all patrolling functions on de.wikipedia - https://phabricator.wikimedia.org/T316393 [13:42:47] (03PS1) 10Jforrester: Disable wgWikiLambdaEnableAbstractClientMode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286924 (https://phabricator.wikimedia.org/T422647) [13:43:49] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286890 (https://phabricator.wikimedia.org/T426033) (owner: 10Dragoniez) [13:43:55] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286897 (https://phabricator.wikimedia.org/T426033) (owner: 10Bartosz Dziewoński) [13:43:58] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286891 (https://phabricator.wikimedia.org/T425972) (owner: 10Bartosz Dziewoński) [13:44:04] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286892 (https://phabricator.wikimedia.org/T425972) (owner: 
10Bartosz Dziewoński) [13:44:04] (03CR) 10Elukey: [C:03+1] ircstream: Mark the IRC port as intentionally open to the world [puppet] - 10https://gerrit.wikimedia.org/r/1286863 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [13:44:28] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, codenamenoreste: Backport for [[gerrit:1284900|Completely disable MediaWiki page patrolling functions on German Wikipedia (T316393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:44:39] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [13:44:48] codenamenoreste: please test :) [13:45:25] (03CR) 10Muehlenhoff: [C:03+2] Switch install3004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1286857 (owner: 10Muehlenhoff) [13:45:50] I don't see autopatrol nor patrol in the bot and sysop user groups, and patrolmarks is gone; it works ;) [13:45:56] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, codenamenoreste: Continuing with deployment [13:46:01] (03PS5) 10Lerickson: Add max-batches option to cap the size of a wikibase RDF dump. [dumps] - 10https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) [13:46:17] alright, thanks for testing [13:46:32] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11917529 (10MoritzMuehlenhoff) [13:47:08] (03PS6) 10Lerickson: Add max-batches option to cap the size of a wikibase RDF dump. 
[dumps] - 10https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) [13:47:19] (03Merged) 10jenkins-bot: ApiQueryGlobalUsers: Fix parsing logic for legacy log_params entries [extensions/CentralAuth] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286890 (https://phabricator.wikimedia.org/T426033) (owner: 10Dragoniez) [13:48:08] (03Merged) 10jenkins-bot: ApiQueryGlobalUsers: Fix parsing logic for legacy log_params entries [extensions/CentralAuth] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286897 (https://phabricator.wikimedia.org/T426033) (owner: 10Bartosz Dziewoński) [13:49:03] (03CR) 10Lerickson: "Thanks!" [dumps] - 10https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) (owner: 10Lerickson) [13:49:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: Java security update - jmm@cumin2002 [13:49:17] (03CR) 10Genoveva Galarza: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286924 (https://phabricator.wikimedia.org/T422647) (owner: 10Jforrester) [13:50:08] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1284900|Completely disable MediaWiki page patrolling functions on German Wikipedia (T316393)]] (duration: 07m 36s) [13:50:11] T316393: Disable all patrolling functions on de.wikipedia - https://phabricator.wikimedia.org/T316393 [13:50:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286891 (https://phabricator.wikimedia.org/T425972) (owner: 10Bartosz Dziewoński) [13:50:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286892 (https://phabricator.wikimedia.org/T425972) (owner: 10Bartosz Dziewoński) [13:50:36] let’s see if this one 
finishes in time [13:51:08] Zuul says 5 more minutes in gate-and-submit? /o\ [13:53:57] i have no idea if this would work or not, but as a wild pseudo-idea: what if, for a backport window with a *lot* of wmf/* branch backports, those backport cherry-picks were rebased on top of each other prior to the window starting? as i wonder if that would then lead to the success cache being hit for those patches when they then moved onto gate-andpsubmit [13:54:07] *gate-and-submit (i can't type) [13:54:22] oh wait [13:54:31] no that wouldn't work probably as they're gonna be in different repos [13:55:10] I don’t think the success cache is shared between test and gate-and-submit builds anyway? [13:55:14] ....unless they were told to `depends-on` each other, and have a stack of patches built that way? [13:55:14] or at least I’d be quite surprised if it is [13:55:51] Lucas_WMDE: i may be wrong; but I feel like I recall times where the gate-and-submit time for a wmf/*-branch backport has been really fast due to hitting an earlier `test` success cache [13:56:09] Lucas_WMDE: It will not be, unless the git hashes are identical (which as the branch name is part of the git hash AIUI, won't happen). 
[13:56:18] (03CR) 10Jforrester: [C:03+2] mathoid: Upgrade image to 2026-05-12-175031 with Node 24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286448 (https://phabricator.wikimedia.org/T364779) (owner: 10Jforrester) [13:56:24] (03PS1) 10Jelto: miscweb: fix typo in private config values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286935 (https://phabricator.wikimedia.org/T414405) [13:56:44] no, branch names point to git hashes, they’re not part of the hash AFAIK… [13:56:50] but I thought the jobs would be different, at least in part [13:56:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3004.wikimedia.org [13:57:17] (03PS2) 10Jelto: miscweb: fix typo in private config values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286935 (https://phabricator.wikimedia.org/T414405) [13:57:46] Sorry, to be more clear, AIUI the success cache's keys are hashed including the git pointer hash and the git branch name of each of the repos involved. [13:57:52] Or something like that. [13:57:53] (03Merged) 10jenkins-bot: Add 'Promise-Non-Write-API-Action' to $wgAllowedCorsHeaders [core] (wmf/1.47.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1286891 (https://phabricator.wikimedia.org/T425972) (owner: 10Bartosz Dziewoński) [13:57:58] !log repooling cp7001 to test haproxy-awslc behavior (T419825) [13:58:00] ok, that sounds more plausible ^^ [13:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:02] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [13:58:17] (03Merged) 10jenkins-bot: mathoid: Upgrade image to 2026-05-12-175031 with Node 24 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286448 (https://phabricator.wikimedia.org/T364779) (owner: 10Jforrester) [13:58:18] A_smart_kitten: maybe that was from an earlier gate-and-submit build? 
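Note on the success-cache discussion above: the idea that the cache keys are "hashed including the git pointer hash and the git branch name of each of the repos involved" can be illustrated with a minimal sketch. This is not the actual CI implementation; the function name `success_cache_key`, the key layout, and the placeholder commit hash are all assumptions made for illustration. It only shows why an identical commit on two different wmf/* branches would miss the cache if the branch name is part of the key.

```python
import hashlib


def success_cache_key(repos):
    """Hypothetical cache key: hash over (repo, branch, commit) triples.

    `repos` is an iterable of (repo_name, branch_name, commit_sha) tuples.
    The real CI key format is not known here; this layout is an assumption.
    """
    digest = hashlib.sha256()
    # Sort so the key does not depend on the order the repos are listed in.
    for repo, branch, sha in sorted(repos):
        digest.update(f"{repo}\0{branch}\0{sha}\n".encode("utf-8"))
    return digest.hexdigest()


# Placeholder commit hash, not a real commit.
same_commit = "0123456789abcdef0123456789abcdef01234567"

key_wmf1 = success_cache_key(
    [("mediawiki/core", "wmf/1.47.0-wmf.1", same_commit)]
)
key_wmf2 = success_cache_key(
    [("mediawiki/core", "wmf/1.47.0-wmf.2", same_commit)]
)

# Same commit, different branch name: the keys differ, so no cache hit.
print(key_wmf1 == key_wmf2)
```

Under this assumed scheme, rebasing backports onto each other would not help across branches, since the branch name alone changes the key.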
[13:58:35] (03Merged) 10jenkins-bot: Add 'Promise-Non-Write-API-Action' to $wgAllowedCorsHeaders [core] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286892 (https://phabricator.wikimedia.org/T425972) (owner: 10Bartosz Dziewoński) [13:58:43] here we go [13:58:48] Lucas_WMDE: i'll do some quick searching, see if i can find something. like i say i could be mistaken (it has been known to happen, occasionally ^^) [13:58:49] so we’ll definitely run into the next window a bit, sorry 😔 [13:59:01] It's fine, we're services-only for the most part. [13:59:07] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1286890|ApiQueryGlobalUsers: Fix parsing logic for legacy log_params entries (T426033)]], [[gerrit:1286897|ApiQueryGlobalUsers: Fix parsing logic for legacy log_params entries (T426033)]], [[gerrit:1286891|Add 'Promise-Non-Write-API-Action' to $wgAllowedCorsHeaders (T425972)]], [[gerrit:1286892|Add 'Promise-Non-Write-API-Action' to $wgAll [13:59:07] owedCorsHeaders (T425972)]] [13:59:08] There's a config patch to sling out but we can do that later. [13:59:12] T426033: PHP Warning: unserialize(): Error at offset 0 of 13 bytes (in CentralAuth) - https://phabricator.wikimedia.org/T426033 [13:59:13] T425972: POST by mw.ForeignApi is CORS-blocked when a Promise-Non-Write-API-Action header is provided - https://phabricator.wikimedia.org/T425972 [13:59:22] $wg all owed cors headers [13:59:28] all the cors headers you owe someone [13:59:33] (03CR) 10Ladsgroup: [C:03+1] "I will personally kill this thing in near future." 
[puppet] - 10https://gerrit.wikimedia.org/r/1286863 (https://phabricator.wikimedia.org/T149804) (owner: 10Muehlenhoff) [13:59:44] ^ instant !bash fodder Amir1 [13:59:48] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-canary: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003 [13:59:56] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/mathoid: apply [13:59:57] :D [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T1400) [14:00:11] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host pki-root1002.eqiad.wmnet with OS trixie [14:00:16] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/mathoid: apply [14:00:52] follow me for more murderous rage comments (https://media0.giphy.com/media/v1.Y2lkPTc5MGI3NjExb2xjODVkNHRtdjNwanBjdjE0MTV6NXJ1YnIxOXpyNzBkNDhsMGhqbCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/0eVM7GVxTDDKxn7OyX/giphy.gif) [14:00:52] (03CR) 10Jelto: [C:03+2] miscweb: fix typo in private config values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286935 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:01:04] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/mathoid: apply [14:01:10] !log lucaswerkmeister-wmde@deploy1003 dragoniez, matmarex, lucaswerkmeister-wmde: Backport for [[gerrit:1286890|ApiQueryGlobalUsers: Fix parsing logic for legacy log_params entries (T426033)]], [[gerrit:1286897|ApiQueryGlobalUsers: Fix parsing logic for legacy log_params entries (T426033)]], [[gerrit:1286891|Add 'Promise-Non-Write-API-Action' to $wgAllowedCorsHeaders (T425972)]], [[gerrit:1286892|Add 'Promise-Non-Write-AP [14:01:10] I-Action' to $wgAllowedCorsHeaders (T425972)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:01:20] Dragoniez, MatmaRex: please test :) [14:01:37] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [14:01:42] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/mathoid: apply [14:01:58] Lucas_WMDE: working as expected, thanks [14:02:17] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [14:02:47] looking [14:03:17] Lucas_WMDE: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1285460/1#message-7f04c1a32308fcda81e6d32cfdd647be4e3d900b FWICS that patch didn't have a gate-and-submit build prior to the one that hit the success cache (some jobs failed due to T419488 i think, but still hit the cache) [14:03:17] T419488: PostBuild changing the status of successful builds to failure for no apparent reason - https://phabricator.wikimedia.org/T419488 [14:03:22] (03Merged) 10jenkins-bot: miscweb: fix typo in private config values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286935 (https://phabricator.wikimedia.org/T414405) (owner: 10Jelto) [14:03:31] Lucas_WMDE: looks good from here as well [14:03:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3004.wikimedia.org [14:03:40] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7001.* [14:03:45] !log lucaswerkmeister-wmde@deploy1003 dragoniez, matmarex, lucaswerkmeister-wmde: Continuing with deployment [14:03:47] alright, thanks! [14:03:58] * Lucas_WMDE is moving between rooms at the office simultaneously [14:04:01] praise be unto spiderpig [14:04:10] browser connections are more stable than SSH connections ;) [14:04:56] (03PS1) 10Muehlenhoff: Switch install2005 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1286942 [14:05:07] thanks Lucas_WMDE and Dragoniez [14:05:46] All done with the old deploy? 
[14:05:48] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-05-05-223640 to 2026-05-12-211330 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286919 (https://phabricator.wikimedia.org/T423369) (owner: 10Jforrester) [14:06:12] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [14:06:48] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [14:07:52] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286890|ApiQueryGlobalUsers: Fix parsing logic for legacy log_params entries (T426033)]], [[gerrit:1286897|ApiQueryGlobalUsers: Fix parsing logic for legacy log_params entries (T426033)]], [[gerrit:1286891|Add 'Promise-Non-Write-API-Action' to $wgAllowedCorsHeaders (T425972)]], [[gerrit:1286892|Add 'Promise-Non-Write-API-Action' to $wgAl [14:07:52] lowedCorsHeaders (T425972)]] (duration: 08m 45s) [14:07:55] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-05-05-223640 to 2026-05-12-211330 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286919 (https://phabricator.wikimedia.org/T423369) (owner: 10Jforrester) [14:07:57] T426033: PHP Warning: unserialize(): Error at offset 0 of 13 bytes (in CentralAuth) - https://phabricator.wikimedia.org/T426033 [14:07:57] T425972: POST by mw.ForeignApi is CORS-blocked when a Promise-Non-Write-API-Action header is provided - https://phabricator.wikimedia.org/T425972 [14:08:29] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:08:29] !log UTC afternoon backport+config window done [14:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:35] MatmaRex, Lucas_WMDE: thank you! 
and sorry for squeezing this in [14:08:56] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-canary: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003 [14:09:01] (03CR) 10Slyngshede: profile::cache::haproxy: add webrequest-based ip reputation data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:09:22] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:10:00] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:10:04] Lucas_WMDE: can I sync https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiEditor/+/1286917 ? [14:10:23] kostajh: up to James_F imho ^^ [14:10:37] Eh, sure, we'll wait. [14:10:44] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:10:50] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:11:09] James_F: I can do it later too, up to you [14:11:19] I'll sling my MW-one out quick. 
[14:11:21] (03CR) 10Jforrester: [C:03+2] Disable wgWikiLambdaEnableAbstractClientMode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286924 (https://phabricator.wikimedia.org/T422647) (owner: 10Jforrester) [14:11:32] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:11:42] James_F: ok [14:11:50] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-05-06-154732 to 2026-05-12-210548 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286920 [14:11:53] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-05-06-154732 to 2026-05-12-210548 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286920 (owner: 10Jforrester) [14:12:17] (03Merged) 10jenkins-bot: Disable wgWikiLambdaEnableAbstractClientMode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286924 (https://phabricator.wikimedia.org/T422647) (owner: 10Jforrester) [14:12:24] (03CR) 10Elukey: profile::cache::haproxy: add webrequest-based ip reputation data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:12:52] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1286924|Disable wgWikiLambdaEnableAbstractClientMode everywhere (T422647)]] [14:12:56] T422647: Register and document the abstract-client-mode feature flag - https://phabricator.wikimedia.org/T422647 [14:14:00] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-05-06-154732 to 2026-05-12-210548 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286920 (owner: 10Jforrester) [14:14:32] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:14:54] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:14:55] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1286924|Disable wgWikiLambdaEnableAbstractClientMode 
everywhere (T422647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:15:02] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:15:17] !log jforrester@deploy1003 jforrester: Continuing with deployment [14:15:54] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:15:57] !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Enable Java security updates - klausman@cumin1003 [14:16:54] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:17:32] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pki-root1002.eqiad.wmnet with reason: host reimage [14:17:35] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:17:57] 10SRE-swift-storage, 10Ceph, 06Infrastructure-Foundations, 06Machine-Learning-Team: Move the Docker Registry's /ml prefix to S3/apus - https://phabricator.wikimedia.org/T420978#11917713 (10elukey) Overall idea in T394476#11506446. [14:18:55] (03CR) 10Bking: [C:03+2] opensearch on k8s: Enable service mesh for clusters [puppet] - 10https://gerrit.wikimedia.org/r/1286504 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [14:19:27] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286924|Disable wgWikiLambdaEnableAbstractClientMode everywhere (T422647)]] (duration: 06m 35s) [14:19:30] T422647: Register and document the abstract-client-mode feature flag - https://phabricator.wikimedia.org/T422647 [14:20:42] kostajh: Over to you, sorry. 
[14:22:48] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11917781 (10cmooney) [14:23:00] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11917782 (10cmooney) [14:23:30] James_F: thanks! [14:23:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/WikiEditor] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1286917 (https://phabricator.wikimedia.org/T424010) (owner: 10Kosta Harlan) [14:23:47] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11917784 (10cmooney) [14:24:42] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11917793 (10cmooney) [14:24:58] (03CR) 10UndueMarmot: "Forgot to remove the specialized configuration settings for JsonConfig on testwiki, making creation of new Commons-based data templates on" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893080 (https://phabricator.wikimedia.org/T213295) (owner: 10Zabe) [14:25:05] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki-root1002.eqiad.wmnet with reason: host reimage [14:27:50] (03CR) 10Ayounsi: [C:03+1] Switch install2005 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1286942 (owner: 10Muehlenhoff) [14:28:01] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [14:28:16] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [14:28:43] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:30:05] Deploy window Wikifunctions Services UTC Afternoon 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T1430) [14:30:13] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11917850 (10cmooney) In terms of the Nokia configuration for the ports connecting to the CRs set them up like this to create the two needed sub-interfaces... [14:31:27] (03CR) 10Slyngshede: profile::cache::haproxy: add webrequest-based ip reputation data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:32:54] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11917889 (10MoritzMuehlenhoff) [14:33:40] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS name for uslfo network new swtiches - pt1979@cumin2002" [14:33:43] !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Enable Java security updates - klausman@cumin1003 [14:33:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add missing DNS name for uslfo network new swtiches - pt1979@cumin2002" [14:33:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:33:58] !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Enable Java security updates - klausman@cumin1003 [14:34:20] (03Merged) 10jenkins-bot: WikiEditor: Populate user_groups in EditAttemptStep events [extensions/WikiEditor] (wmf/1.47.0-wmf.2) - 
10https://gerrit.wikimedia.org/r/1286917 (https://phabricator.wikimedia.org/T424010) (owner: 10Kosta Harlan) [14:34:44] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1286917|WikiEditor: Populate user_groups in EditAttemptStep events (T424010)]] [14:34:48] T424010: Collect performer implicit groups in editattemptstep for hCaptcha rollout - https://phabricator.wikimedia.org/T424010 [14:34:56] (03PS5) 10CDanis: profile::cache::haproxy: add webrequest-based ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:34:59] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:35:21] A_smart_kitten: (re 14:03 UTC): iiiiiinteresting [14:36:42] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [14:36:47] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1286917|WikiEditor: Populate user_groups in EditAttemptStep events (T424010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:37:28] (03PS6) 10CDanis: profile::cache::haproxy: add webrequest-based ip reputation data [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:37:31] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:37:51] !log kharlan@deploy1003 kharlan: Continuing with deployment [14:39:07] (03CR) 10ArielGlenn: "Almost ready to go, see remaining nit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1282962 (https://phabricator.wikimedia.org/T424824) (owner: 10Daniel Kinzler) [14:42:02] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286917|WikiEditor: Populate user_groups in EditAttemptStep events (T424010)]] (duration: 07m 17s) [14:42:05] T424010: Collect performer implicit groups in editattemptstep for hCaptcha rollout - https://phabricator.wikimedia.org/T424010 [14:42:41] I’m done [14:43:34] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pki-root1002.eqiad.wmnet with OS trixie [14:43:50] (03CR) 10CDanis: profile::cache::haproxy: add webrequest-based ip reputation data (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:47:02] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11917969 (10MoritzMuehlenhoff) [14:47:03] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [14:49:43] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [14:49:45] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [14:49:49] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ttmserver-test: apply [14:50:09] 06SRE, 10SRE-Access-Requests: 
Update production SSH key for alexsanford - https://phabricator.wikimedia.org/T426210 (10ASanford-WMF) 03NEW [14:50:11] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:51:42] !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Enable Java security updates - klausman@cumin1003 [14:51:44] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11917995 (10MoritzMuehlenhoff) [14:52:21] (03PS1) 10Cathal Mooney: Add remaining INCLUDE statements for ulsfo IPv6 link address ranges [dns] - 10https://gerrit.wikimedia.org/r/1286956 (https://phabricator.wikimedia.org/T408892) [14:53:05] (03CR) 10CI reject: [V:04-1] Add remaining INCLUDE statements for ulsfo IPv6 link address ranges [dns] - 10https://gerrit.wikimedia.org/r/1286956 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [14:53:27] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns records for missing ulsfo subnets - cmooney@cumin1003" [14:53:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns records for missing ulsfo subnets - cmooney@cumin1003" [14:53:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:54:02] (03PS2) 10Cathal Mooney: Add remaining INCLUDE statements for ulsfo IPv6 link address ranges [dns] - 10https://gerrit.wikimedia.org/r/1286956 (https://phabricator.wikimedia.org/T408892) [14:54:44] (03CR) 10CI reject: [V:04-1] Add remaining INCLUDE statements for ulsfo IPv6 link address ranges [dns] - 10https://gerrit.wikimedia.org/r/1286956 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [14:56:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1006:9105 is missing kubernetes 
logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1006 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:57:19] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:59:12] (03CR) 10Elukey: profile::cache::haproxy: add webrequest-based ip reputation data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1283821 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:59:50] (03CR) 10Dreamy Jazz: [C:03+1] "Novem do you have time to get this deployed in a puppet window, or do you want me to handle that?" [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309) (owner: 10Novem Linguae) [15:00:03] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [15:00:43] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [15:00:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:00:58] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:01:14] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [15:03:03] (03PS1) 10Atsuko: opensearch-ttmserver: switch to opensearch 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286957 (https://phabricator.wikimedia.org/T425377) [15:03:11] cmooney@cumin1003 netbox (PID 3631366) is awaiting input [15:04:05] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns records for missing ulsfo subnets - cmooney@cumin1003" [15:04:11] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: update dns records for missing ulsfo subnets - cmooney@cumin1003" [15:04:11] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:04:14] (03PS3) 10Cathal Mooney: Add remaining INCLUDE statements for ulsfo IPv6 link address ranges [dns] - 10https://gerrit.wikimedia.org/r/1286956 (https://phabricator.wikimedia.org/T408892) [15:04:47] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [15:06:13] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [15:10:17] (03CR) 10Ayounsi: "overall lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1286956 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [15:11:04] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-canary: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003 [15:11:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1006 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:11:41] (03PS1) 10Jelto: Revert "miscweb: remove wmf-navigator public and private config from web container" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286958 [15:11:49] (03CR) 10CI reject: [V:04-1] Revert "miscweb: remove wmf-navigator public and private config from web container" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286958 (owner: 10Jelto) [15:14:01] (03PS4) 10Cathal Mooney: Add remaining INCLUDE statements for ulsfo IPv6 link address ranges [dns] - 10https://gerrit.wikimedia.org/r/1286956 (https://phabricator.wikimedia.org/T408892) [15:15:06] (03CR) 10Cathal Mooney: Add remaining INCLUDE statements for ulsfo IPv6 link address ranges (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1286956 
(https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [15:15:10] (03CR) 10Ayounsi: [C:03+1] Add remaining INCLUDE statements for ulsfo IPv6 link address ranges [dns] - 10https://gerrit.wikimedia.org/r/1286956 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [15:15:15] (03PS2) 10Jelto: Revert "miscweb: remove wmf-navigator public and private config from web container" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286958 [15:15:22] (03CR) 10Cathal Mooney: [C:03+2] Add remaining INCLUDE statements for ulsfo IPv6 link address ranges [dns] - 10https://gerrit.wikimedia.org/r/1286956 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [15:15:39] !log cmooney@dns2005 START - running authdns-update [15:15:42] 10SRE-SLO: SLOs: enable SLO-based alerting - https://phabricator.wikimedia.org/T425797#11918091 (10herron) p:05Triage→03Medium [15:16:56] !log cmooney@dns2005 END - running authdns-update [15:18:49] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-canary: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003 [15:19:03] (03PS1) 10Elukey: profile::docker_registry: rename ml bucket [puppet] - 10https://gerrit.wikimedia.org/r/1286961 (https://phabricator.wikimedia.org/T420978) [15:19:06] (03CR) 10Jelto: [C:03+2] Revert "miscweb: remove wmf-navigator public and private config from web container" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286958 (owner: 10Jelto) [15:19:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1006 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:20:59] (03PS1) 10Btullis: Add new partman recipes to facilitate testing WDQS on K8S [puppet] - 10https://gerrit.wikimedia.org/r/1286962 
(https://phabricator.wikimedia.org/T425653) [15:21:44] (03Merged) 10jenkins-bot: Revert "miscweb: remove wmf-navigator public and private config from web container" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1286958 (owner: 10Jelto) [15:21:52] (03PS1) 10Fabfur: hiera: install haproxy-awslc on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1286963 (https://phabricator.wikimedia.org/T419825) [15:22:48] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1286963 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [15:24:44] (03CR) 10Elukey: [C:03+2] profile::docker_registry: rename ml bucket [puppet] - 10https://gerrit.wikimedia.org/r/1286961 (https://phabricator.wikimedia.org/T420978) (owner: 10Elukey) [15:24:47] (03CR) 10Btullis: [C:03+2] Add new partman recipes to facilitate testing WDQS on K8S [puppet] - 10https://gerrit.wikimedia.org/r/1286962 (https://phabricator.wikimedia.org/T425653) (owner: 10Btullis) [15:24:55] (03CR) 10Ssingh: [C:03+1] hiera: install haproxy-awslc on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1286963 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [15:25:16] btullis: ok to merge? 
[15:26:26] (03CR) 10Fabfur: [C:03+2] hiera: install haproxy-awslc on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1286963 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [15:26:29] I went ahead, the commit seems not harmful :) [15:27:43] !log depooling cp7009 to install haproxy-awslc (T419825) [15:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:46] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [15:27:49] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp7009.* [15:29:27] (03PS6) 10Federico Ceratto: sre.mysql.global-read-only Set all sections as RO/RW [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) [15:31:01] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [15:31:08] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs-test2001.codfw.wmnet with OS bookworm [15:32:38] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [15:32:54] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test2001.codfw.wmnet with OS bookworm [15:34:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1006:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1006 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:35:56] (03PS2) 10Dduvall: zuul: Run zuul-scheduler/-launcher/-web as zuul user [puppet] - 10https://gerrit.wikimedia.org/r/1286463 [15:35:56] (03PS1) 10Dduvall: zuul: Bump image_version to 14.2.0-2 [puppet] - 10https://gerrit.wikimedia.org/r/1286968 [15:36:19] (03CR) 10Dduvall: [C:03+1] zuul: Run zuul-scheduler/-launcher/-web as 
zuul user [puppet] - 10https://gerrit.wikimedia.org/r/1286463 (owner: 10Dduvall) [15:36:26] !log repooling cp7009 to test haproxy-awslc behavior (T419825) [15:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:29] T419825: Test HAProxy 3.2 with AWS-LC libraries - https://phabricator.wikimedia.org/T419825 [15:37:32] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7009.* [15:37:33] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [15:37:54] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [15:38:02] (03PS2) 10Krinkle: Enable wgTrackMediaRequestProvenance on remaining Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269442 (https://phabricator.wikimedia.org/T414338) [15:38:38] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11918235 (10cmooney) Overall the other info in this task makes sense to me. I think we can do all the vlan renames in advance. So when we set up the swi... 
[15:38:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269442 (https://phabricator.wikimedia.org/T414338) (owner: 10Krinkle) [15:40:19] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [15:40:54] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [15:42:07] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003 [15:42:51] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [15:44:00] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [15:44:44] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [15:44:57] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [15:45:04] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [15:45:24] 10ops-eqsin, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: EQSIN:New switch setup/configuration - https://phabricator.wikimedia.org/T418439#11918255 (10cmooney) [15:45:28] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [15:50:09] (03PS4) 10Kgraessle: Enable AutoModerator on Italian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192921 (https://phabricator.wikimedia.org/T405152) [15:51:41] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:08] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-wdqs-test2001.codfw.wmnet with OS bookworm [15:53:25] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test2001.codfw.wmnet with OS bookworm [15:53:41] (03CR) 10Dzahn: [C:03+2] zuul: Bump image_version to 14.2.0-2 [puppet] - 10https://gerrit.wikimedia.org/r/1286968 (owner: 10Dduvall) [15:59:35] 10ops-eqiad, 06DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T426221 (10phaultfinder) 03NEW [16:00:51] 10ops-codfw, 06SRE, 06DC-Ops, 10procurement, 06ServiceOps new: decomission deploy2002.codfw.wmnet - https://phabricator.wikimedia.org/T426222 (10Raine) 03NEW [16:02:21] (03PS1) 10Jsn.sherman: Enable AutoModerator on Albanian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286974 (https://phabricator.wikimedia.org/T420450) [16:02:23] 10ops-codfw, 06SRE, 06DC-Ops, 10procurement, 06ServiceOps new: decomission deploy2002.codfw.wmnet - https://phabricator.wikimedia.org/T426222#11918340 (10Raine) 05Open→03Stalled [16:02:23] (03PS1) 10Jsn.sherman: Enable AutoModerator on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286975 (https://phabricator.wikimedia.org/T425509) [16:04:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192921 (https://phabricator.wikimedia.org/T405152) (owner: 10Kgraessle) [16:05:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286974 (https://phabricator.wikimedia.org/T420450) (owner: 
10Jsn.sherman) [16:05:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286975 (https://phabricator.wikimedia.org/T425509) (owner: 10Jsn.sherman) [16:06:52] 06SRE, 10SRE-Access-Requests: Update production SSH key for alexsanford - https://phabricator.wikimedia.org/T426210#11918359 (10Marostegui) p:05Triage→03Medium a:03Marostegui [16:08:03] (03PS1) 10Marostegui: data.yaml: Update alexsanford's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1286977 (https://phabricator.wikimedia.org/T426210) [16:08:53] (03CR) 10CI reject: [V:04-1] data.yaml: Update alexsanford's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1286977 (https://phabricator.wikimedia.org/T426210) (owner: 10Marostegui) [16:09:23] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:39] (03PS2) 10Marostegui: data.yaml: Update alexsanford's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1286977 (https://phabricator.wikimedia.org/T426210) [16:10:23] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [16:10:24] (03CR) 10Marostegui: "Verified the ssh key out of band via slack too." 
[puppet] - 10https://gerrit.wikimedia.org/r/1286977 (https://phabricator.wikimedia.org/T426210) (owner: 10Marostegui) [16:10:47] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [16:10:54] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs-test1001.eqiad.wmnet with OS bookworm [16:10:57] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update production SSH key for alexsanford - https://phabricator.wikimedia.org/T426210#11918391 (10Marostegui) Patch ready - waiting for review [16:11:08] (03CR) 10Dzahn: "looks good to me - as long as the key has been verified out-of-band somewhere" [puppet] - 10https://gerrit.wikimedia.org/r/1286977 (https://phabricator.wikimedia.org/T426210) (owner: 10Marostegui) [16:11:30] (03CR) 10Dzahn: [C:03+1] data.yaml: Update alexsanford's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1286977 (https://phabricator.wikimedia.org/T426210) (owner: 10Marostegui) [16:11:53] (03CR) 10Marostegui: [C:03+2] "Yes, checked on deployment host and via slack." [puppet] - 10https://gerrit.wikimedia.org/r/1286977 (https://phabricator.wikimedia.org/T426210) (owner: 10Marostegui) [16:12:41] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update production SSH key for alexsanford - https://phabricator.wikimedia.org/T426210#11918395 (10Marostegui) 05Open→03Resolved The new key has been merged, please allow 20 minutes for the change to spread across all the hosts. 
[16:13:41] FIRING: [19x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:15:42] (03CR) 10Cathal Mooney: [C:03+2] Add new IBGP sub-interfaces to OSPF on core routers at POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1286913 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [16:16:26] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:04] (03Merged) 10jenkins-bot: Add new IBGP sub-interfaces to OSPF on core routers at POPs [homer/public] - 10https://gerrit.wikimedia.org/r/1286913 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [16:19:03] FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:19:58] !log update OSPF config on eqsin core routers to shift traffic to switch links T424611 [16:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:04] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [16:22:02] (03CR) 10BPirkle: [C:03+1] "Change seems fine." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286862 (https://phabricator.wikimedia.org/T426081) (owner: 10Gkyziridis) [16:23:24] (03CR) 10Dzahn: [C:03+2] zuul: Run zuul-scheduler/-launcher/-web as zuul user [puppet] - 10https://gerrit.wikimedia.org/r/1286463 (owner: 10Dduvall) [16:28:55] (03CR) 10Marostegui: "To go back I'd do mariadb and then dbctl" [cookbooks] - 10https://gerrit.wikimedia.org/r/1277076 (https://phabricator.wikimedia.org/T419874) (owner: 10Federico Ceratto) [16:28:59] !log update OSPF config on drmrs core routers to shift traffic to switch links T424611 [16:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:03] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [16:32:37] (03PS1) 10BPirkle: Revert "Add wikibase.v1 module to the sandbox were it is present" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286981 (https://phabricator.wikimedia.org/T422403) [16:34:23] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:37:11] (03CR) 10KineticPelagic: [C:03+1] "Better luck next time, wikibase.v1 module!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286981 (https://phabricator.wikimedia.org/T422403) (owner: 10BPirkle) [16:37:34] FIRING: DiskSpace: Disk space build2001:9100:/ 1.435% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:38:54] 10SRE-swift-storage, 10Ceph, 06Infrastructure-Foundations, 06Machine-Learning-Team: Move the Docker Registry's /ml prefix to S3/apus - https://phabricator.wikimedia.org/T420978#11918508 (10elukey) I moved all the https://docker-registry.wikimedia.org/ml images (vllm, various versions) to apus using the abo... [16:43:22] (03CR) 10Dzahn: [C:03+2] "it says user class does not have a parameter UID - we should use systemd.. checking" [puppet] - 10https://gerrit.wikimedia.org/r/1286463 (owner: 10Dduvall) [16:43:47] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-wdqs-test2001.codfw.wmnet with OS bookworm [16:44:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286981 (https://phabricator.wikimedia.org/T422403) (owner: 10BPirkle) [16:53:58] (03CR) 10Reedy: [C:04-1] "cf Slack, but I'm not sure this will work (CSP etc), never mind GitHub is a mirror, not the canonical" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286862 (https://phabricator.wikimedia.org/T426081) (owner: 10Gkyziridis) [16:54:40] (03CR) 10Dzahn: [C:03+2] "ah, duh:) it's just that for the group it's gid and not uid" [puppet] - 10https://gerrit.wikimedia.org/r/1286463 (owner: 10Dduvall) [16:54:59] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host ncmonitor1001.eqiad.wmnet [16:55:46] (03CR) 10Dzahn: [C:03+2] "we are still supposed to migrate 
to systemd::sysuser though" [puppet] - 10https://gerrit.wikimedia.org/r/1286463 (owner: 10Dduvall) [16:57:26] (03CR) 10Reedy: [C:04-1] "TBH, I'm not even sure from Gerrit will work depending on how it's loaded... in the browser via JS?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286862 (https://phabricator.wikimedia.org/T426081) (owner: 10Gkyziridis) [16:57:48] (03PS1) 10Dzahn: zuul: fix parameter for group (uid -> gid) [puppet] - 10https://gerrit.wikimedia.org/r/1286984 [16:58:55] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncmonitor1001.eqiad.wmnet [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T1700) [17:02:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:03:42] !log aokoth@cumin1003 START - Cookbook sre.vrts.upgrade on VRTS host vrts1003.eqiad.wmnet [17:05:09] (03CR) 10BPirkle: "Yes, loaded in the browser via JS. 
The Content-Security-Policy for https://en.wikipedia.org/wiki/Special:RestSandbox includes "raw.githubu" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286862 (https://phabricator.wikimedia.org/T426081) (owner: 10Gkyziridis) [17:05:38] !log aokoth@cumin1003 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0) on VRTS host vrts1003.eqiad.wmnet [17:08:02] (03CR) 10Dzahn: [C:03+2] zuul: fix parameter for group (uid -> gid) [puppet] - 10https://gerrit.wikimedia.org/r/1286984 (owner: 10Dzahn) [17:11:54] (03PS1) 10Herron: add dummy token for dashboard reporter [labs/private] - 10https://gerrit.wikimedia.org/r/1286988 [17:13:04] (03CR) 10TChin: [C:03+2] [eventgate] Bump eventgate-* to v1.30.0 and enable transforms [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285426 (https://phabricator.wikimedia.org/T415549) (owner: 10TChin) [17:14:13] (03PS7) 10BCornwall: upload: Return 400 instead of 429 for non-standard thumbnail sizes [puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) (owner: 10Neriah) [17:15:25] (03Merged) 10jenkins-bot: [eventgate] Bump eventgate-* to v1.30.0 and enable transforms [deployment-charts] - 10https://gerrit.wikimedia.org/r/1285426 (https://phabricator.wikimedia.org/T415549) (owner: 10TChin) [17:15:52] (03PS3) 10Bking: Presto memory tuning, resource groups [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [17:16:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [17:16:33] (03PS4) 10Bking: Presto memory tuning, resource groups [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) (owner: 10Aleksandar Mastilovic) [17:16:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1285926 (https://phabricator.wikimedia.org/T424112) 
(owner: 10Aleksandar Mastilovic) [17:16:55] (03CR) 10BCornwall: [V:03+2 C:03+2] "Fixed some of the bad formatting that was there prior in PS7 so it doesn't keep snowballing into something uglier - Tests are still happy." [puppet] - 10https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T414805) (owner: 10Neriah) [17:19:53] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [17:20:24] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [17:22:44] (03CR) 10Dzahn: [C:03+2] "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/1286984" [puppet] - 10https://gerrit.wikimedia.org/r/1286463 (owner: 10Dduvall) [17:23:38] (03CR) 10Novem Linguae: "Would be great if you could squeeze it into one of yours." [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309) (owner: 10Novem Linguae) [17:23:58] !log update OSPF config on esams core routers to shift traffic to switch links T424611 [17:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:02] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [17:25:24] (03CR) 10Dzahn: [C:03+2] "oof, so puppet starts the process and then can't run usermod because the process is running ..." 
[puppet] - 10https://gerrit.wikimedia.org/r/1286463 (owner: 10Dduvall) [17:26:13] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [17:26:25] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [17:26:49] !log zuul1001 - stopping zuul-web; then manually running: /usr/sbin/usermod -u 923 zuul [17:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:01] !log zuul1001 systemctl start zuul-scheduler ; /usr/bin/docker exec zuul-scheduler zuul-scheduler smart-reconfigure [17:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:23] (03CR) 10Dduvall: [C:03+1] "Would a `require => User['zuul']` on the service help with that?" [puppet] - 10https://gerrit.wikimedia.org/r/1286463 (owner: 10Dduvall) [17:29:02] mutante: ah, i see you're already handling it manually. that works too [17:29:48] dduvall: yes, I think I should also use a "find -exec" to find any files owned by previous UID 498 and make the new UID own them.. been through this before on other hosts [17:29:59] ack [17:30:02] and then switch puppet to use systemd::sysuser too [17:30:05] will do [17:31:19] looks good though, nothing owned by UID 498 or GID 498 anymore [17:31:52] now on the other host(s) [17:33:23] the executor host should be fine i think since zuul-executor doesn't use the zuul user (yet) [17:33:43] yea.. we did a good job at puppetizing.. as in.. 
once the UID is changed and puppet ran it fixes all the ownership [17:33:54] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [17:34:03] nice [17:34:11] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [17:36:20] dduvall: the "smart-reconfigure" command is fine on zuul1001 but "FileNotFoundError: [Errno 2] No such file or directory" on zuul2001 [17:36:39] that's odd [17:36:47] !log update OSPF config on magru core routers to shift traffic to switch links T424611 [17:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:50] T424611: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611 [17:36:58] other than that: stop zuul-scheduler, run usermod, start zuul-scheduler, run smart-reconfigure works [17:37:22] let's run puppet another time [17:40:17] hmm yea.. still have to figure that one out [17:40:49] i see zuul-scheduler failing on sql and zk connection [17:41:39] puppet also just deployed the cert.. uhm [17:41:48] (03PS8) 10Herron: grafana-dashboard-reporter: initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1286507 (https://phabricator.wikimedia.org/T425795) [17:41:54] (03PS5) 10Herron: grafana: add dashboard reporter plugin [puppet] - 10https://gerrit.wikimedia.org/r/1286986 [17:42:25] we don't want zuul-scheduler functioning on that host atm though. 
it's standby [17:42:43] maybe the fix is just to properly turn it off [17:42:46] based on "not main host" [17:42:51] yeah [17:42:57] ok, can do [17:42:58] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [17:43:09] (03CR) 10Herron: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8558/co" [puppet] - 10https://gerrit.wikimedia.org/r/1286986 (owner: 10Herron) [17:43:12] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [17:43:52] mutante: same for the executor host [17:44:44] dduvall: ok, I will make this all dependent on a single Hiera setting for "active_host" just like we do for other services [17:44:55] sounds good [17:45:02] to make switch-overs simple [17:45:23] for the build node, it may make sense to have both active at all times, but we're not even there yet :) [17:47:14] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003 [17:47:29] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [17:47:55] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [17:50:18] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply [17:50:30] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [17:54:08] 06SRE, 10corto, 10Incident Tooling: Increase trusted volunteer's visibility into production incidents - https://phabricator.wikimedia.org/T426137#11918832 (10Novem_Linguae) I agree with the sentiments of this ticket. I am a trusted volunteer (I'm in the Phabricator groups acl*security and WMF-NDA), and I can... 
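Editor's sketch of the UID cleanup discussed between 17:26 and 17:31: after `usermod` moves the zuul user off UID 498, anything still owned by the old UID has to be found and re-owned. The log mentions doing this with "find -exec" and then letting Puppet fix ownership. The Python below is only an illustrative equivalent, not what was run on the hosts. The function names `find_owned_by` and `reown` are hypothetical, and the UIDs are the ones quoted in the log.

```python
import os

OLD_UID = 498  # previous zuul UID, as quoted in the log
NEW_UID = 923  # UID passed to usermod in the log


def find_owned_by(root, uid):
    """Return paths below root (root itself is not checked) owned by uid.

    Symlinks are inspected with lstat, so the link itself is tested,
    not its target.
    """
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_uid == uid:
                    matches.append(path)
            except FileNotFoundError:
                # The file vanished between listing and stat; skip it.
                continue
    return matches


def reown(paths, new_uid, dry_run=True):
    """Change the owner of each path; only print the plan when dry_run is set."""
    for path in paths:
        if dry_run:
            print(f"would chown {path} -> uid {new_uid}")
        else:
            os.lchown(path, new_uid, -1)  # -1 leaves the group unchanged


# Example (needs root for a real run; the path is an assumption):
#   reown(find_owned_by("/var/lib/zuul", OLD_UID), NEW_UID, dry_run=True)
```

Defaulting to a dry run mirrors the check reported at 17:31 ("nothing owned by UID 498 or GID 498 anymore"): list first, change only once the list looks right.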
[17:57:25] (03PS2) 10Zabe: Start reading from new file tables on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1270513 (https://phabricator.wikimedia.org/T416548) [18:00:05] andre and brennen: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T1800). [18:00:13] jouncebot: naaaah [18:00:32] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [18:01:15] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [18:02:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:03:57] (03CR) 10Dzahn: [C:03+2] "I manually stopped services; then ran the "usermod" command and then let puppet fix the file permissions. checked with "find / -uid" and -" [puppet] - 10https://gerrit.wikimedia.org/r/1286463 (owner: 10Dduvall) [18:04:03] (03CR) 10Muehlenhoff: "Feel free to go ahead and deploy, I have a long tail of things to look at and will eventually get to it, but if you want to go ahead and m" [puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [18:05:45] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003 [18:09:04] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:12:30] 06SRE, 10corto, 10Incident Tooling: Increase trusted volunteer's visibility into production incidents - https://phabricator.wikimedia.org/T426137#11918926 (10Novem_Linguae) Maybe the Corto IRC bot's `create` command should be split into `create public` and `create private`, and if `create public` is selected... 
[18:13:15] (03PS1) 10Cathal Mooney: Reverse PTRs: add include statements for ulsfo and eqsin new ranges [dns] - 10https://gerrit.wikimedia.org/r/1286993 (https://phabricator.wikimedia.org/T424611) [18:13:35] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for ulsfo and eqsin IPs - cmooney@cumin1003" [18:13:42] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for ulsfo and eqsin IPs - cmooney@cumin1003" [18:13:42] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:13:55] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [18:14:34] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [18:14:40] RECOVERY - Confd vcl based reload on cp6009 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [18:14:40] RECOVERY - Confd vcl based reload on cp6014 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [18:19:24] (03CR) 10Cathal Mooney: [C:03+2] Reverse PTRs: add include statements for ulsfo and eqsin new ranges [dns] - 10https://gerrit.wikimedia.org/r/1286993 (https://phabricator.wikimedia.org/T424611) (owner: 10Cathal Mooney) [18:19:43] !log cmooney@dns2005 START - running authdns-update [18:20:57] !log cmooney@dns2005 END - running authdns-update [18:23:02] (03PS1) 10Ebernhardson: Revert "cirrus: AB test query suggester variants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286997 (https://phabricator.wikimedia.org/T407432) [18:24:20] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T426221#11918991 (10Jclark-ctr) a:03Jclark-ctr ` ps1-e1-eqiad.mgmt.eqiad.wmnet #1: Sensor: Line, AA:L2, Current Value: 12.14 A (current) Thresholds: High: 12 #2: Sensor... [18:25:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286997 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [18:25:34] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [18:25:38] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [18:26:04] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [18:26:30] (03PS1) 10Dzahn: zuul: replace user/group setup with systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1286999 [18:27:02] (03CR) 10CI reject: [V:04-1] zuul: replace user/group setup with systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1286999 (owner: 10Dzahn) [18:27:16] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [18:29:52] (03PS1) 10Jdlrobson: 
Handle share-highlight images w/o resizeUrl [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287000 (https://phabricator.wikimedia.org/T426215) [18:36:35] (03PS1) 10Neriah: Disable wgNewUserMessageOnAutoCreate on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) [18:37:42] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [18:38:33] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [18:38:41] (03PS2) 10Neriah: Disable wgNewUserMessageOnAutoCreate on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) [18:41:09] (03CR) 10Gmodena: [C:03+1] Add max-batches option to cap the size of a wikibase RDF dump. (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) (owner: 10Lerickson) [18:43:50] (03CR) 10Neriah: Disable wgNewUserMessageOnAutoCreate on all WMF wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287002 (https://phabricator.wikimedia.org/T426206) (owner: 10Neriah) [18:45:11] (03CR) 10Eric Gardner: [C:03+1] Handle share-highlight images w/o resizeUrl [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287000 (https://phabricator.wikimedia.org/T426215) (owner: 10Jdlrobson) [18:50:05] (03CR) 10Andrew Bogott: [C:03+2] labs_lvm: use ensure_packages so this can coexist with other lvm rules [puppet] - 10https://gerrit.wikimedia.org/r/1282408 (owner: 10Andrew Bogott) [18:50:29] (03CR) 10Andrew Bogott: [C:03+2] Add new class, labs_lvm_ephemeral (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [18:50:55] (03Abandoned) 10Jdlrobson: Limit and standardize thumbnail options [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152) (owner: 10Jdlrobson) [18:53:14] (03PS6) 10Dreamy Jazz: purge_securepoll: don't exclude private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309) (owner: 10Novem Linguae) [18:53:26] (03CR) 10Dreamy Jazz: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1279281 (https://phabricator.wikimedia.org/T419309) (owner: 10Novem Linguae) [18:53:48] (03PS9) 10Andrew Bogott: Add new class, labs_lvm_ephemeral [puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) [18:53:50] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1282006 (https://phabricator.wikimedia.org/T422258) (owner: 10Andrew Bogott) [19:00:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:03:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:05:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:09:12] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [19:09:29] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [19:12:07] (03CR) 10Lerickson: Add max-batches option to cap 
the size of a wikibase RDF dump. (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1286487 (https://phabricator.wikimedia.org/T425036) (owner: 10Lerickson) [19:15:53] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 5 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11919194 (10Ladsgroup) [19:19:58] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:19:58] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:21:15] (03CR) 10Kgraessle: [C:03+1] Enable AutoModerator on Albanian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286974 (https://phabricator.wikimedia.org/T420450) (owner: 10Jsn.sherman) [19:22:14] (03CR) 10Kgraessle: [C:03+1] Enable AutoModerator on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286975 (https://phabricator.wikimedia.org/T425509) (owner: 10Jsn.sherman) [19:23:01] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [19:23:33] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [19:30:51] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Add hcaptcha checkout in production [puppet] - 10https://gerrit.wikimedia.org/r/1287007 (https://phabricator.wikimedia.org/T403829) [19:33:42] (03PS2) 10Muehlenhoff: mariadb: Migrate section-specific DBA access rule to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) [19:34:06] (03CR) 10Ladsgroup: [C:03+2] mariadb: Migrate section-specific DBA access rule to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [19:34:09] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Migrate section-specific 
DBA access rule to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [19:41:13] (03CR) 10Ladsgroup: [V:03+2 C:03+2] "I did the dance on db1157 and it worked fine so now rolling this out to everywhere, the pre and post files are there. Just to be clear, th" [puppet] - 10https://gerrit.wikimedia.org/r/1270432 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [19:55:19] (03PS1) 10Jdlrobson: Update small size for Swedish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287006 (https://phabricator.wikimedia.org/T424910) [19:55:22] (03CR) 10Jdlrobson: Update small size for Swedish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287006 (https://phabricator.wikimedia.org/T424910) (owner: 10Jdlrobson) [19:55:29] (03CR) 10Jdlrobson: [C:04-2] Update small size for Swedish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287006 (https://phabricator.wikimedia.org/T424910) (owner: 10Jdlrobson) [19:55:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287006 (https://phabricator.wikimedia.org/T424910) (owner: 10Jdlrobson) [19:56:26] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:56:36] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:58:48] (03PS1) 10Bking: dse-k8s: Attempt to work around OpenSearch TLS weirdness [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287010 
(https://phabricator.wikimedia.org/T421293) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T2000). [20:00:05] bpirkle, ebernhardson, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] \o [20:00:18] I'm here [20:01:21] o/ if anyone needs a deployer, lmk [20:02:17] bpirkle: do you want to self-deploy since you're 1st in the queue? happy to deploy for you if needed [20:02:28] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:02:34] o/ [20:02:38] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:02:44] It's been a few years since I did it. I wouldn't mind if you did this one. :) [20:02:48] np [20:02:48] cjming I can self deploy mine and help with any others [20:03:05] i can deploy mine, it's super easy [20:03:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:03:15] it's just turning off an ab test, going back to default settings [20:03:28] Jdlrobson: thanks! i'll do Bill's, then Erik can do his, then you can do yours! 
easy peasy [20:03:47] (03PS2) 10BPirkle: Revert "Add wikibase.v1 module to the sandbox were it is present" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286981 (https://phabricator.wikimedia.org/T422403) [20:05:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286981 (https://phabricator.wikimedia.org/T422403) (owner: 10BPirkle) [20:06:33] (03Merged) 10jenkins-bot: Revert "Add wikibase.v1 module to the sandbox were it is present" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286981 (https://phabricator.wikimedia.org/T422403) (owner: 10BPirkle) [20:06:59] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1286981|Revert "Add wikibase.v1 module to the sandbox were it is present" (T422403)]] [20:07:03] T422403: Create Wikibase v1 REST API Module - https://phabricator.wikimedia.org/T422403 [20:09:01] !log cjming@deploy1003 bpirkle, cjming: Backport for [[gerrit:1286981|Revert "Add wikibase.v1 module to the sandbox were it is present" (T422403)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:09:05] bpirkle: lmk when to sync - on test servers [20:09:34] Good to go [20:09:37] !log cjming@deploy1003 bpirkle, cjming: Continuing with deployment [20:12:59] (03CR) 10Ryan Kemper: archiva: block scraper UAs at nginx (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1286536 (https://phabricator.wikimedia.org/T426114) (owner: 10Ryan Kemper) [20:13:02] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003 [20:13:41] FIRING: [19x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:13:46] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286981|Revert "Add wikibase.v1 module to the sandbox were it is present" (T422403)]] (duration: 06m 47s) [20:13:48] bpirkle: should be live! [20:13:49] T422403: Create Wikibase v1 REST API Module - https://phabricator.wikimedia.org/T422403 [20:13:54] Thank you! [20:14:01] yw! [20:14:02] ebernhardson: all yours [20:14:07] (03CR) 10Bking: [C:03+2] dse-k8s: Attempt to work around OpenSearch TLS weirdness [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287010 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [20:14:44] cjming: thanks! [20:14:46] (03CR) 10Reedy: [C:04-1] "`*.wikimedia.org` is in there." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286862 (https://phabricator.wikimedia.org/T426081) (owner: 10Gkyziridis) [20:15:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286997 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [20:15:06] ebernhardson: can you ping me when you're done? 
[20:15:10] thanks in advance :) [20:16:05] (03Merged) 10jenkins-bot: Revert "cirrus: AB test query suggester variants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1286997 (https://phabricator.wikimedia.org/T407432) (owner: 10Ebernhardson) [20:16:29] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1286997|Revert "cirrus: AB test query suggester variants" (T407432)]] [20:16:33] T407432: Follow-up AB test of dym language model variants - https://phabricator.wikimedia.org/T407432 [20:16:41] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:17:19] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [20:17:43] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [20:17:56] Jdlrobson: certainly [20:18:27] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1286997|Revert "cirrus: AB test query suggester variants" (T407432)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:19:03] FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:19:31] !log ebernhardson@deploy1003 ebernhardson: Continuing with deployment [20:21:36] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [20:21:40] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [20:23:35] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1286997|Revert "cirrus: AB test query suggester variants" (T407432)]] (duration: 07m 06s) [20:23:39] T407432: Follow-up AB test of dym language model variants - https://phabricator.wikimedia.org/T407432 [20:23:44] Jdlrobson: all yours [20:23:51] thanks! [20:24:03] (03CR) 10Jdlrobson: Update small size for Swedish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287006 (https://phabricator.wikimedia.org/T424910) (owner: 10Jdlrobson) [20:24:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287006 (https://phabricator.wikimedia.org/T424910) (owner: 10Jdlrobson) [20:24:59] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [20:25:04] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [20:25:12] (03Merged) 10jenkins-bot: Update small size for Swedish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1287006 (https://phabricator.wikimedia.org/T424910) (owner: 10Jdlrobson) [20:25:37] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1287006|Update small size for Swedish Wikipedia (T424910)]] [20:25:41] T424910: Limit Special:Preferences thumbnail option 
to three options - small, regular and large - https://phabricator.wikimedia.org/T424910 [20:27:38] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1287006|Update small size for Swedish Wikipedia (T424910)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:28:59] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment [20:30:03] (03PS1) 10Bking: dse-k8s: Add more values to test OpenSearch with services proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287018 (https://phabricator.wikimedia.org/T421293) [20:33:04] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287006|Update small size for Swedish Wikipedia (T424910)]] (duration: 07m 26s) [20:33:08] T424910: Limit Special:Preferences thumbnail option to three options - small, regular and large - https://phabricator.wikimedia.org/T424910 [20:33:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287000 (https://phabricator.wikimedia.org/T426215) (owner: 10Jdlrobson) [20:34:59] (03Merged) 10jenkins-bot: Handle share-highlight images w/o resizeUrl [extensions/ReaderExperiments] (wmf/1.47.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1287000 (https://phabricator.wikimedia.org/T426215) (owner: 10Jdlrobson) [20:35:26] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1287000|Handle share-highlight images w/o resizeUrl (T426215)]] [20:35:29] T426215: Add fallback for “Only resize when resizeUrl actually exists” bug - https://phabricator.wikimedia.org/T426215 [20:37:23] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1287000|Handle share-highlight images w/o resizeUrl (T426215)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:37:34] FIRING: DiskSpace: Disk space build2001:9100:/ 1.431% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:38:52] !log jdlrobson@deploy1003 jdlrobson: Continuing with deployment [20:40:31] (03CR) 10Bking: [C:03+2] dse-k8s: Add more values to test OpenSearch with services proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1287018 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [20:41:05] !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [20:41:36] !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [20:42:07] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [20:42:58] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287000|Handle share-highlight images w/o resizeUrl (T426215)]] (duration: 07m 32s) [20:43:02] T426215: Add fallback for “Only resize when resizeUrl actually exists” bug - https://phabricator.wikimedia.org/T426215 [20:43:14] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply [20:43:28] done! 
[20:43:30] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[20:43:33] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[20:43:39] !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[20:43:47] !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-semantic-search: apply
[20:43:48] (PS1) Ladsgroup: wgThumbLimits: Remove the exception for itwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1287022 (https://phabricator.wikimedia.org/T376152)
[20:45:48] Jdlrobson: ^
[20:46:06] Amir1: we're good to backport that now?
[20:46:16] yeah
[20:46:20] jouncebot: nowandnext
[20:46:20] For the next 0 hour(s) and 13 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T2000)
[20:46:20] In 0 hour(s) and 13 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T2100)
[20:46:20] k on it
[20:46:35] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1287022 (https://phabricator.wikimedia.org/T376152) (owner: Ladsgroup)
[20:46:36] thanks
[20:47:27] (Merged) jenkins-bot: wgThumbLimits: Remove the exception for itwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1287022 (https://phabricator.wikimedia.org/T376152) (owner: Ladsgroup)
[20:47:54] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1287022|wgThumbLimits: Remove the exception for itwikiquote (T376152)]]
[20:47:58] T376152: Evaluate feasibility of deprecating (or limiting) user media size preferences - https://phabricator.wikimedia.org/T376152
[20:48:48] Jdlrobson: to confirm, all values of zero should become 2 (except on svwiki) right?
[20:49:05] i.e. 0 or 1 should be changed to 2
[20:49:10] (2=180)
[20:49:56] !log jdlrobson@deploy1003 ladsgroup, jdlrobson: Backport for [[gerrit:1287022|wgThumbLimits: Remove the exception for itwikiquote (T376152)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:50:25] (PS1) Thcipriani: Phabricator: require config before scap [puppet] - https://gerrit.wikimedia.org/r/1287023 (https://phabricator.wikimedia.org/T424055)
[20:51:33] !log jdlrobson@deploy1003 ladsgroup, jdlrobson: Continuing with deployment
[20:51:35] Amir1: done!
[20:51:45] \o/
[20:51:48] Amir1: double checking the svwiki thing now
[20:53:12] The change to Italian Wikiquote has moved the numbers, right?
[20:53:23] Since Italian Wikiquote was prepending the value.
[20:54:09] let me check
[20:55:27] isn't it prepending?
[20:55:36] ugh sorry, appending
[20:55:42] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1287022|wgThumbLimits: Remove the exception for itwikiquote (T376152)]] (duration: 07m 48s)
[20:55:46] T376152: Evaluate feasibility of deprecating (or limiting) user media size preferences - https://phabricator.wikimedia.org/T376152
[20:56:02] +itwikiquote prepends
[20:56:18] so [ 360, 150, 180, 200, 220, 250, 300, 400 ] became [ 120, 150, 180, 200, 220, 250, 300, 400 ]
[20:56:50] let me triple check
[20:57:40] sigh, yeah
[20:57:48] it moves it for some people
[20:57:59] I try to fix it, it's like two users
[20:59:31] fixed
[21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T2100)
[21:00:31] sorry i have a meeting now
[21:00:41] you'll have my attention in 1hr exactly!
[21:06:37] SRE, Content-Transform-Team, ServiceOps new, Wikipedia-Android-App-Backlog (Android Release - FY2025-26): Investigate Code 411 error when selecting zh-classical (lzh) language from article toolbar - https://phabricator.wikimedia.org/T425545#11919943 (ABorbaWMF) Appears to be fixed on 50586-r-2026...
[21:06:52] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs: Restart for upgrade to JVM 11.0.31 - eevans@cumin1003
[21:07:18] (CR) Bking: [C:+1] opensearch-ttmserver: switch to opensearch 2 [deployment-charts] - https://gerrit.wikimedia.org/r/1286957 (https://phabricator.wikimedia.org/T425377) (owner: Atsuko)
[21:12:04] !log remapping thumbsize of 0 to 2 in all group0 wikis (T376152)
[21:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:08] T376152: Evaluate feasibility of deprecating (or limiting) user media size preferences - https://phabricator.wikimedia.org/T376152
[21:25:17] (PS1) Ahmon Dancy: .gitignore: Add /static/hcaptcha/ [mediawiki-config] - https://gerrit.wikimedia.org/r/1287026 (https://phabricator.wikimedia.org/T403829)
[21:26:18] (PS1) Bking: OpenSearch on K8s: change services proxy upstreams [puppet] - https://gerrit.wikimedia.org/r/1287027 (https://phabricator.wikimedia.org/T421293)
[21:32:59] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1287027 (https://phabricator.wikimedia.org/T421293) (owner: Bking)
[21:37:33] (CR) Bking: [C:+2] OpenSearch on K8s: change services proxy upstreams [puppet] - https://gerrit.wikimedia.org/r/1287027 (https://phabricator.wikimedia.org/T421293) (owner: Bking)
[21:56:05] Amir1: back
[22:00:04] Deploy window Readers deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260513T2200)
[22:06:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:07:13] !incidents
[22:07:14] 7928 (UNACKED) [3x] ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet)
[22:07:14] 7926 (RESOLVED) ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet magru)
[22:07:32] !ack 7928
[22:07:32] 7928 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet)
[22:13:26] FIRING: [20x] SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:16:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:17:12] !incidents
[22:17:13] 7928 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet)
[22:17:13] 7926 (RESOLVED) ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet magru)
[22:21:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:41:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:42:10] !incidents
[22:42:11] 7928 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet)
[22:42:11] 7926 (RESOLVED) ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet magru)
[22:43:00] (PS1) Dzahn: zuul: make all 
service_ensures dependent on a single active server [puppet] - https://gerrit.wikimedia.org/r/1287035
[22:44:23] (PS2) Dzahn: zuul: make all service_ensures dependent on a single active server [puppet] - https://gerrit.wikimedia.org/r/1287035 (https://phabricator.wikimedia.org/T395938)
[22:46:13] (CR) Dzahn: "What makes the most sense? Should we do it like this and define one single "active server" (per role, so one for "main" and one for "execu" [puppet] - https://gerrit.wikimedia.org/r/1287035 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[22:56:51] FIRING: [5x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:57:05] !incidents
[22:57:06] 7928 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet)
[22:57:06] 7926 (RESOLVED) ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet magru)
[23:06:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:10:30] (PS1) Santiago Faci: Test Kitchen UI: Deploy v1.3.4 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1287039 (https://phabricator.wikimedia.org/T393434)
[23:11:03] !incidents
[23:11:04] 7928 (ACKED) [3x] ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet)
[23:11:04] 7926 (RESOLVED) ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet magru)
[23:14:26] (PS2) Santiago Faci: Test Kitchen UI: Deploy v1.3.4 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1287039 (https://phabricator.wikimedia.org/T393434)
[23:26:51] FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from 
performance.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:30:02] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1286528 (owner: TrainBranchBot)
[23:31:51] FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:36:51] RESOLVED: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from performance.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:38:32] !incidents
[23:38:32] 7928 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet)
[23:38:32] 7926 (RESOLVED) ATSBackendErrorsHigh cache_text sre (performance.discovery.wmnet magru)
[23:39:59] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1287041
[23:39:59] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1287041 (owner: TrainBranchBot)
[23:52:06] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1287041 (owner: TrainBranchBot)