[00:00:34] (PS5) Ahmon Dancy: python-build/bookworm/Dockerfile.template: Modernize [docker-images/production-images] - https://gerrit.wikimedia.org/r/1174507
[00:06:02] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[00:08:00] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1043.eqiad.wmnet with OS bookworm
[00:08:09] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1175618
[00:08:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1175618 (owner: TrainBranchBot)
[00:08:12] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059461 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[00:08:29] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1043.eqiad.wmnet with OS bookworm
[00:08:42] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bookworm
[00:09:52] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbprov2007 to codfw - jhancock@cumin1003"
[00:09:56] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbprov2007 to codfw - jhancock@cumin1003"
[00:09:57] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:10:05] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dbprov2007
[00:10:14] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbprov2007
[00:10:55] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host dbprov2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:16:58] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:22:00] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1043.eqiad.wmnet with OS bookworm
[00:22:14] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059465 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[00:25:55] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:28:25] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1043.eqiad.wmnet with OS bookworm
[00:28:42] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059475 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bookworm
[00:29:32] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:30:46] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1175618 (owner: TrainBranchBot)
[00:38:42] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host dbprov2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:47:33] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1043.eqiad.wmnet with reason: host reimage
[00:51:07] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:53:57] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1043.eqiad.wmnet with reason: host reimage
[00:59:19] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov2007']
[01:00:30] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbprov2007']
[01:00:39] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:03:59] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host dbprov2007.codfw.wmnet with OS bookworm
[01:04:10] ops-codfw, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11059488 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host dbprov2007.codfw.wmnet with OS bookworm
[01:06:27] (PS1) Krinkle: Profiler: Support php-xhprof besides php-tideways-xhprof [mediawiki-config] - https://gerrit.wikimedia.org/r/1175620 (https://phabricator.wikimedia.org/T401152)
[01:06:29] (PS1) Krinkle: Profiler: Remove support for php-tideways_xhprof [mediawiki-config] - https://gerrit.wikimedia.org/r/1175621 (https://phabricator.wikimedia.org/T401152)
[01:07:15] (CR) CI reject: [V:-1] Profiler: Support php-xhprof besides php-tideways-xhprof [mediawiki-config] - https://gerrit.wikimedia.org/r/1175620 (https://phabricator.wikimedia.org/T401152) (owner: Krinkle)
[01:07:56] (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.13 [core] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1175622 (https://phabricator.wikimedia.org/T396374)
[01:07:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.13 [core] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1175622 (https://phabricator.wikimedia.org/T396374) (owner: TrainBranchBot)
[01:10:28] (PS2) Krinkle: Profiler: Add php-xhprof support besides php-tideways_xhprof [mediawiki-config] - https://gerrit.wikimedia.org/r/1175620 (https://phabricator.wikimedia.org/T401152)
[01:10:29] (PS2) Krinkle: Profiler: Remove support for php-tideways_xhprof [mediawiki-config] - https://gerrit.wikimedia.org/r/1175621 (https://phabricator.wikimedia.org/T401152)
[01:11:37] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 10m 57s)
[01:13:11] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[01:13:35] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[01:16:16] vriley@cumin1002 reimage (PID 1476872) is awaiting input
[01:16:19] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:16:40] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dbprov2007
[01:17:36] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbprov2007
[01:19:48] (CR) CI reject: [V:-1] Branch commit for wmf/1.45.0-wmf.13 [core] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1175622 (https://phabricator.wikimedia.org/T396374) (owner: TrainBranchBot)
[01:20:12] ops-codfw, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11059494 (Jhancock.wm) a:Jhancock.wm
[01:22:09] (PS1) Krinkle: mediawiki: install php8.1-xhprof [puppet] - https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152)
[01:22:29] ops-codfw, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11059497 (Jhancock.wm) note to self: configured the wrong port on the switch. need to delete and redo. should be quick.
[01:34:23] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[01:34:25] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1043.eqiad.wmnet with OS bookworm
[01:34:42] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bookworm completed: - cloudcephosd1043 (**PASS**...
[01:37:32] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[01:41:57] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[01:45:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:47:22] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[01:47:31] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059512 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[01:55:08] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T0200)
[02:09:00] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[02:15:59] vriley@cumin1002 reimage (PID 1487681) is awaiting input
[02:19:39] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[02:19:54] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059518 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104...
[02:20:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[02:23:58] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[02:24:08] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059519 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[02:24:18] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov2007.codfw.wmnet with OS bookworm
[02:24:27] ops-codfw, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11059520 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host dbprov2007.codfw.wmnet with OS bookworm executed with errors: - dbprov20...
[02:29:32] FIRING: [2x] ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:36:13] PROBLEM - Disk space on an-worker1122 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 184992 MB (4% inode=99%): /var/lib/hadoop/data/e 171516 MB (4% inode=99%): /var/lib/hadoop/data/f 149054 MB (3% inode=99%): /var/lib/hadoop/data/b 165010 MB (4% inode=99%): /var/lib/hadoop/data/g 161388 MB (4% inode=99%): /var/lib/hadoop/data/d 157763 MB (4% inode=99%): /var/lib/hadoop/data/j 165477 MB (4% inode=99%): /var/lib/hadoop/data
[02:36:13] 6 MB (4% inode=99%): /var/lib/hadoop/data/h 167159 MB (4% inode=99%): /var/lib/hadoop/data/l 159112 MB (4% inode=99%): /var/lib/hadoop/data/k 151380 MB (4% inode=99%): /var/lib/hadoop/data/m 160107 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1122&var-datasource=eqiad+prometheus/ops
[02:37:09] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[02:41:14] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[02:41:29] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[02:41:36] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059525 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104...
[02:45:36] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[02:45:50] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059526 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[02:48:32] (PS1) Andrew Bogott: Add k3s class for installing k3s on a single cloud-vps node [puppet] - https://gerrit.wikimedia.org/r/1175625 (https://phabricator.wikimedia.org/T393782)
[02:48:34] (PS1) Andrew Bogott: Add puppet class and profile to create k3s cluster-api worker for Magnum [puppet] - https://gerrit.wikimedia.org/r/1175626 (https://phabricator.wikimedia.org/T393782)
[02:48:57] (CR) CI reject: [V:-1] Add k3s class for installing k3s on a single cloud-vps node [puppet] - https://gerrit.wikimedia.org/r/1175625 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[02:49:08] (CR) CI reject: [V:-1] Add puppet class and profile to create k3s cluster-api worker for Magnum [puppet] - https://gerrit.wikimedia.org/r/1175626 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[02:51:41] (PS2) Andrew Bogott: Add k3s class for installing k3s on a single cloud-vps node [puppet] - https://gerrit.wikimedia.org/r/1175625 (https://phabricator.wikimedia.org/T393782)
[02:51:41] (PS2) Andrew Bogott: Add puppet class and profile to create k3s cluster-api worker for Magnum [puppet] - https://gerrit.wikimedia.org/r/1175626 (https://phabricator.wikimedia.org/T393782)
[02:52:05] (CR) CI reject: [V:-1] Add k3s class for installing k3s on a single cloud-vps node [puppet] - https://gerrit.wikimedia.org/r/1175625 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[02:52:17] (CR) CI reject: [V:-1] Add puppet class and profile to create k3s cluster-api worker for Magnum [puppet] - https://gerrit.wikimedia.org/r/1175626 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[02:55:30] (PS3) Andrew Bogott: Add k3s class for installing k3s on a single cloud-vps node [puppet] - https://gerrit.wikimedia.org/r/1175625 (https://phabricator.wikimedia.org/T393782)
[02:55:30] (PS3) Andrew Bogott: Add puppet class and profile to create k3s cluster-api worker for Magnum [puppet] - https://gerrit.wikimedia.org/r/1175626 (https://phabricator.wikimedia.org/T393782)
[02:55:55] (CR) CI reject: [V:-1] Add k3s class for installing k3s on a single cloud-vps node [puppet] - https://gerrit.wikimedia.org/r/1175625 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[02:56:06] (CR) CI reject: [V:-1] Add puppet class and profile to create k3s cluster-api worker for Magnum [puppet] - https://gerrit.wikimedia.org/r/1175626 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[02:57:40] (PS4) Andrew Bogott: Add k3s class for installing k3s on a single cloud-vps node [puppet] - https://gerrit.wikimedia.org/r/1175625 (https://phabricator.wikimedia.org/T393782)
[02:57:40] (PS4) Andrew Bogott: Add puppet class and profile to create k3s cluster-api worker for Magnum [puppet] - https://gerrit.wikimedia.org/r/1175626 (https://phabricator.wikimedia.org/T393782)
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T0300)
[03:06:42] vriley@cumin1002 reimage (PID 1494585) is awaiting input
[03:07:45] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[03:07:58] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[03:08:42] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[03:08:53] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[03:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:28:58] vriley@cumin1002 reimage (PID 1497778) is awaiting input
[03:44:31] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[03:44:40] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[03:44:59] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:45:59] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059538 (VRiley-WMF)
[03:49:06] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:52:55] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:53:14] (CR) Thcipriani: [C:+2] Branch commit for wmf/1.45.0-wmf.13 [core] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1175622 (https://phabricator.wikimedia.org/T396374) (owner: TrainBranchBot)
[03:57:00] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[03:57:27] (Merged) jenkins-bot: Branch commit for wmf/1.45.0-wmf.13 [core] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1175622 (https://phabricator.wikimedia.org/T396374) (owner: TrainBranchBot)
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T0400)
[04:01:55] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.10 (duration: 01m 53s)
[04:02:48] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[04:06:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[04:07:35] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[04:07:47] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059553 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[04:27:20] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[04:27:35] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059558 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[04:28:43] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059561 (VRiley-WMF) cloudcephosd1043 was able to finish with "bookworm"; however, cloudcephosd1042 is still having issues.
[04:45:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:45:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:50:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:02:40] (PS1) Tim Starling: Authorize self for Google Search Console [mediawiki-config] - https://gerrit.wikimedia.org/r/1175631 (https://phabricator.wikimedia.org/T400023)
[05:09:44] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:55:08] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T0600)
[06:00:05] marostegui, Amir1, and federico3: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T0600).
[06:09:44] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:14:44] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:29:32] FIRING: [2x] ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:33:26] (CR) Jforrester: [C:+1] "Looks sensible. Not tested locally." [mediawiki-config] - https://gerrit.wikimedia.org/r/1175620 (https://phabricator.wikimedia.org/T401152) (owner: Krinkle)
[06:33:36] !log repooling wdqs1016
[06:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:27] !log restarting blazegraph on wdqs1021 (stuck)
[06:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:17] RESOLVED: [2x] ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:43:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:48:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:00:04] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T0700)
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:00:09] !log repooling wdqs1021
[07:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:35] (PS1) Slyngshede: data.yaml: offboarding lmata [puppet] - https://gerrit.wikimedia.org/r/1175779
[07:03:17] (CR) Slyngshede: [C:+2] data.yaml: offboarding lmata [puppet] - https://gerrit.wikimedia.org/r/1175779 (owner: Slyngshede)
[07:07:26] (PS1) Jelto: add more providers to fetch_external_clouds:vendors_nets.py [puppet] - https://gerrit.wikimedia.org/r/1175781 (https://phabricator.wikimedia.org/T401003)
[07:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:36:02] (CR) Hashar: "Usually the `google-site-verification` is added to DNS. There are plenty of them there. In `templates/wikimedia.org` you could add:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175631 (https://phabricator.wikimedia.org/T400023) (owner: Tim Starling)
[07:37:12] (PS1) Slyngshede: data.yaml: add users as ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1175839 (https://phabricator.wikimedia.org/T400374)
[07:37:54] (CR) CI reject: [V:-1] data.yaml: add users as ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1175839 (https://phabricator.wikimedia.org/T400374) (owner: Slyngshede)
[07:39:21] (PS2) Slyngshede: data.yaml: add users as ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1175839 (https://phabricator.wikimedia.org/T400374)
[08:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T0800)
[08:04:18] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[08:05:50] o/
[08:08:21] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: codfw Nokia switches mgmt - ayounsi@cumin1003"
[08:08:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:26] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: codfw Nokia switches mgmt - ayounsi@cumin1003"
[08:08:26] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:09:07] well
[08:09:20] I have to figure out why testwikis did not get updated overnight
[08:11:45] (PS1) Giuseppe Lavagetto: varnish: refactor inclusion of requestctl rules [puppet] - https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058)
[08:12:10] (CR) CI reject: [V:-1] varnish: refactor inclusion of requestctl rules [puppet] - https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: Giuseppe Lavagetto)
[08:12:37] !log jelto@cumin1003 START - Cookbook sre.hosts.reimage for host gitlab2002.wikimedia.org with OS bookworm
[08:13:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:43] hashar: the branch cut failed, but Tyler merged the change manually last night: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1175622 I think we can try running the pre-sync again
[08:15:01] ahhh
[08:15:26] so we got the new branch at least
[08:15:33] yeah
[08:16:26] (PS1) Tim Starling: Authorize self for Google Search Console [dns] - https://gerrit.wikimedia.org/r/1175842 (https://phabricator.wikimedia.org/T400023)
[08:16:43] PROBLEM - Host gitlab-replica-b.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[08:16:57] ^ expected, reimage
[08:17:07] (CR) CI reject: [V:-1] Authorize self for Google Search Console [dns] - https://gerrit.wikimedia.org/r/1175842 (https://phabricator.wikimedia.org/T400023) (owner: Tim Starling)
[08:17:53] (PS1) TrainBranchBot: testwikis to 1.45.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1175843 (https://phabricator.wikimedia.org/T396374)
[08:17:55] (CR) TrainBranchBot: [C:+2] testwikis to 1.45.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1175843 (https://phabricator.wikimedia.org/T396374) (owner: TrainBranchBot)
[08:18:08] !log train: sudo systemctl start train-presync # T396374
[08:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:11] T396374: 1.45.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T396374
[08:18:33] jnuche: turns out my sudo rule to rerun the train-presync works :]
[08:18:44] (Merged) jenkins-bot: testwikis to 1.45.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1175843 (https://phabricator.wikimedia.org/T396374) (owner: TrainBranchBot)
[08:18:56] 🎉
[08:19:07] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.13 refs T396374
[08:20:35] (PS2) Tim Starling: Authorize self for Google Search Console [dns] - https://gerrit.wikimedia.org/r/1175842 (https://phabricator.wikimedia.org/T400023)
[08:21:16] (CR) CI reject: [V:-1] Authorize self for Google Search Console [dns] - https://gerrit.wikimedia.org/r/1175842 (https://phabricator.wikimedia.org/T400023) (owner: Tim Starling)
[08:23:37] (Abandoned) Tim Starling: Authorize self for Google Search Console [dns] - https://gerrit.wikimedia.org/r/1175842 (https://phabricator.wikimedia.org/T400023) (owner: Tim Starling)
[08:25:02] (03CR) 10Hashar: [C:03+1] "I have suggested to add it to `operations/dns` but gdnsd rejects the TXT field with:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175631 (https://phabricator.wikimedia.org/T400023) (owner: 10Tim Starling) [08:29:44] the image is being published [08:31:20] trains <3 [08:33:23] ^ agreed [08:36:50] jnuche: thanks for the ping earlier about the branch cut failure, I might not have noticed it [08:37:53] hashar: np! [08:38:20] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f2-codfw [08:38:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f2-codfw [08:45:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:47:24] 07Puppet, 10Beta-Cluster-Infrastructure, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet configures kernel.core_pattern |/usr/lib/systemd/systemd-coredump, but systemd-coredump is not installed - https://phabricator.wikimedia.org/T400247#11059899 (10Lucas_Werkmeister_WMDE) a:05Lucas_Werkmeister_W... 
[08:50:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:54:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1158.eqiad.wmnet with reason: Maintenance [08:54:04] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f4-codfw [08:54:15] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f4-codfw [08:54:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:54:20] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f4-codfw [08:54:20] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f4-codfw [08:54:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T399728)', diff saved to https://phabricator.wikimedia.org/P80783 and previous config saved to /var/cache/conftool/dbconfig/20250805-085424-fceratto.json [08:54:28] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [08:54:30] 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06MediaWiki-Platform-Team, 06Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11059908 (10Joe) >>! In T400881#11057528, @Tgr wrote:... 
[08:55:31] (03CR) 10Brouberol: [C:03+1] "/waves goodbye to all of that" [puppet] - 10https://gerrit.wikimedia.org/r/1175548 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [08:56:15] (03PS1) 10Btullis: Stop deploying mediawiki to deployment-snapshot05 in beta [puppet] - 10https://gerrit.wikimedia.org/r/1175845 (https://phabricator.wikimedia.org/T398438) [08:56:56] (03CR) 10Brouberol: [C:03+1] Complete cleanup of clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1175530 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [08:56:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T399728)', diff saved to https://phabricator.wikimedia.org/P80784 and previous config saved to /var/cache/conftool/dbconfig/20250805-085658-fceratto.json [08:58:03] (03CR) 10Btullis: [C:03+2] Stop deploying mediawiki to deployment-snapshot05 in beta [puppet] - 10https://gerrit.wikimedia.org/r/1175845 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [08:59:14] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e5-codfw [08:59:19] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.13 refs T396374 (duration: 40m 12s) [08:59:23] T396374: 1.45.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T396374 [08:59:23] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e5-codfw [08:59:41] (03PS3) 10Btullis: Complete cleanup of clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1175530 (https://phabricator.wikimedia.org/T398438) [08:59:41] (03PS3) 10Btullis: Remove all puppet code related to snapshot and dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/1175548 (https://phabricator.wikimedia.org/T398438) [08:59:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:00:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175631 (https://phabricator.wikimedia.org/T400023) (owner: 10Tim Starling) [09:01:39] (03Merged) 10jenkins-bot: Authorize self for Google Search Console [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175631 (https://phabricator.wikimedia.org/T400023) (owner: 10Tim Starling) [09:02:16] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1175631|Authorize self for Google Search Console (T400023)]] [09:02:19] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023 [09:02:35] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e4-codfw [09:02:48] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e4-codfw [09:07:38] !log hashar@deploy1003 tstarling, hashar: Backport for [[gerrit:1175631|Authorize self for Google Search Console (T400023)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:07:41] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023 [09:07:57] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e2-codfw [09:08:07] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e2-codfw [09:12:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P80785 and previous config saved to /var/cache/conftool/dbconfig/20250805-091206-fceratto.json [09:12:48] !log hashar@deploy1003 tstarling, hashar: Continuing with sync [09:13:00] verified with XWikimediaDebug [09:13:41] (03CR) 10Vgutierrez: [C:03+1] "looks good, please solve the merge conflicts" [puppet] - 10https://gerrit.wikimedia.org/r/1152311 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [09:14:24] 06SRE, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T401118#11059989 (10Sadiya.Mohammed_WMDE) [09:16:59] (03PS1) 10Majavah: P:toolforge::static: Set absolute_redirect off, not on [puppet] - 10https://gerrit.wikimedia.org/r/1175846 (https://phabricator.wikimedia.org/T401024) [09:17:29] (03CR) 10Btullis: [C:03+2] Complete cleanup of clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1175530 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [09:19:52] (03CR) 10Majavah: [C:03+2] P:toolforge::static: Set absolute_redirect off, not on [puppet] - 10https://gerrit.wikimedia.org/r/1175846 (https://phabricator.wikimedia.org/T401024) (owner: 10Majavah) [09:20:07] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175631|Authorize self for Google Search Console (T400023)]] (duration: 17m 50s) [09:20:10] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023 [09:23:30] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.13 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1175848 (https://phabricator.wikimedia.org/T396374) [09:23:32] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175848 (https://phabricator.wikimedia.org/T396374) (owner: 10TrainBranchBot) [09:23:49] (03PS4) 10Giuseppe Lavagetto: text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) [09:24:16] (03CR) 10Vgutierrez: [C:03+1] "After a second read of https://varnish-cache.org/docs/trunk/reference/vmod.html#private-pointers-and-objects it looks like VRT_priv_task_g" [puppet] - 10https://gerrit.wikimedia.org/r/1152118 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [09:24:26] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175848 (https://phabricator.wikimedia.org/T396374) (owner: 10TrainBranchBot) [09:27:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P80786 and previous config saved to /var/cache/conftool/dbconfig/20250805-092714-fceratto.json [09:28:54] (03PS1) 10Tim Starling: Revert "Authorize self for Google Search Console" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175850 [09:29:02] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f4-codfw [09:29:02] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f4-codfw [09:30:31] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f4-codfw [09:30:31] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f4-codfw [09:31:23] 06SRE, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T401118#11060043 (10Sadiya.Mohammed_WMDE) [09:31:42] !log hashar@deploy1003 rebuilt 
and synchronized wikiversions files: group0 to 1.45.0-wmf.13 refs T396374 [09:31:45] T396374: 1.45.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T396374 [09:33:21] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f4-codfw [09:33:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f4-codfw [09:34:22] !log jelto@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2002.wikimedia.org with OS bookworm [09:35:17] (03CR) 10Clément Goubert: mw::maintenance: ExperimentationLab periodic job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [09:36:11] (03CR) 10Vgutierrez: text-frontend: enforcement of UA policy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [09:37:24] !log jelto@cumin1003 START - Cookbook sre.hosts.reimage for host gitlab2002.wikimedia.org with OS bookworm [09:39:57] (03PS1) 10Tim Starling: In robots.txt permit access to the sitemap API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175851 (https://phabricator.wikimedia.org/T400023) [09:42:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T399728)', diff saved to https://phabricator.wikimedia.org/P80787 and previous config saved to /var/cache/conftool/dbconfig/20250805-094221-fceratto.json [09:42:25] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [09:42:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:42:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T399728)', diff saved to https://phabricator.wikimedia.org/P80788 and previous config saved to 
/var/cache/conftool/dbconfig/20250805-094244-fceratto.json [09:45:15] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-e4-codfw [09:45:31] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e4-codfw [09:45:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T399728)', diff saved to https://phabricator.wikimedia.org/P80789 and previous config saved to /var/cache/conftool/dbconfig/20250805-094533-fceratto.json [09:47:53] (03CR) 10Vgutierrez: [C:04-1] "linter doesn't care about rate VS irate AFAIK, we got the same issues on other metrics that aren't frequently reported, the fix here AFAIK" [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [09:51:17] !log ayounsi@cumin1003 START - Cookbook sre.network.tls for network device lsw1-f2-codfw [09:51:34] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f2-codfw [09:55:41] !log jelto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2002.wikimedia.org with reason: host reimage [09:56:13] (03CR) 10Btullis: [C:03+2] Remove all puppet code related to snapshot and dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/1175548 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [09:59:20] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2002.wikimedia.org with reason: host reimage [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T1000) [10:00:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P80790 and previous config saved to /var/cache/conftool/dbconfig/20250805-100040-fceratto.json [10:03:01] (03CR) 10Hashar: [C:03+2] In robots.txt permit access to 
the sitemap API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175851 (https://phabricator.wikimedia.org/T400023) (owner: 10Tim Starling) [10:03:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175851 (https://phabricator.wikimedia.org/T400023) (owner: 10Tim Starling) [10:03:49] (03Merged) 10jenkins-bot: In robots.txt permit access to the sitemap API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175851 (https://phabricator.wikimedia.org/T400023) (owner: 10Tim Starling) [10:04:12] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1175851|In robots.txt permit access to the sitemap API (T400023 T396684)]] [10:04:17] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023 [10:04:17] T396684: Sitemaps API - https://phabricator.wikimedia.org/T396684 [10:06:01] !log hashar@deploy1003 tstarling, hashar: Backport for [[gerrit:1175851|In robots.txt permit access to the sitemap API (T400023 T396684)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:06:57] !log hashar@deploy1003 tstarling, hashar: Continuing with sync [10:08:38] RECOVERY - Host gitlab-replica-b.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 30.48 ms [10:09:15] (03PS5) 10Vgutierrez: text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [10:09:33] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts snapshot1010.eqiad.wmnet [10:10:29] (03CR) 10Vgutierrez: text-frontend: enforcement of UA policy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [10:12:13] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175851|In robots.txt permit access to the sitemap API (T400023 T396684)]] (duration: 08m 01s) [10:12:18] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023 [10:12:18] T396684: Sitemaps API - https://phabricator.wikimedia.org/T396684 [10:13:10] btullis@cumin1003 decommission (PID 673803) is awaiting input [10:15:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P80791 and previous config saved to /var/cache/conftool/dbconfig/20250805-101548-fceratto.json [10:18:23] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [10:20:32] PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 149929 MB (3% inode=99%): /var/lib/hadoop/data/h 166795 MB (4% inode=99%): /var/lib/hadoop/data/b 160430 MB (4% inode=99%): /var/lib/hadoop/data/k 157139 MB (4% inode=99%): /var/lib/hadoop/data/m 158301 MB (4% inode=99%): /var/lib/hadoop/data/f 163991 MB (4% inode=99%): /var/lib/hadoop/data/j 166531 MB (4% inode=99%): /var/lib/hadoop/data [10:20:32] 0 MB (4% inode=99%): /var/lib/hadoop/data/l 164455 MB (4% inode=99%): /var/lib/hadoop/data/i 164691 MB 
(4% inode=99%): /var/lib/hadoop/data/g 164070 MB (4% inode=99%): /var/lib/hadoop/data/c 157459 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops [10:20:58] (03CR) 10Vgutierrez: "Fixed the last batch of syntax errors.. text varnishtests are now happy: `0 tests failed, 0 tests skipped, 39 tests passed`" [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [10:21:02] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [10:23:08] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2002.wikimedia.org with OS bookworm [10:24:03] btullis@cumin1003 decommission (PID 673803) is awaiting input [10:24:24] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174971 (owner: 10PipelineBot) [10:24:28] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174528 (owner: 10PipelineBot) [10:24:31] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173994 (owner: 10PipelineBot) [10:24:36] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173373 (owner: 10PipelineBot) [10:24:40] (03Abandoned) 10Jforrester: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170629 (owner: 10PipelineBot) [10:24:44] FIRING: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:27:34] PROBLEM - Host gitlab2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:28:04] RECOVERY - Host gitlab2002 is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms [10:29:38] ^ expected [10:30:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T399728)', diff saved to https://phabricator.wikimedia.org/P80792 and previous config saved to /var/cache/conftool/dbconfig/20250805-103055-fceratto.json [10:31:02] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:31:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:32:06] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:32:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T399728)', diff saved to https://phabricator.wikimedia.org/P80793 and previous config saved to /var/cache/conftool/dbconfig/20250805-103213-fceratto.json [10:34:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T399728)', diff saved to https://phabricator.wikimedia.org/P80794 and previous config saved to /var/cache/conftool/dbconfig/20250805-103451-fceratto.json [10:36:05] !log Ran fixStuckGlobalRename.php for T400974 [10:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:08] T400974: Unblock stuck global rename of Renamed user 5401aafa5557bf5c36b752af3b938b14 - https://phabricator.wikimedia.org/T400974 [10:39:55] !log Ran fixStuckGlobalRename.php for T400862 [10:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:58] T400862: Unblock stuck global rename of Renamed user f74bdbce92f61493475fa5230c4922b0 - 
https://phabricator.wikimedia.org/T400862 [10:47:12] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: snapshot1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [10:47:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: snapshot1010.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [10:47:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:47:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts snapshot1010.eqiad.wmnet [10:48:53] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): decommission snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T401182#11060298 (10BTullis) [10:49:09] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): decommission snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T401182#11060303 (10BTullis) a:05BTullis→03None [10:49:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P80795 and previous config saved to /var/cache/conftool/dbconfig/20250805-104959-fceratto.json [10:50:05] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts dumpsdata1003.eqiad.wmnet [10:55:42] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [10:56:05] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@62138e1] (releasing): T401180 [10:56:08] T401180: Jenkins fails to connect to node on releases2003 - https://phabricator.wikimedia.org/T401180 [10:56:37] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@62138e1] (releasing): T401180 (duration: 00m 32s) [10:57:15] xSavitar: fyi 
there's now a --sal flag for mwscript-k8s :) see https://wikitech.wikimedia.org/wiki/Maintenance_scripts#Starting_a_maintenance_script [11:00:21] oooooh [11:01:23] btullis@cumin1003 decommission (PID 679746) is awaiting input [11:05:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P80796 and previous config saved to /var/cache/conftool/dbconfig/20250805-110506-fceratto.json [11:06:13] claime, oh nice. Thank you very much! I'll update our docs accordingly. [11:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:10:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:12:02] btullis@cumin1003 decommission (PID 679746) is awaiting input [11:12:25] (03PS1) 10Ayounsi: Add hostname to a couple errors [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1175867 [11:14:50] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dumpsdata1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [11:15:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dumpsdata1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [11:15:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:15:10] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for 
hosts dumpsdata1003.eqiad.wmnet [11:20:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T399728)', diff saved to https://phabricator.wikimedia.org/P80797 and previous config saved to /var/cache/conftool/dbconfig/20250805-112014-fceratto.json [11:20:18] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:20:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:20:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T399728)', diff saved to https://phabricator.wikimedia.org/P80798 and previous config saved to /var/cache/conftool/dbconfig/20250805-112036-fceratto.json [11:23:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T399728)', diff saved to https://phabricator.wikimedia.org/P80799 and previous config saved to /var/cache/conftool/dbconfig/20250805-112312-fceratto.json [11:24:40] (03PS1) 10Ayounsi: Replace SONIC grpc port with Nokia's in MR ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/1175872 [11:27:16] (03PS1) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) [11:27:39] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [11:27:50] (03CR) 10CI reject: [V:04-1] swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [11:38:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P80800 and previous config saved to 
/var/cache/conftool/dbconfig/20250805-113820-fceratto.json [11:47:57] (03CR) 10Bartosz Wójtowicz: [C:03+1] "Thank you for the updates <3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175478 (https://phabricator.wikimedia.org/T400349) (owner: 10Gkyziridis) [11:48:20] (03CR) 10Bartosz Wójtowicz: [C:03+1] "LGTM, thanks <3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175477 (https://phabricator.wikimedia.org/T400347) (owner: 10Gkyziridis) [11:48:39] (03PS1) 10Hnowlan: profile::hcaptcha: add missing private config to subdomains [puppet] - 10https://gerrit.wikimedia.org/r/1175876 [11:51:22] (03PS2) 10Hnowlan: profile::hcaptcha: add missing private configs to subdomains [puppet] - 10https://gerrit.wikimedia.org/r/1175876 [11:52:22] (03PS1) 10Jelto: gerrit: accept wikimania traffic [puppet] - 10https://gerrit.wikimedia.org/r/1175877 [11:53:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P80801 and previous config saved to /var/cache/conftool/dbconfig/20250805-115327-fceratto.json [11:54:18] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6497/co" [puppet] - 10https://gerrit.wikimedia.org/r/1175877 (owner: 10Jelto) [11:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:56:08] (03CR) 10Jelto: [V:03+1 C:03+2] "this can be reverted after wikimania" [puppet] - 10https://gerrit.wikimedia.org/r/1175877 (owner: 10Jelto) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T1200) [12:03:56] (03PS2) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks 
[puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) [12:04:35] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy latest image for langid on production. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175477 (https://phabricator.wikimedia.org/T400347) (owner: 10Gkyziridis) [12:06:33] (03Merged) 10jenkins-bot: ml-services: Deploy latest image for langid on production. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175477 (https://phabricator.wikimedia.org/T400347) (owner: 10Gkyziridis) [12:08:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T399728)', diff saved to https://phabricator.wikimedia.org/P80802 and previous config saved to /var/cache/conftool/dbconfig/20250805-120835-fceratto.json [12:08:38] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:08:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1191.eqiad.wmnet with reason: Maintenance [12:08:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T399728)', diff saved to https://phabricator.wikimedia.org/P80803 and previous config saved to /var/cache/conftool/dbconfig/20250805-120857-fceratto.json [12:11:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T399728)', diff saved to https://phabricator.wikimedia.org/P80805 and previous config saved to /var/cache/conftool/dbconfig/20250805-121132-fceratto.json [12:11:40] (03PS2) 10Krinkle: mediawiki: install php8.1-xhprof on beta cluster and mwdebug [puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) [12:11:43] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [12:11:51] (03PS3) 10Krinkle: mediawiki: install 
php8.1-xhprof on beta cluster and mwdebug [puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) [12:11:53] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [12:15:38] (03PS3) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) [12:16:04] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [12:20:32] PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 147738 MB (3% inode=99%): /var/lib/hadoop/data/h 158630 MB (4% inode=99%): /var/lib/hadoop/data/b 158345 MB (4% inode=99%): /var/lib/hadoop/data/k 158809 MB (4% inode=99%): /var/lib/hadoop/data/m 158315 MB (4% inode=99%): /var/lib/hadoop/data/f 168121 MB (4% inode=99%): /var/lib/hadoop/data/j 158768 MB (4% inode=99%): /var/lib/hadoop/data [12:20:32] 7 MB (4% inode=99%): /var/lib/hadoop/data/l 157798 MB (4% inode=99%): /var/lib/hadoop/data/i 159412 MB (4% inode=99%): /var/lib/hadoop/data/g 157984 MB (4% inode=99%): /var/lib/hadoop/data/c 159579 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops [12:26:16] (03PS4) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) [12:26:27] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [12:26:27] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' 
. [12:26:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P80806 and previous config saved to /var/cache/conftool/dbconfig/20250805-122640-fceratto.json [12:26:51] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [12:29:41] (03CR) 10Elukey: "Left some comments, not sure how much we can/should modify this script but some parts are probably not really needed and potentially confu" [puppet] - 10https://gerrit.wikimedia.org/r/1175141 (https://phabricator.wikimedia.org/T401013) (owner: 10Ayounsi) [12:30:13] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy latest images for articletopic-outlink-model on production. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175478 (https://phabricator.wikimedia.org/T400349) (owner: 10Gkyziridis) [12:31:42] (03PS3) 10Elukey: pyrra::isito: add revision label [puppet] - 10https://gerrit.wikimedia.org/r/1175564 (owner: 10Herron) [12:31:48] (03PS1) 10Jgreen: Update nsca_frack.cfg.erb remove frdb1003, add frdb1007 [puppet] - 10https://gerrit.wikimedia.org/r/1175886 (https://phabricator.wikimedia.org/T369922) [12:32:24] (03Merged) 10jenkins-bot: ml-services: Deploy latest images for articletopic-outlink-model on production. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1175478 (https://phabricator.wikimedia.org/T400349) (owner: 10Gkyziridis) [12:34:37] (03PS1) 10Ayounsi: gNMI: initial Nokia support [puppet] - 10https://gerrit.wikimedia.org/r/1175887 [12:35:12] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): decommission dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T401181#11060595 (10BTullis) a:05BTullis→03None [12:35:16] (03PS5) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) [12:35:24] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [12:35:25] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:35:39] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:38:28] (03PS2) 10Ayounsi: gNMI: initial Nokia support [puppet] - 10https://gerrit.wikimedia.org/r/1175887 [12:38:58] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175887 (owner: 10Ayounsi) [12:39:38] (03PS1) 10Gkyziridis: ml-services: Deploy revertrisk-language-agnostic latest published image on prod. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1175888 (https://phabricator.wikimedia.org/T400266) [12:40:59] (03PS6) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) [12:41:10] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [12:41:32] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:41:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P80807 and previous config saved to /var/cache/conftool/dbconfig/20250805-124147-fceratto.json [12:42:32] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:44:08] (03CR) 10Filippo Giunchedi: "As of this week I moved teams, thus I'm adding non-OOO o11y folks for review/deploy" [puppet] - 10https://gerrit.wikimedia.org/r/1175886 (https://phabricator.wikimedia.org/T369922) (owner: 10Jgreen) [12:44:33] (03PS1) 10Brouberol: data: define ML-related user and group [puppet] - 10https://gerrit.wikimedia.org/r/1175890 (https://phabricator.wikimedia.org/T400902) [12:45:07] (03CR) 10CI reject: [V:04-1] data: define ML-related user and group [puppet] - 10https://gerrit.wikimedia.org/r/1175890 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [12:45:51] (03PS7) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) [12:46:05] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175874 
(https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [12:46:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:49:39] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [12:50:34] (03PS1) 10Elukey: site.pp: add new cp2xxx hosts as insetup [puppet] - 10https://gerrit.wikimedia.org/r/1175893 (https://phabricator.wikimedia.org/T392851) [12:50:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:52:14] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [12:53:58] (03CR) 10Vgutierrez: [C:03+1] install_server: fix cacheproxy-efi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1175549 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [12:55:41] (03CR) 10Vgutierrez: site.pp: add new cp2xxx hosts as insetup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175893 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [12:56:01] (03PS2) 10Brouberol: data: define ML-related user and group [puppet] - 10https://gerrit.wikimedia.org/r/1175890 (https://phabricator.wikimedia.org/T400902) [12:56:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T399728)', diff saved to https://phabricator.wikimedia.org/P80809 and previous config saved to /var/cache/conftool/dbconfig/20250805-125655-fceratto.json [12:57:03] T399728: Add '*_actor_ip_hex_time' indexes to 
'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:57:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1194.eqiad.wmnet with reason: Maintenance [12:57:15] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175890 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [12:57:16] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175890 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [12:57:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T399728)', diff saved to https://phabricator.wikimedia.org/P80810 and previous config saved to /var/cache/conftool/dbconfig/20250805-125719-fceratto.json [12:59:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T399728)', diff saved to https://phabricator.wikimedia.org/P80811 and previous config saved to /var/cache/conftool/dbconfig/20250805-125952-fceratto.json [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T1300). [13:00:05] jan_drewniak: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] o/ [13:00:42] jan_drewniak: https://gerrit.wikimedia.org/r/c/wikimedia/portals/deploy/+/1175148 looks like it was already merged, does it still need deployment? 
[13:01:22] Lucas_WMDE: I can take care of the portal deployment (it actually needs a config change, that I'm about to upload) [13:01:22] (03PS2) 10Elukey: site.pp: add new cp2xxx hosts as insetup [puppet] - 10https://gerrit.wikimedia.org/r/1175893 (https://phabricator.wikimedia.org/T392851) [13:01:27] ok [13:01:30] (03CR) 10Elukey: site.pp: add new cp2xxx hosts as insetup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175893 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [13:01:40] (03CR) 10Elukey: [C:03+2] install_server: fix cacheproxy-efi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1175549 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [13:01:46] How "big" are our mw app servers these days? and/or the k8s servers that mw now runs on, and/or the mw containers that now run on k8s? (in terms of resources, cpu and memory etc)? [13:02:31] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:02:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:02:42] jan_drewniak: actually, can you give me a moment on the deployment server? I’d like to slightly change one of the currently deployed security patches, and then let that roll out with your scap and check that it works correctly [13:02:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:03:11] (the change would be in test files, so shouldn’t have any effect, I’m just not sure how the git stuff with the patches would work out and so I’d like to include it in a deployment) [13:03:17] Lucas_WMDE: yeah of course, take your time I'm still creating the config patch [13:03:23] ok thanks 👍 [13:04:06] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175895 (https://phabricator.wikimedia.org/T128546) [13:04:19] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2331.codfw.wmnet with OS bookworm [13:04:25] (03PS1) 10Brouberol: global_config: delete airflow external services [puppet] - 10https://gerrit.wikimedia.org/r/1175896 (https://phabricator.wikimedia.org/T390941) [13:05:48] ok, I updated /srv/patches [13:05:58] (03CR) 10Bartosz Wójtowicz: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175888 (https://phabricator.wikimedia.org/T400266) (owner: 10Gkyziridis) [13:06:04] and hopefully scap will just apply the new version of the patch on the core repository [13:06:10] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175896 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:06:47] (03CR) 10Ssingh: "Post-merge +1, worth trying for sure." [puppet] - 10https://gerrit.wikimedia.org/r/1175549 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [13:07:09] okay to deploy as far as I’m concerned [13:08:11] Lucas_WMDE: Ok thanks, going to deploy the portal banners now [13:09:06] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175895 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:09:15] ok, thanks! 
[13:09:41] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175896 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:10:05] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175895 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:10:42] (03PS8) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) [13:10:51] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [13:14:34] (03PS9) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) [13:14:45] (03PS1) 10Brouberol: global_config: delete airflow external services [puppet] - 10https://gerrit.wikimedia.org/r/1175897 (https://phabricator.wikimedia.org/T390941) [13:14:45] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [13:15:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P80812 and previous config saved to /var/cache/conftool/dbconfig/20250805-131500-fceratto.json [13:17:30] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2331.codfw.wmnet with reason: host reimage [13:17:37] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6498/co" [puppet] - 10https://gerrit.wikimedia.org/r/1175897 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:18:05] (03CR) 10Bking: [C:03+1] data: define ML-related user 
and group [puppet] - 10https://gerrit.wikimedia.org/r/1175890 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:18:12] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175896 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:18:40] !log jdrewniak@deploy1003 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1175895| Bumping portals to master (T128546)]] (duration: 07m 07s) [13:18:43] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [13:20:09] (03CR) 10Bking: [C:03+1] global_config: delete airflow external services [puppet] - 10https://gerrit.wikimedia.org/r/1175896 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:20:26] !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1175895| Bumping portals to master (T128546)]] (duration: 01m 45s) [13:23:25] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2331.codfw.wmnet with reason: host reimage [13:23:44] 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06MediaWiki-Platform-Team, 06Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11060698 (10Tgr) So basically we need a ForeignAPIRep... [13:24:49] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#11060700 (10elukey) To keep archives happy - I have provisioned and reimaged with Bookworm wikikube-worker2331 since I don't need it anymore for... 
[13:25:14] jouncebot: nowandnext [13:25:14] For the next 0 hour(s) and 34 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T1300) [13:25:14] In 1 hour(s) and 4 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T1430) [13:27:41] (03PS1) 10Máté Szabó: UserInfoCard: Cap maximum count for thanks given/received [extensions/CheckUser] (wmf/1.45.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1175899 (https://phabricator.wikimedia.org/T398354) [13:27:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1175899 (https://phabricator.wikimedia.org/T398354) (owner: 10Máté Szabó) [13:28:05] (this will take a while - i18n) [13:29:14] (03CR) 10Brouberol: [C:03+2] global_config: delete airflow external services [puppet] - 10https://gerrit.wikimedia.org/r/1175896 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:29:39] (03CR) 10Vgutierrez: [C:03+1] site.pp: add new cp2xxx hosts as insetup [puppet] - 10https://gerrit.wikimedia.org/r/1175893 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [13:30:03] jan_drewniak: did your scap print anything about applying the git patches? 
I don’t see the change applied in /srv/mediawiki-staging yet 🤔 [13:30:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P80813 and previous config saved to /var/cache/conftool/dbconfig/20250805-133007-fceratto.json [13:30:12] (maybe scap just skipped it because the change you backported only touched portals) [13:30:51] also relevant for mszabo – I made a minor change in /srv/patches, if scap complains about the patches not applying then it’s probably my fault 😅 [13:31:01] (03CR) 10Btullis: data: define ML-related user and group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175890 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:31:53] (03CR) 10Elukey: [C:03+2] site.pp: add new cp2xxx hosts as insetup [puppet] - 10https://gerrit.wikimedia.org/r/1175893 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [13:32:02] Lucas_WMDE: That might be because the portal patch is actually a git submodule, so you can see this patch in mediawiki-config https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1175895 and then this one in the portals dir https://gerrit.wikimedia.org/r/c/wikimedia/portals/deploy/+/1175148 [13:34:44] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:39:40] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [13:40:05] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#11060794 (10Papaul) @klausman hello hope all is well. 
Is it possible to give us a day and time when you will be available to help us work on those servers? Thank you. it shouldn't take more th... [13:40:19] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [13:40:19] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2331.codfw.wmnet with OS bookworm [13:40:39] (03CR) 10Brouberol: data: define ML-related user and group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175890 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:40:41] (03PS3) 10Brouberol: data: define ML-related user and group [puppet] - 10https://gerrit.wikimedia.org/r/1175890 (https://phabricator.wikimedia.org/T400902) [13:41:05] (03Merged) 10jenkins-bot: UserInfoCard: Cap maximum count for thanks given/received [extensions/CheckUser] (wmf/1.45.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1175899 (https://phabricator.wikimedia.org/T398354) (owner: 10Máté Szabó) [13:41:33] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1175899|UserInfoCard: Cap maximum count for thanks given/received (T398354)]] [13:41:36] T398354: UserInfoCard: `Thanks received` or `given` should show 1000+ if count is 1000 - https://phabricator.wikimedia.org/T398354 [13:42:57] ok, now I see the fixed patch applied \o/ [13:43:04] (03CR) 10Ottomata: "Right okay, can do." 
[alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [13:45:08] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm [13:45:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T399728)', diff saved to https://phabricator.wikimedia.org/P80814 and previous config saved to /var/cache/conftool/dbconfig/20250805-134515-fceratto.json [13:45:21] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:45:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1202.eqiad.wmnet with reason: Maintenance [13:45:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T399728)', diff saved to https://phabricator.wikimedia.org/P80815 and previous config saved to /var/cache/conftool/dbconfig/20250805-134539-fceratto.json [13:46:06] (03CR) 10Ayounsi: [C:03+2] k8s: replace legacy codfw vlans with future legacy eqiad vlans [puppet] - 10https://gerrit.wikimedia.org/r/1172001 (owner: 10Ayounsi) [13:48:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T399728)', diff saved to https://phabricator.wikimedia.org/P80816 and previous config saved to /var/cache/conftool/dbconfig/20250805-134814-fceratto.json [13:49:16] (03CR) 10Clément Goubert: [C:03+2] BGPPeers nodeSelector: remove old codfw rows, add future eqiad pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172004 (https://phabricator.wikimedia.org/T333948) (owner: 10Ayounsi) [13:49:22] (03CR) 10Vgutierrez: [C:04-1] "that won't make that specific linter check happy, in fact 8ee666cf disables the mentioned check for an alert query that only uses `up{}`" [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [13:50:34] !log 
elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm [13:51:02] (03CR) 10Ottomata: "Yeah, and trying the query with up{} still yields no results. Okay." [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [13:51:12] (03CR) 10Hashar: [C:04-1] "When I did it for gerrit I went to drop the file in the Apache document root, that was sufficient to be verified once. I deleted the file" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175850 (owner: 10Tim Starling) [13:52:10] (03PS2) 10Ottomata: HaproxyKafkaDeliveryErrors - pint disable promql/series [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) [13:52:15] (03CR) 10Btullis: [C:03+1] data: define ML-related user and group [puppet] - 10https://gerrit.wikimedia.org/r/1175890 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:52:30] (03CR) 10Brouberol: [C:03+2] data: define ML-related user and group [puppet] - 10https://gerrit.wikimedia.org/r/1175890 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:55:47] (03PS1) 10Btullis: Add more dse-k8s-worker nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1175902 (https://phabricator.wikimedia.org/T398438) [13:55:48] (03PS1) 10Btullis: Remove last references to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1175903 (https://phabricator.wikimedia.org/T398438) [13:56:29] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [13:56:53] (03Merged) 10jenkins-bot: BGPPeers nodeSelector: remove old codfw rows, add future eqiad pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172004 (https://phabricator.wikimedia.org/T333948) (owner: 10Ayounsi) [13:59:50] (03CR) 10Brouberol: [C:03+1] Add more dse-k8s-worker nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1175902 
(https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [13:59:50] (03CR) 10Ottomata: "Done" [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [14:00:00] (03CR) 10Brouberol: [C:03+1] Remove last references to snapshot servers [puppet] - 10https://gerrit.wikimedia.org/r/1175903 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [14:00:14] (03CR) 10Btullis: [C:03+2] Add more dse-k8s-worker nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1175902 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [14:01:22] !log cgoubert@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:01:25] FIRING: SystemdUnitFailed: nginx.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:02:01] (03CR) 10Brouberol: Add more dse-k8s-worker nodes to site.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175902 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [14:02:39] !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1175899|UserInfoCard: Cap maximum count for thanks given/received (T398354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:02:43] T398354: UserInfoCard: `Thanks received` or `given` should show 1000+ if count is 1000 - https://phabricator.wikimedia.org/T398354 [14:02:52] (03CR) 10Aleksandar Mastilovic: "Could you also leave a little comment explaining why it's disabled?" 
[alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [14:03:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P80817 and previous config saved to /var/cache/conftool/dbconfig/20250805-140321-fceratto.json [14:03:52] !log cgoubert@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:04:24] !log cgoubert@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:05:40] !log mszabo@deploy1003 mszabo: Continuing with sync [14:06:36] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from snapshot1011 to dse-k8s-worker1015 [14:06:40] !log cgoubert@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:06:55] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [14:08:19] !log cgoubert@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:09:37] !log cgoubert@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [14:09:59] !log cgoubert@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [14:12:09] !log cgoubert@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [14:12:46] !log cgoubert@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:13:06] !log cgoubert@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [14:13:56] !log cgoubert@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [14:14:44] !log cgoubert@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [14:14:52] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [14:15:11] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:15:49] !log cgoubert@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:16:02] btullis@cumin1003 rename (PID 701883) is awaiting input [14:16:09] !log cgoubert@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:16:25] RESOLVED: SystemdUnitFailed: nginx.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:16:31] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [14:17:40] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:17:51] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11060968 (10MatthewVernon) [14:17:53] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175899|UserInfoCard: Cap maximum count for thanks given/received (T398354)]] (duration: 36m 20s) [14:17:55] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:17:56] T398354: UserInfoCard: `Thanks received` or `given` should show 1000+ if count is 1000 - https://phabricator.wikimedia.org/T398354 [14:18:17] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:18:29] (03PS1) 10Máté Szabó: UserInfoCard: Fix UA exclusion in stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175907 [14:18:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P80818 and previous config saved to /var/cache/conftool/dbconfig/20250805-141829-fceratto.json [14:24:09] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [14:24:12] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [14:26:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T1430) [14:30:26] btullis@cumin1003 rename (PID 701883) is awaiting input [14:33:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T399728)', diff saved to https://phabricator.wikimedia.org/P80819 and previous config saved to /var/cache/conftool/dbconfig/20250805-143336-fceratto.json [14:33:40] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [14:33:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1227.eqiad.wmnet with reason: Maintenance [14:33:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T399728)', diff saved to https://phabricator.wikimedia.org/P80820 and previous config saved to /var/cache/conftool/dbconfig/20250805-143359-fceratto.json [14:34:44] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:47] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T399728)', diff saved to https://phabricator.wikimedia.org/P80821 and previous config saved to /var/cache/conftool/dbconfig/20250805-143646-fceratto.json [14:36:51] (03PS1) 10Brouberol: Create the analytics-ml-users group stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1175908 (https://phabricator.wikimedia.org/T400902) [14:37:18] (03PS2) 10Brouberol: Create the analytics-ml-users group stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1175908 (https://phabricator.wikimedia.org/T400902) [14:37:35] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175908 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [14:37:53] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [14:39:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [14:40:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:42:12] (03PS1) 10Btullis: Remove the netbox/5.65.10.in-addr.arpa zone from list of includes [dns] - 10https://gerrit.wikimedia.org/r/1175909 (https://phabricator.wikimedia.org/T398438) [14:42:21] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [14:43:01] (03PS1) 10Brouberol: airflow-ml: update principal to reflect recent change to analytics-ml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175910 (https://phabricator.wikimedia.org/T400902) [14:43:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [14:43:39] (03CR) 10Btullis: [C:03+1] Create the analytics-ml-users group stat hosts [puppet] - 
10https://gerrit.wikimedia.org/r/1175908 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [14:44:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [14:45:14] (03CR) 10Brouberol: [C:03+2] airflow-ml: update principal to reflect recent change to analytics-ml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175910 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [14:45:26] (03CR) 10Brouberol: [C:03+2] Create the analytics-ml-users group stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1175908 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [14:47:30] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbprov2007 to codfw - jhancock@cumin1003" [14:49:32] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding dbprov2007 to codfw - jhancock@cumin1003" [14:49:32] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:49:35] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [14:51:25] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P80822 and previous config saved to /var/cache/conftool/dbconfig/20250805-145153-fceratto.json [14:54:45] (03PS10) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) [14:54:53] (03CR) 10MVernon: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [14:55:52] jhancock@cumin1003 netbox (PID 706008) is awaiting input [14:56:16] (03CR) 10Scott French: [C:03+1] "Thanks, Ahmon! This looks good to me. Unless @cgoubert@wikimedia.org has additional thoughts, let me know when you're ready to merge and I" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174507 (owner: 10Ahmon Dancy) [14:57:15] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11061137 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm hey, i'm gonna close this ticket cause 2091... [14:57:18] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm [14:58:16] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for cirrussearch2089.mgmt:22 - https://phabricator.wikimedia.org/T399943#11061155 (10Jhancock.wm) update since closing T400099, still working with dell on troubleshooting issues to get parts sent to cover this. sent a reply email yesterday, hoping for par... [14:58:30] (03Abandoned) 10Brouberol: global_config: delete airflow external services [puppet] - 10https://gerrit.wikimedia.org/r/1175897 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [14:59:00] (03PS3) 10Ottomata: HaproxyKafkaDeliveryErrors - pint disable promql/series [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) [14:59:20] (03CR) 10Ottomata: "Done." [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [14:59:43] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for mc-misc2001.mgmt:22 - https://phabricator.wikimedia.org/T399494#11061160 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm resolving cause issue in T399494 is the cause of the issue. 
[15:00:05] jelto, arnoldokoth, and mutante: How many deployers does it take to do SRE Collaboration Services office hours deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T1500). [15:00:23] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [15:03:23] 10ops-eqiad, 06DC-Ops: Unresponsive management for cloudcephosd1036.mgmt:22 - https://phabricator.wikimedia.org/T401210 (10phaultfinder) 03NEW [15:03:37] (03CR) 10Dzahn: [C:03+1] add more providers to fetch_external_clouds:vendors_nets.py [puppet] - 10https://gerrit.wikimedia.org/r/1175781 (https://phabricator.wikimedia.org/T401003) (owner: 10Jelto) [15:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11061192 (10VRiley-WMF) 05Open→03Resolved [15:07:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P80823 and previous config saved to /var/cache/conftool/dbconfig/20250805-150701-fceratto.json [15:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:44] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:35] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:10:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): decommission dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T401181#11061237 (10VRiley-WMF) This unit has been 
removed [15:11:11] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm [15:11:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): decommission dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T401181#11061238 (10VRiley-WMF) 05Open→03Resolved [15:11:27] (03CR) 10Ssingh: [C:03+1] Remove the netbox/5.65.10.in-addr.arpa zone from list of includes [dns] - 10https://gerrit.wikimedia.org/r/1175909 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [15:12:32] (03CR) 10Elukey: swift configure_disks: make short, less variable IDs for JBOD disks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [15:14:23] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: phab deploy [15:14:53] (03CR) 10Ssingh: [C:03+2] Remove the netbox/5.65.10.in-addr.arpa zone from list of includes [dns] - 10https://gerrit.wikimedia.org/r/1175909 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [15:14:55] (03CR) 10Vgutierrez: [C:03+1] HaproxyKafkaDeliveryErrors - pint disable promql/series [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [15:14:57] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: phab deploy [15:17:07] !log dzahn@cumin2002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on phab.wmfusercontent.org with reason: version upgrade [15:17:48] !log sukhe@dns1004 START - running authdns-update [15:18:42] !log sukhe@dns1004 END - running authdns-update [15:19:23] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [15:19:44] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:20:22] !log brennen@deploy1003 Started deploy [phabricator/deployment@7b907e8]: deploy phab2002 for T401213 [15:20:25] T401213: Deploy Phabricator/Phorge 2025-08-05 - https://phabricator.wikimedia.org/T401213 [15:20:33] (03CR) 10Ahmon Dancy: "I'm ready any time." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174507 (owner: 10Ahmon Dancy) [15:21:04] !log brennen@deploy1003 Finished deploy [phabricator/deployment@7b907e8]: deploy phab2002 for T401213 (duration: 00m 41s) [15:21:05] (03CR) 10Aleksandar Mastilovic: [C:03+1] HaproxyKafkaDeliveryErrors - pint disable promql/series [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [15:21:59] (03CR) 10Elukey: "Hey Daniel! I think the code change is fine, IIUC the VMs are not part of a cluster to zookeeper will run in standalone mode, that is tota" [puppet] - 10https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [15:22:03] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:22:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T399728)', diff saved to https://phabricator.wikimedia.org/P80824 and previous config saved to /var/cache/conftool/dbconfig/20250805-152208-fceratto.json [15:22:12] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:22:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1253.eqiad.wmnet with reason: Maintenance [15:22:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1253 (T399728)', diff saved to https://phabricator.wikimedia.org/P80825 
and previous config saved to /var/cache/conftool/dbconfig/20250805-152232-fceratto.json [15:24:39] !log brennen@deploy1003 Started deploy [phabricator/deployment@7b907e8]: deploy phab1004 for T401213 [15:24:45] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1011 to dse-k8s-worker1015 - btullis@cumin1003" [15:25:02] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1011 to dse-k8s-worker1015 - btullis@cumin1003" [15:25:02] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:02] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1015 on all recursors [15:25:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker1015 on all recursors [15:25:06] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1015 [15:25:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T399728)', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20250805-152515-fceratto.json [15:25:20] !log brennen@deploy1003 Finished deploy [phabricator/deployment@7b907e8]: deploy phab1004 for T401213 (duration: 00m 40s) [15:26:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1015 [15:27:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from snapshot1011 to dse-k8s-worker1015 [15:27:28] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from snapshot1012 to dse-k8s-worker1016 [15:27:48] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [15:31:10] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Renaming snapshot1012 to dse-k8s-worker1016 - btullis@cumin1003" [15:32:12] (03PS1) 10Jasmine: site.pp: assign wikikube-ctrl2006 to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1175912 (https://phabricator.wikimedia.org/T400661) [15:32:18] (03CR) 10Tchanders: [C:04-1] "Thanks for this - the list of wikis looks good, matches the task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [15:34:14] btullis@cumin1003 rename (PID 711433) is awaiting input [15:35:09] (03CR) 10Scott French: [V:03+2 C:03+1] "Build locally with docker-pkg." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174507 (owner: 10Ahmon Dancy) [15:35:59] (03CR) 10Ottomata: [C:03+2] HaproxyKafkaDeliveryErrors - pint disable promql/series [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [15:36:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1012 to dse-k8s-worker1016 - btullis@cumin1003" [15:36:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:36:41] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1016 on all recursors [15:36:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker1016 on all recursors [15:36:45] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1016 [15:36:47] (03CR) 10Scott French: [V:03+2 C:03+2] python-build/bookworm/Dockerfile.template: Modernize [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174507 (owner: 10Ahmon Dancy) [15:37:25] (03Merged) 10jenkins-bot: HaproxyKafkaDeliveryErrors - pint disable promql/series [alerts] - 10https://gerrit.wikimedia.org/r/1175565 
(https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [15:38:02] (03CR) 10Clément Goubert: [C:03+1] site.pp: assign wikikube-ctrl2006 to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1175912 (https://phabricator.wikimedia.org/T400661) (owner: 10Jasmine) [15:38:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): decommission snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T401182#11061406 (10VRiley-WMF) [15:38:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): decommission snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T401182#11061409 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF This server has been decommed [15:39:51] btullis@cumin1003 rename (PID 711433) is awaiting input [15:40:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P80826 and previous config saved to /var/cache/conftool/dbconfig/20250805-154023-fceratto.json [15:40:31] (03CR) 10Scott French: [V:03+2 C:03+2] "All done:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174507 (owner: 10Ahmon Dancy) [15:41:00] (03CR) 10Ahmon Dancy: "Thanks!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174507 (owner: 10Ahmon Dancy) [15:43:05] (03CR) 10Ottomata: [C:03+1] UserInfoCard: Fix UA exclusion in stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175907 (owner: 10Máté Szabó) [15:48:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1016 [15:49:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from snapshot1012 to dse-k8s-worker1016 [15:50:04] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175100 (owner: 10PipelineBot) [15:50:30] (03PS1) 10Cwhite: docs: remove references to permissive schema [software/ecs] - 10https://gerrit.wikimedia.org/r/1175913 [15:51:05] (03PS1) 10Elukey: installserver: fix preseed for new cp2xxx nodes [puppet] - 10https://gerrit.wikimedia.org/r/1175914 (https://phabricator.wikimedia.org/T392851) [15:51:25] FIRING: SystemdUnitFailed: nginx.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:33] (03CR) 10Cwhite: [C:03+2] docs: remove references to permissive schema [software/ecs] - 10https://gerrit.wikimedia.org/r/1175913 (owner: 10Cwhite) [15:51:46] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175100 (owner: 10PipelineBot) [15:51:58] (03Merged) 10jenkins-bot: docs: remove references to permissive schema [software/ecs] - 10https://gerrit.wikimedia.org/r/1175913 (owner: 10Cwhite) [15:52:25] (03CR) 10Ssingh: [C:03+1] installserver: fix preseed for new cp2xxx nodes [puppet] - 10https://gerrit.wikimedia.org/r/1175914 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [15:54:20] (03PS1) 10Dzahn: phabricator: increase APCu 
shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 [15:54:50] (03CR) 10CI reject: [V:04-1] phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (owner: 10Dzahn) [15:55:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P80827 and previous config saved to /var/cache/conftool/dbconfig/20250805-155530-fceratto.json [15:56:17] (03CR) 10Elukey: [C:03+2] installserver: fix preseed for new cp2xxx nodes [puppet] - 10https://gerrit.wikimedia.org/r/1175914 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [15:56:25] RESOLVED: SystemdUnitFailed: nginx.service on urldownloader1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:59:53] (03PS2) 10Dzahn: phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 [16:00:05] jhathaway and moritzm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:21] (03CR) 10CI reject: [V:04-1] phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (owner: 10Dzahn) [16:00:46] (03CR) 10MVernon: swift configure_disks: make short, less variable IDs for JBOD disks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175874 (https://phabricator.wikimedia.org/T401127) (owner: 10MVernon) [16:01:13] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm [16:02:26] jouncebot: nowandnext [16:02:26] For the next 0 hour(s) and 57 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T1600) [16:02:26] In 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T1700) [16:02:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175907 (owner: 10Máté Szabó) [16:03:36] (03Merged) 10jenkins-bot: UserInfoCard: Fix UA exclusion in stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175907 (owner: 10Máté Szabó) [16:04:00] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1175907|UserInfoCard: Fix UA exclusion in stream config]] [16:06:31] (03PS4) 10Krinkle: mediawiki: install php8.1-xhprof on beta cluster and mwdebug [puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) [16:06:34] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [16:07:58] !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1175907|UserInfoCard: Fix UA exclusion in stream config]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:08:29] !log mszabo@deploy1003 mszabo: Continuing with sync [16:09:10] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from snapshot1013 to dse-k8s-worker1017 [16:09:30] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [16:10:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T399728)', diff saved to https://phabricator.wikimedia.org/P80828 and previous config saved to /var/cache/conftool/dbconfig/20250805-161038-fceratto.json [16:10:42] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:10:54] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [16:13:09] (03CR) 10Krinkle: "https://puppet-compiler.wmflabs.org/output/1175623/4651/mwdebug1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [16:14:44] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:15:15] btullis@cumin1003 rename (PID 715138) is awaiting input [16:15:16] (03CR) 10CDanis: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1175552 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn) [16:15:35] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175907|UserInfoCard: Fix UA exclusion in stream config]] (duration: 11m 34s) [16:16:59] (03CR) 10CDanis: [C:03+1] P:toolforge::legacy_redirector: Add NEL headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175480 (https://phabricator.wikimedia.org/T400994) (owner: 10Majavah) [16:17:00] (03PS1) 10Sergio Gimeno: [Growth] beta: enable new leveling up notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175919 (https://phabricator.wikimedia.org/T400118) [16:18:44] (03CR) 10Cparle: [C:03+1] image-suggestion: reconfigure for data-gateway listener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171702 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [16:19:01] (03PS2) 10Majavah: P:toolforge::legacy_redirector: Add NEL headers [puppet] - 10https://gerrit.wikimedia.org/r/1175480 (https://phabricator.wikimedia.org/T400994) [16:25:25] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1013 to dse-k8s-worker1017 - btullis@cumin1003" [16:25:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1013 to dse-k8s-worker1017 - btullis@cumin1003" [16:25:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:25:42] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1017 on all recursors [16:25:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker1017 on all recursors [16:25:46] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1017 [16:27:03] !log elukey@cumin1003 END (ERROR) - 
Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm [16:28:52] btullis@cumin1003 rename (PID 715138) is awaiting input [16:30:00] 10ops-codfw, 06DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225 (10Jhancock.wm) 03NEW [16:30:40] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1017 [16:31:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from snapshot1013 to dse-k8s-worker1017 [16:32:11] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:32:18] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:32:26] (03PS1) 10BBlack: Remove privs and SSH for user joanna [puppet] - 10https://gerrit.wikimedia.org/r/1175920 [16:33:00] (03CR) 10Thcipriani: add zoekt from upstream and blubber builder config to build it (031 comment) [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [16:33:55] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from snapshot1015 to dse-k8s-worker1018 [16:33:59] (03CR) 10BBlack: [C:03+2] Remove privs and SSH for user joanna [puppet] - 10https://gerrit.wikimedia.org/r/1175920 (owner: 10BBlack) [16:34:15] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [16:34:16] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:34:46] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:39:55] (03CR) 10Btullis: [C:03+2] Update the dashboard for the dumps cephfs volume [alerts] - 10https://gerrit.wikimedia.org/r/1175515 (https://phabricator.wikimedia.org/T401098) (owner: 10Btullis) [16:40:11] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming 
snapshot1015 to dse-k8s-worker1018 - btullis@cumin1003" [16:41:22] (03Merged) 10jenkins-bot: Update the dashboard for the dumps cephfs volume [alerts] - 10https://gerrit.wikimedia.org/r/1175515 (https://phabricator.wikimedia.org/T401098) (owner: 10Btullis) [16:42:01] (03CR) 10Scott French: "Thanks, Timo! One issue, but otherwise I think this should do the trick." [puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [16:43:15] btullis@cumin1003 rename (PID 718396) is awaiting input [16:44:07] (03PS5) 10Krinkle: mediawiki: install php8.1-xhprof on beta cluster and mwdebug [puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) [16:44:16] (03CR) 10Krinkle: mediawiki: install php8.1-xhprof on beta cluster and mwdebug (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [16:44:38] !log bblack@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jobo out of all services on: 2396 hosts [16:45:35] (03CR) 10Btullis: opensearch-operator: Add chart for review (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [16:45:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming snapshot1015 to dse-k8s-worker1018 - btullis@cumin1003" [16:45:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:45:48] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache dse-k8s-worker1018 on all recursors [16:45:51] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-worker1018 on all recursors [16:45:52] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1018 [16:47:00] (03CR) 10TrainBranchBot: 
[C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175620 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [16:47:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1018 [16:47:30] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [16:47:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from snapshot1015 to dse-k8s-worker1018 [16:48:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1167.eqiad.wmnet with reason: Maintenance [16:48:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:49:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T399728)', diff saved to https://phabricator.wikimedia.org/P80829 and previous config saved to /var/cache/conftool/dbconfig/20250805-164902-fceratto.json [16:49:06] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:50:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:53:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T399728)', diff saved to https://phabricator.wikimedia.org/P80830 and previous config saved to /var/cache/conftool/dbconfig/20250805-165312-fceratto.json [16:53:30] (03Merged) 10jenkins-bot: Profiler: Add php-xhprof support besides php-tideways_xhprof 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175620 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [16:53:54] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1175620|Profiler: Add php-xhprof support besides php-tideways_xhprof (T401152)]] [16:53:57] T401152: Switch wmf-config/Profiler from Tideways to XHProf - https://phabricator.wikimedia.org/T401152 [16:55:40] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1175620|Profiler: Add php-xhprof support besides php-tideways_xhprof (T401152)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:56:05] !log bblack@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jobo out of all services on: 2395 hosts [16:57:41] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11061851 (10Papaul) Did some testing again today because all the connection from the Juniper spines to the Nokia switches were not coming up after enabling them on both ends. 1 - replace the... [16:58:45] (03CR) 10Scott French: [C:03+1] "Thanks, Timo!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [16:59:51] !log krinkle@deploy1003 krinkle: Continuing with sync [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T1700) [17:00:16] (03PS1) 10DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175922 (https://phabricator.wikimedia.org/T390007) [17:02:07] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:02:10] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [17:03:56] FYI, we'll be using the infra window to deploy a follow-on change related to the ongoing backport [17:05:09] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175620|Profiler: Add php-xhprof support besides php-tideways_xhprof (T401152)]] (duration: 11m 15s) [17:05:13] T401152: Switch wmf-config/Profiler from Tideways to XHProf - https://phabricator.wikimedia.org/T401152 [17:05:14] (03CR) 10CDanis: [C:03+2] Enable profile::auto_restarts::service for hiddenparma [puppet] - 10https://gerrit.wikimedia.org/r/1092195 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:05:19] (03CR) 10DDesouza: [C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175922 (https://phabricator.wikimedia.org/T390007) (owner: 10DDesouza) [17:05:23] swfrench-wmf: all yours. [17:05:55] Krinkle: great, thanks! 
I'll ping you when k8s mw-debug is ready for testing [17:07:02] !log swfrench@deploy1003 Started scap sync-world: Migrate debug and cli images to xhprof - T401152 [17:07:43] (03Merged) 10jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175922 (https://phabricator.wikimedia.org/T390007) (owner: 10DDesouza) [17:07:55] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [17:07:57] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:07:59] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:08:01] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:08:02] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:08:05] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:08:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P80831 and previous config saved to /var/cache/conftool/dbconfig/20250805-170820-fceratto.json [17:08:24] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [17:08:26] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:08:27] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:08:29] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:08:30] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:08:33] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:08:40] !log swfrench@deploy1003 swfrench: Migrate debug and cli images to xhprof - T401152 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:08:57] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [17:08:59] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:09:00] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:09:02] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:09:04] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:09:06] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:09:40] Krinkle: if you might be able to kick the tires on profiling in mw-debug, that would be swell [17:09:44] RESOLVED: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:10:03] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [17:10:05] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:10:06] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:10:08] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:10:09] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:10:12] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:10:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:10:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:11:56] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [17:11:58] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:11:59] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:12:01] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:12:02] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:12:05] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:14:03] (03CR) 10Dr0ptp4kt: [C:03+1] "Looping Andrew here as well." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175907 (owner: 10Máté Szabó) [17:14:05] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:14:08] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [17:14:25] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:14:26] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:14:43] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [17:14:45] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:14:46] !log dani@deploy1003 helmfile [eqiad] START 
helmfile.d/services/miscweb: apply [17:14:48] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:14:48] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:14:49] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:14:49] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:14:52] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:15:22] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:15:24] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:16:06] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:19:49] !log bblack@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jobo out of all services on: 2395 hosts [17:21:38] swfrench-wmf: checking.. 
[17:23:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P80832 and previous config saved to /var/cache/conftool/dbconfig/20250805-172327-fceratto.json [17:27:13] swfrench-wmf: LGTM :) [17:27:26] Krinkle: amazing, thank you! [17:27:29] !log swfrench@deploy1003 swfrench: Continuing with sync [17:28:43] !log swfrench@deploy1003 Finished scap sync-world: Migrate debug and cli images to xhprof - T401152 (duration: 22m 02s) [17:28:46] T401152: Switch wmf-config/Profiler from Tideways to XHProf - https://phabricator.wikimedia.org/T401152 [17:31:40] (03PS1) 10Aleksandar Mastilovic: Add collation to the list of sqooped table [puppet] - 10https://gerrit.wikimedia.org/r/1175924 (https://phabricator.wikimedia.org/T397923) [17:33:04] !log krinkle@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [17:33:11] (03PS9) 10Dzahn: add zoekt from upstream and blubber builder config to build it [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) [17:33:26] (03CR) 10Dzahn: add zoekt from upstream and blubber builder config to build it (031 comment) [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [17:34:26] (03CR) 10CI reject: [V:04-1] add zoekt from upstream and blubber builder config to build it [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [17:34:44] (03CR) 10Cwhite: [C:03+2] Update nsca_frack.cfg.erb remove frdb1003, add frdb1007 [puppet] - 10https://gerrit.wikimedia.org/r/1175886 (https://phabricator.wikimedia.org/T369922) (owner: 10Jgreen) [17:35:14] !log krinkle@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [17:37:21] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [17:37:23] !log 
dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:37:24] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:37:26] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:37:27] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:37:30] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:38:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T399728)', diff saved to https://phabricator.wikimedia.org/P80833 and previous config saved to /var/cache/conftool/dbconfig/20250805-173835-fceratto.json [17:38:39] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [17:38:51] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [17:40:24] (03CR) 10Scott French: [C:03+2] mediawiki: install php8.1-xhprof on beta cluster and mwdebug [puppet] - 10https://gerrit.wikimedia.org/r/1175623 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [17:42:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1172.eqiad.wmnet with reason: Maintenance [17:42:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T399728)', diff saved to https://phabricator.wikimedia.org/P80834 and previous config saved to /var/cache/conftool/dbconfig/20250805-174219-fceratto.json [17:42:34] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1024.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250714/ using stat1009.eqiad.wmnet) [17:45:06] (03PS1) 10Herron: pyrra: logstash-requests-pilot add slo_revision label [puppet] - 10https://gerrit.wikimedia.org/r/1175927 
[17:45:43] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238 (10Jclark-ctr) 03NEW [17:46:13] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11062046 (10Jclark-ctr) [17:46:15] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11062047 (10Jclark-ctr) [17:46:18] (03PS1) 10Clare Ming: Alertmanager: add receiver and routing for experiment-platform tasks [puppet] - 10https://gerrit.wikimedia.org/r/1175928 (https://phabricator.wikimedia.org/T398422) [17:47:05] (03CR) 10Marco Fossati: "Gotcha, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171702 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [17:47:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T399728)', diff saved to https://phabricator.wikimedia.org/P80835 and previous config saved to /var/cache/conftool/dbconfig/20250805-174734-fceratto.json [17:47:38] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [17:47:59] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d8-eqiad - https://phabricator.wikimedia.org/T401240 (10Jclark-ctr) 03NEW [17:48:18] (03CR) 10Jasmine: [C:03+2] site.pp: assign wikikube-ctrl2006 to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1175912 (https://phabricator.wikimedia.org/T400661) (owner: 10Jasmine) [17:48:33] (03CR) 10Herron: [C:03+2] pyrra: logstash-requests-pilot add slo_revision label [puppet] - 10https://gerrit.wikimedia.org/r/1175927 (owner: 10Herron) [17:48:44] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d8-eqiad - https://phabricator.wikimedia.org/T401240#11062097 (10Jclark-ctr) p:05Triage→03Medium 
[17:49:06] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d8-eqiad - https://phabricator.wikimedia.org/T401240#11062101 (10Jclark-ctr) [17:49:09] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11062102 (10Jclark-ctr) [17:49:18] (03CR) 10Clare Ming: mw::maintenance: ExperimentationLab periodic job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [17:50:18] jasmine_: ok to puppet merge? [17:50:33] feel free to multiple mine too if you're already there [17:50:52] was just going to ask the same, yes pls proceed :) [17:53:13] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11062144 (10Jclark-ctr) | **Device A** | **Port A** | **Device B** | **Port B** | **Cable Type** | **Notes**... [17:55:45] jasmine_: done! [17:56:52] ty! [18:00:04] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T1800) [18:02:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P80836 and previous config saved to /var/cache/conftool/dbconfig/20250805-180241-fceratto.json [18:03:00] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11062198 (10jasmine_) >>! In T400661#11041473, @RobH wrote: > @jasimine_, > > > Please update the site.pp file with the insetup role for your team (detaile... [18:03:31] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11062199 (10jasmine_) a:05jasmine_→03None [18:11:09] o/ nothing for this window. 
[18:16:26] !log dancy@deploy1003 Installing scap version "4.196.0" for 2 host(s) [18:17:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P80837 and previous config saved to /var/cache/conftool/dbconfig/20250805-181749-fceratto.json [18:18:03] (03PS1) 10Andrew Bogott: Trivial profile for a cloud-vps-hosted chartmuseum instance [puppet] - 10https://gerrit.wikimedia.org/r/1175932 (https://phabricator.wikimedia.org/T393782) [18:18:12] !log dancy@deploy1003 Installation of scap version "4.196.0" completed for 2 hosts [18:18:29] (03CR) 10CI reject: [V:04-1] Trivial profile for a cloud-vps-hosted chartmuseum instance [puppet] - 10https://gerrit.wikimedia.org/r/1175932 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [18:20:49] swfrench-wmf: ok to rollout the cleanup step now? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1175621 [18:21:01] (03PS2) 10Andrew Bogott: Trivial profile for a cloud-vps-hosted chartmuseum instance [puppet] - 10https://gerrit.wikimedia.org/r/1175932 (https://phabricator.wikimedia.org/T393782) [18:22:22] Krinkle: I just deployed a scap update which may add a 10 minute delay to your deployment. I can roll back the scap update first if needed. [18:23:19] Krinkle: I can't think of a reason why not, no [18:23:43] (03CR) 10Andrew Bogott: [C:03+2] Trivial profile for a cloud-vps-hosted chartmuseum instance [puppet] - 10https://gerrit.wikimedia.org/r/1175932 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [18:23:51] dancy: np, whenever you're done, no rush [18:24:06] OK. Lemme run a command then I'll ping you. 
[18:25:15] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [18:27:15] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs2007.codfw.wmnet w/ force delete existing files, repooling both afterwards [18:27:19] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [18:29:13] (03PS1) 10Dzahn: phabricator: block some scrapers and bots at apache level [puppet] - 10https://gerrit.wikimedia.org/r/1175933 [18:29:43] (03CR) 10CI reject: [V:04-1] phabricator: block some scrapers and bots at apache level [puppet] - 10https://gerrit.wikimedia.org/r/1175933 (owner: 10Dzahn) [18:32:30] (03PS2) 10Dzahn: phabricator: block some scrapers and bots at apache level [puppet] - 10https://gerrit.wikimedia.org/r/1175933 [18:32:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T399728)', diff saved to https://phabricator.wikimedia.org/P80838 and previous config saved to /var/cache/conftool/dbconfig/20250805-183256-fceratto.json [18:33:00] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [18:33:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1177.eqiad.wmnet with reason: Maintenance [18:33:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T399728)', diff saved to https://phabricator.wikimedia.org/P80839 and previous config saved to /var/cache/conftool/dbconfig/20250805-183319-fceratto.json [18:34:49] (03CR) 10Dzahn: [C:04-1] "some of these seem more legit than others - discussion needed" [puppet] - 10https://gerrit.wikimedia.org/r/1175933 (owner: 10Dzahn) [18:35:40] !log dani@deploy1003 helmfile [staging] START 
helmfile.d/services/miscweb: apply [18:35:42] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:35:44] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:35:45] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [18:35:47] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [18:35:49] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:35:49] (03CR) 10Thcipriani: add zoekt from upstream and blubber builder config to build it (031 comment) [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [18:35:54] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 10m 39s) [18:37:00] !log dancy@deploy1003 Started scap sync-world: testing T398875 [18:37:03] T398875: Publish updated wmf/next container when deploying config backport or security patch - https://phabricator.wikimedia.org/T398875 [18:37:34] (03PS3) 10Dzahn: phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 [18:38:00] (03CR) 10CI reject: [V:04-1] phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (owner: 10Dzahn) [18:38:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T399728)', diff saved to https://phabricator.wikimedia.org/P80840 and previous config saved to /var/cache/conftool/dbconfig/20250805-183824-fceratto.json [18:38:28] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [18:39:41] (03PS10) 10Dzahn: add zoekt from upstream and blubber builder config to build it [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) [18:39:51] FIRING: [2x] 
CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:39:54] !log dancy@deploy1003 Finished scap sync-world: testing T398875 (duration: 02m 54s) [18:41:01] (03PS1) 10Andrew Bogott: profile::wmcs::chartmuseum: also install helm [puppet] - 10https://gerrit.wikimedia.org/r/1175936 [18:41:27] (03PS2) 10Andrew Bogott: profile::wmcs::chartmuseum: also install helm [puppet] - 10https://gerrit.wikimedia.org/r/1175936 [18:41:35] Krinkle: I'm out of the way. [18:41:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:42:18] (03PS4) 10Dzahn: phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 [18:44:14] (03CR) 10Andrew Bogott: [C:03+2] profile::wmcs::chartmuseum: also install helm [puppet] - 10https://gerrit.wikimedia.org/r/1175936 (owner: 10Andrew Bogott) [18:44:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175621 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [18:44:46] (03CR) 10CI reject: [V:04-1] Profiler: Remove support for php-tideways_xhprof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175621 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [18:44:58] (03PS3) 10Krinkle: Profiler: Remove support for php-tideways_xhprof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175621 (https://phabricator.wikimedia.org/T401152) [18:45:04] (03CR) 10TrainBranchBot: "Approved by krinkle@deploy1003 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175621 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [18:45:54] (03Merged) 10jenkins-bot: Profiler: Remove support for php-tideways_xhprof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175621 (https://phabricator.wikimedia.org/T401152) (owner: 10Krinkle) [18:46:17] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1175621|Profiler: Remove support for php-tideways_xhprof (T401152)]] [18:46:23] T401152: Switch wmf-config/Profiler from Tideways to XHProf - https://phabricator.wikimedia.org/T401152 [18:47:24] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [18:48:04] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1175621|Profiler: Remove support for php-tideways_xhprof (T401152)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:50:07] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:50:26] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dbprov2007 [18:50:37] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dbprov2007 [18:50:53] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host dbprov2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:51:18] (03CR) 10Dzahn: "something is happening now :) thanks!" 
[container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [18:51:40] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:52:50] (03CR) 10Dzahn: "if you use "check experimental" you'll probably want to add some Hosts: headers too or it tries to compile on every single machine" [puppet] - 10https://gerrit.wikimedia.org/r/1172625 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [18:53:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P80841 and previous config saved to /var/cache/conftool/dbconfig/20250805-185332-fceratto.json [18:54:56] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:55:54] !log krinkle@deploy1003 krinkle: Continuing with sync [18:56:32] brouberol: the git_pull_charts service on deploy1003 is stuck because of local changes in helmfile.d/dse-k8s-services/airflow-ml/values-production.yaml -- is that you? :) [18:57:38] (03CR) 10Dzahn: [V:04-1 C:04-1] "Php::Extension[apcu]: has no parameter named 'shm_size'" [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (owner: 10Dzahn) [18:58:24] inflatador, ryankemper: about 9 PM for brouberol, do you happen to know anything? ^ [18:58:48] the charts repo hasn't updated for about four hours, if I don't hear anything back I'll revert the local change [19:00:11] inflatador: you around to take a look? 
I’m out w/ the dog for next 40’ [19:01:11] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175621|Profiler: Remove support for php-tideways_xhprof (T401152)]] (duration: 14m 54s) [19:01:14] T401152: Switch wmf-config/Profiler from Tideways to XHProf - https://phabricator.wikimedia.org/T401152 [19:01:20] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host dbprov2007.codfw.wmnet with OS bookworm [19:01:29] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11062361 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host dbprov2007.codfw.wmnet with OS bookworm [19:01:31] rzl: just got back to my desk [19:01:56] feel free to revert the change [19:02:13] doing it, thanks [19:04:00] !log rzl@deploy1003:/srv/deployment-charts$ sudo git restore helmfile.d/dse-k8s-services/airflow-ml/values-production.yaml # discarding local changes to unblock the minutely git pull [19:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:31] next pull succeeded, thanks [19:06:25] RESOLVED: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:08:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P80842 and previous config saved to /var/cache/conftool/dbconfig/20250805-190840-fceratto.json [19:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:16:37] rzl: let me check [19:17:21] sorry about 
that. Rule of thumb, if one of my diffs is blocking anything, stash/checkout -f it from orbit [19:17:24] brouberol: ah thanks -- after getting the okay from inflatador I reverted your local changes so I don't need anything else from you this evening, but let me know if I can help [19:17:45] was sit something related to airflow-ml ? [19:17:48] *was it [19:18:07] oh, right, I just saw your !log. [19:18:09] I don't have the exact diff anymore but yes [19:18:15] Sorry again, that's on me [19:18:28] it's been a long day that followed a very short night [19:18:34] that's okay, thanks for the response :) [19:20:31] jhancock@cumin1003 reimage (PID 736158) is awaiting input [19:22:42] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2008.codfw.wmnet w/ force delete existing files, repooling both afterwards [19:22:46] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [19:23:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T399728)', diff saved to https://phabricator.wikimedia.org/P80843 and previous config saved to /var/cache/conftool/dbconfig/20250805-192347-fceratto.json [19:23:51] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [19:24:03] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1178.eqiad.wmnet with reason: Maintenance [19:24:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T399728)', diff saved to https://phabricator.wikimedia.org/P80844 and previous config saved to /var/cache/conftool/dbconfig/20250805-192410-fceratto.json [19:30:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T399728)', diff saved to 
https://phabricator.wikimedia.org/P80845 and previous config saved to /var/cache/conftool/dbconfig/20250805-193016-fceratto.json [19:30:21] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [19:31:48] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11062464 (10VRiley-WMF) @Jclark-ctr The cable lengths you listed work for me. We can certainly use that as a template for the rest of the cables. You mentioned that these already have been ord... [19:36:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: msw1-eqiad: cable me0 dedicated mgmt port directly to the switch itself - https://phabricator.wikimedia.org/T400161#11062481 (10VRiley-WMF) @cmooney I wanted to verify this with you, if I'm understanding this correct, You want the man... [19:39:27] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [19:40:02] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11062484 (10Jclark-ctr) a:03Jclark-ctr [19:45:10] jclark@cumin1002 netbox (PID 2291069) is awaiting input [19:45:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P80846 and previous config saved to /var/cache/conftool/dbconfig/20250805-194524-fceratto.json [19:47:33] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11062498 (10Jclark-ctr) @VRiley-WMF I’ve ordered a bulk amount of cables, but we should have a rough idea of what’s actually needed, as we had to get something ordered in order to proceed. Th... 
[19:47:45] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for dbprov1007 - jclark@cumin1002" [19:47:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for dbprov1007 - jclark@cumin1002" [19:47:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:49:08] !log [gitlab2002:~] $ sudo systemctl start wmf_auto_restart_ssh-gitlab T401191 [19:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:12] T401191: SystemdUnitFailed - wmf_auto_restart_ssh-gitlab.service on gitlab2002 - https://phabricator.wikimedia.org/T401191 [19:49:21] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host dbprov1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:56:09] (03CR) 10Dzahn: [C:03+1] "this is ready to go per latest comments on ticket and slack" [puppet] - 10https://gerrit.wikimedia.org/r/1175151 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [19:56:26] (03CR) 10Dzahn: [C:03+1] "let me rebase it to remove the merge conflict" [puppet] - 10https://gerrit.wikimedia.org/r/1175151 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [19:56:32] (03CR) 10BCornwall: [C:04-2] "The following domains are not returning NS records despite them being set in MarkMonitor, so this shouldn't be merged yet." 
[dns] - 10https://gerrit.wikimedia.org/r/1175587 (owner: 10Ncmonitor) [19:58:27] (03PS3) 10Dzahn: admin: remove access for thiemowmde [puppet] - 10https://gerrit.wikimedia.org/r/1175151 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [19:58:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11062549 (10Jclark-ctr) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T2000). nyaa~ [20:00:05] theproton and musikanimal: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P80847 and previous config saved to /var/cache/conftool/dbconfig/20250805-200031-fceratto.json [20:00:59] (03CR) 10Dzahn: [C:03+2] admin: remove access for thiemowmde [puppet] - 10https://gerrit.wikimedia.org/r/1175151 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [20:01:45] jclark@cumin1002 provision (PID 2299652) is awaiting input [20:04:55] I'm around [20:05:53] theproton: I can deploy your change [20:06:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174583 (https://phabricator.wikimedia.org/T400281) (owner: 10Theprotonade) [20:07:01] (03Merged) 10jenkins-bot: Enable bulk OCR on beta wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174583 (https://phabricator.wikimedia.org/T400281) (owner: 10Theprotonade) [20:08:26] theproton: Done (beta-only changes) [20:10:59] theproton: The beta-side deployment is in progress. 
[20:11:21] okay sure [20:14:07] (03PS1) 10Aaron Schulz: Add restbase spec JSON files to which /rest_v1/?spec can be routed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175942 (https://phabricator.wikimedia.org/T397203) [20:15:33] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11062608 (10VRiley-WMF) 05Resolved→03Open [20:15:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T399728)', diff saved to https://phabricator.wikimedia.org/P80848 and previous config saved to /var/cache/conftool/dbconfig/20250805-201539-fceratto.json [20:15:43] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [20:15:54] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1192.eqiad.wmnet with reason: Maintenance [20:16:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T399728)', diff saved to https://phabricator.wikimedia.org/P80849 and previous config saved to /var/cache/conftool/dbconfig/20250805-201601-fceratto.json [20:17:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11062614 (10Jclark-ctr) [20:19:05] theproton: Should be live in beta by now. 
[20:19:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:20:29] (03PS1) 10Andrew Bogott: profile::wmcs::chartmuseum: reposition chartmuseum repo files in /srv [puppet] - 10https://gerrit.wikimedia.org/r/1175943 [20:20:51] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2008.codfw.wmnet w/ force delete existing files, repooling both afterwards [20:20:54] (03CR) 10CI reject: [V:04-1] profile::wmcs::chartmuseum: reposition chartmuseum repo files in /srv [puppet] - 10https://gerrit.wikimedia.org/r/1175943 (owner: 10Andrew Bogott) [20:20:54] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [20:21:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T399728)', diff saved to https://phabricator.wikimedia.org/P80850 and previous config saved to /var/cache/conftool/dbconfig/20250805-202104-fceratto.json [20:21:08] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [20:22:05] (03PS2) 10Andrew Bogott: profile::wmcs::chartmuseum: reposition chartmuseum repo files in /srv [puppet] - 10https://gerrit.wikimedia.org/r/1175943 [20:25:06] (03CR) 10Andrew Bogott: [C:03+2] profile::wmcs::chartmuseum: reposition chartmuseum repo files in /srv [puppet] - 10https://gerrit.wikimedia.org/r/1175943 (owner: 10Andrew Bogott) [20:25:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: msw1-eqiad: cable me0 dedicated mgmt port directly to the switch itself - https://phabricator.wikimedia.org/T400161#11062624 (10ayounsi) That's correct, and 0/0/0 is ready. 
[20:27:37] (03Abandoned) 10Clare Ming: Temporarily add config var back in for group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175561 (https://phabricator.wikimedia.org/T401135) (owner: 10Clare Ming) [20:35:41] !log starting cluster mutation test on relforge*.eqiad.wmnet servers [20:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P80851 and previous config saved to /var/cache/conftool/dbconfig/20250805-203612-fceratto.json [20:40:08] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov2007.codfw.wmnet with OS bookworm [20:40:21] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov2007 - https://phabricator.wikimedia.org/T400402#11062667 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host dbprov2007.codfw.wmnet with OS bookworm executed with errors: - dbprov20... 
[20:43:28] (03CR) 10Bking: opensearch-operator: Add chart for review (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:44:12] FIRING: [4x] SystemdUnitFailed: opensearch-disable-readahead-relforge-eqiad.service on relforge1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:44:13] 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new JBOD disk controllers into SM swift backends - https://phabricator.wikimedia.org/T400878#11062670 (10Jclark-ctr) [20:44:51] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11062671 (10wiki_willy) a:03VRiley-WMF [20:46:09] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for cloudcephosd1036.mgmt:22 - https://phabricator.wikimedia.org/T401210#11062674 (10wiki_willy) a:03VRiley-WMF [20:46:44] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2010.codfw.wmnet w/ force delete existing files, repooling both afterwards [20:46:48] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [20:48:42] dancy: sorry I missed the ping. Are you still around to do deploys? [20:48:56] Sure. [20:49:01] awesome!! 
[20:49:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175581 (https://phabricator.wikimedia.org/T373711) (owner: 10MusikAnimal) [20:50:23] (03Merged) 10jenkins-bot: beta: use CodeMirror instead of CodeEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175581 (https://phabricator.wikimedia.org/T373711) (owner: 10MusikAnimal) [20:51:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P80852 and previous config saved to /var/cache/conftool/dbconfig/20250805-205119-fceratto.json [20:52:06] musikanimal: The production side is done. The beta stuff will deploy via jenkins as usual. [20:52:47] production side? the patch should only affect the beta cluster [20:53:33] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11062685 (10wiki_willy) a:03VRiley-WMF [20:54:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad.service.service on relforge1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:55:41] dancy: pardon my ignorance! so the changes should be live on Beta in up to 10 minutes or so, like normal as things are merged into master? I was expecting to be able to test via WikimediaDebug first [20:56:21] Your change only had beta config changes, so it had no effect on prod (which includes testwikis) [20:56:32] okay good, hehe :) [20:56:34] scap skips production sync in that case. [20:56:51] (based on the files that are in the commit, not the content of the changes) [20:58:35] alright, and I see it is live on beta now and working as intended. Thanks!!
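The rule described in the exchange above (the production sync is skipped based on which files are in the commit, not on what the changes contain) can be sketched roughly as follows. This is not scap's actual code: the `-labs` filename convention, the example paths, and both function names are assumptions made purely for illustration.

```python
from pathlib import PurePosixPath

# Assumption for illustration: files whose name (without extension) ends in
# "-labs" are treated as beta-only configuration.
BETA_ONLY_SUFFIX = "-labs"


def is_beta_only(path: str) -> bool:
    """Return True if the file name alone marks the file as beta-only."""
    return PurePosixPath(path).stem.endswith(BETA_ONLY_SUFFIX)


def needs_production_sync(changed_files: list[str]) -> bool:
    """Decide from file paths only, never from file contents.

    A production sync is needed as soon as one changed file is not beta-only.
    """
    return any(not is_beta_only(p) for p in changed_files)


if __name__ == "__main__":
    # Hypothetical commits, not taken from the log.
    beta_commit = ["wmf-config/InitialiseSettings-labs.php"]
    mixed_commit = ["wmf-config/InitialiseSettings-labs.php",
                    "wmf-config/InitialiseSettings.php"]
    print(needs_production_sync(beta_commit))
    print(needs_production_sync(mixed_commit))
```

Under this sketch a commit touching only beta-only files never triggers a production sync, which matches the "no effect on prod" outcome reported in the conversation.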
[21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250805T2100) [21:01:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:04:12] RESOLVED: SystemdUnitFailed: opensearch_1@relforge-eqiad.service.service on relforge1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:06:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T399728)', diff saved to https://phabricator.wikimedia.org/P80853 and previous config saved to /var/cache/conftool/dbconfig/20250805-210627-fceratto.json [21:06:31] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [21:06:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1203.eqiad.wmnet with reason: Maintenance [21:06:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T399728)', diff saved to https://phabricator.wikimedia.org/P80854 and previous config saved to /var/cache/conftool/dbconfig/20250805-210649-fceratto.json [21:07:44] (03CR) 10BCornwall: "MM rep says that the NS servers must respond to requests by the TLD registries, so we need to include these first and then the NS validati" [dns] 
- 10https://gerrit.wikimedia.org/r/1175587 (owner: 10Ncmonitor) [21:07:56] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1007.eqiad.wmnet with OS bookworm [21:08:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11062727 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm [21:08:27] (03PS1) 10Scott French: php8.1: rebuild to pick up 8.1.33-1+wmf11u2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1175951 (https://phabricator.wikimedia.org/T383047) [21:11:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T399728)', diff saved to https://phabricator.wikimedia.org/P80855 and previous config saved to /var/cache/conftool/dbconfig/20250805-211153-fceratto.json [21:11:57] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [21:14:52] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) reloading scholarly_articles on wdqs1024.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250714/ using stat1009.eqiad.wmnet) [21:17:20] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1024.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:17:22] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T386098, transfer newly-reloaded data) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1024.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:17:24] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - 
https://phabricator.wikimedia.org/T386098 [21:18:19] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1024.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:21:35] (03CR) 10BCornwall: [V:03+2 C:03+2] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1175587 (owner: 10Ncmonitor) [21:22:01] !log brett@dns1004 START - running authdns-update [21:22:56] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-scholarly_443: Servers wdqs1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:23:34] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-scholarly_443: Servers wdqs1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:23:46] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=wdqs-scholarly,name=eqiad [21:24:27] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [21:24:30] ^^ expected [21:24:42] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [21:24:44] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [21:24:59] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [21:25:00] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [21:25:16] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [21:25:36] !log brett@dns1004 END - running authdns-update [21:26:02] jclark@cumin1002 reimage (PID 2378866) is awaiting input [21:26:09] I've acked the pybal alerts, eqiad is depooled and we should not get any further alerts [21:27:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to 
https://phabricator.wikimedia.org/P80856 and previous config saved to /var/cache/conftool/dbconfig/20250805-212701-fceratto.json [21:28:26] (03PS1) 10Cwhite: opensearch: add extra config option to elasticsearch-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1175953 [21:30:13] (03PS2) 10Cwhite: opensearch: add extra config option to elasticsearch-exporter [puppet] - 10https://gerrit.wikimedia.org/r/1175953 [21:31:21] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1007.eqiad.wmnet with OS bookworm [21:31:32] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11062792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm executed with errors: - dbprov1007... [21:31:43] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [21:31:45] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [21:31:46] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [21:31:48] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [21:31:49] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [21:31:52] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [21:37:54] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1007.eqiad.wmnet with OS bookworm [21:38:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11062802 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm [21:40:40] (03CR) 10Cwhite: [C:03+2] "PCC OK: https://puppet-compiler.wmflabs.org/output/1175953/6501/" [puppet] - 10https://gerrit.wikimedia.org/r/1175953 (owner: 10Cwhite) 
[21:42:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P80857 and previous config saved to /var/cache/conftool/dbconfig/20250805-214208-fceratto.json [21:46:19] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2010.codfw.wmnet w/ force delete existing files, repooling both afterwards [21:46:22] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [21:48:31] (03PS1) 10Cwhite: logging: remove unexpected true [puppet] - 10https://gerrit.wikimedia.org/r/1175956 [21:48:59] (03CR) 10Cwhite: [V:03+2 C:03+2] logging: remove unexpected true [puppet] - 10https://gerrit.wikimedia.org/r/1175956 (owner: 10Cwhite) [21:55:36] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1007.eqiad.wmnet with reason: host reimage [21:57:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T399728)', diff saved to https://phabricator.wikimedia.org/P80858 and previous config saved to /var/cache/conftool/dbconfig/20250805-215715-fceratto.json [21:57:19] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [21:57:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1209.eqiad.wmnet with reason: Maintenance [21:57:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T399728)', diff saved to https://phabricator.wikimedia.org/P80859 and previous config saved to /var/cache/conftool/dbconfig/20250805-215738-fceratto.json [21:59:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1007.eqiad.wmnet with reason: host reimage 
[22:02:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T399728)', diff saved to https://phabricator.wikimedia.org/P80860 and previous config saved to /var/cache/conftool/dbconfig/20250805-220238-fceratto.json [22:02:42] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [22:03:34] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:03:56] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:05:11] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1024.eqiad.wmnet w/ force delete existing files, repooling both afterwards [22:05:14] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [22:17:04] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:17:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P80861 and previous config saved to /var/cache/conftool/dbconfig/20250805-221746-fceratto.json [22:18:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:18:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov1007.eqiad.wmnet with OS bookworm [22:18:51] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - 
https://phabricator.wikimedia.org/T400412#11062904 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host dbprov1007.eqiad.wmnet with OS bookworm completed: - dbprov1007 (**PASS**)... [22:19:18] (03PS1) 10Santiago Faci: xLab: Deploy v0.8.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175958 (https://phabricator.wikimedia.org/T384107) [22:23:26] (03CR) 10Clare Ming: [C:03+2] xLab: Deploy v0.8.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175958 (https://phabricator.wikimedia.org/T384107) (owner: 10Santiago Faci) [22:24:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11062935 (10Jclark-ctr) [22:24:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11062936 (10Jclark-ctr) 05Open→03Resolved [22:25:00] (03Merged) 10jenkins-bot: xLab: Deploy v0.8.1 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175958 (https://phabricator.wikimedia.org/T384107) (owner: 10Santiago Faci) [22:32:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P80862 and previous config saved to /var/cache/conftool/dbconfig/20250805-223253-fceratto.json [22:48:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T399728)', diff saved to https://phabricator.wikimedia.org/P80863 and previous config saved to /var/cache/conftool/dbconfig/20250805-224801-fceratto.json [22:48:05] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [22:48:10] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [22:48:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 4:00:00 on db1214.eqiad.wmnet with reason: Maintenance [22:48:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T399728)', diff saved to https://phabricator.wikimedia.org/P80864 and previous config saved to /var/cache/conftool/dbconfig/20250805-224824-fceratto.json [22:48:40] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [22:51:59] (03PS1) 10Santiago Faci: xLab: Deploy v0.8.1 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175961 (https://phabricator.wikimedia.org/T384107) [22:52:35] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-scholarly,name=eqiad [22:53:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T399728)', diff saved to https://phabricator.wikimedia.org/P80865 and previous config saved to /var/cache/conftool/dbconfig/20250805-225320-fceratto.json [22:53:24] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [22:55:48] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2011.codfw.wmnet w/ force delete existing files, repooling both afterwards [22:55:51] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [22:59:33] (03CR) 10Tim Starling: "It authorizes my specific user account (tstarling@wikimedia.org) so it's not really proper as a permanent solution. 
Once the sitemap is in" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175850 (owner: 10Tim Starling) [23:08:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P80866 and previous config saved to /var/cache/conftool/dbconfig/20250805-230828-fceratto.json [23:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:23:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P80867 and previous config saved to /var/cache/conftool/dbconfig/20250805-232336-fceratto.json [23:30:07] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175617 (owner: 10TrainBranchBot) [23:38:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175968 [23:38:34] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175968 (owner: 10TrainBranchBot) [23:38:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T399728)', diff saved to https://phabricator.wikimedia.org/P80868 and previous config saved to /var/cache/conftool/dbconfig/20250805-233843-fceratto.json [23:38:47] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [23:39:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1226.eqiad.wmnet with reason: Maintenance [23:39:07] !log fceratto@cumin1002 dbctl 
commit (dc=all): 'Depooling db1226 (T399728)', diff saved to https://phabricator.wikimedia.org/P80869 and previous config saved to /var/cache/conftool/dbconfig/20250805-233907-fceratto.json [23:43:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T399728)', diff saved to https://phabricator.wikimedia.org/P80870 and previous config saved to /var/cache/conftool/dbconfig/20250805-234358-fceratto.json [23:44:02] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [23:50:37] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175968 (owner: 10TrainBranchBot) [23:55:22] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2011.codfw.wmnet w/ force delete existing files, repooling both afterwards [23:55:26] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [23:59:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P80871 and previous config saved to /var/cache/conftool/dbconfig/20250805-235905-fceratto.json