[00:05:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:15] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db1240 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 603.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:10:31] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 619.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:28] <wikibugs>	 (CR) Dzahn: [C:-1] "I see! I think I missed those because I searched for gerrit.wikimedia.org. I looked again at the alertmanager yaml.  You are right, there " [puppet] - https://gerrit.wikimedia.org/r/1113163 (owner: Jelto)
[00:21:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:24:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10487049 (phaultfinder)
[00:35:30] <wikibugs>	 (PS1) Dzahn: alertmanager: add missing route for sre-collab-releng receiver [puppet] - https://gerrit.wikimedia.org/r/1113594
[00:36:27] <wikibugs>	 (PS1) Andrew Bogott: cephosd.cfg partman: reduce minimum partition sizes [puppet] - https://gerrit.wikimedia.org/r/1113595 (https://phabricator.wikimedia.org/T383817)
[00:37:37] <wikibugs>	 (PS2) Andrew Bogott: cephosd.cfg partman: reduce minimum partition sizes [puppet] - https://gerrit.wikimedia.org/r/1113595 (https://phabricator.wikimedia.org/T383817)
[00:38:04] <wikibugs>	 (PS2) Dzahn: alertmanager: add missing route for sre-collab-releng receiver [puppet] - https://gerrit.wikimedia.org/r/1113594
[00:38:19] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cephosd.cfg partman: reduce minimum partition sizes [puppet] - https://gerrit.wikimedia.org/r/1113595 (https://phabricator.wikimedia.org/T383817) (owner: Andrew Bogott)
[00:38:34] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1113596
[00:38:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1113596 (owner: TrainBranchBot)
[00:41:49] <logmsgbot>	 !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[00:42:13] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[00:42:18] <wikibugs>	 (CR) Dzahn: [C:-1] "also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1113594 as an alternative suggestion to fix this - which would keep informin" [puppet] - https://gerrit.wikimedia.org/r/1113163 (owner: Jelto)
[00:44:06] <tzatziki>	 !log removing 1 file for legal compliance
[00:44:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:35] <tzatziki>	 !log removing 2 files for legal compliance
[00:54:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:30] <wikibugs>	 (PS1) Andrew Bogott: cephosd.cfg partman: reduce minimum partition sizes, again [puppet] - https://gerrit.wikimedia.org/r/1113597 (https://phabricator.wikimedia.org/T383817)
[00:58:10] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1113596 (owner: TrainBranchBot)
[00:58:11] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cephosd.cfg partman: reduce minimum partition sizes, again [puppet] - https://gerrit.wikimedia.org/r/1113597 (https://phabricator.wikimedia.org/T383817) (owner: Andrew Bogott)
[00:59:46] <logmsgbot>	 !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[01:00:23] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[01:06:11] <logmsgbot>	 !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[01:06:38] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[01:08:22] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1113598
[01:08:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1113598 (owner: TrainBranchBot)
[01:21:33] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:22:27] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:22:55] <wikibugs>	 (PS1) Andrew Bogott: squid_exporter: make http_proxy optional [puppet] - https://gerrit.wikimedia.org/r/1113599
[01:23:06] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1012.eqiad.wmnet with reason: host reimage
[01:24:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10487134 (phaultfinder)
[01:26:29] <wikibugs>	 (CR) Andrew Bogott: "No big deal here, I just noticed this because puppet started failing last week on a deployment-prep VM.  Why last week, no idea." [puppet] - https://gerrit.wikimedia.org/r/1113599 (owner: Andrew Bogott)
[01:27:00] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1012.eqiad.wmnet with reason: host reimage
[01:29:35] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1113598 (owner: TrainBranchBot)
[01:30:43] <wikibugs>	 (CR) Eevans: "Of course!" [puppet] - https://gerrit.wikimedia.org/r/1113581 (https://phabricator.wikimedia.org/T368096) (owner: Eevans)
[01:36:09] <wikibugs>	 (CR) Dzahn: [C:+1] "back from 2020 https://gerrit.wikimedia.org/r/c/operations/puppet/+/579915" [puppet] - https://gerrit.wikimedia.org/r/1113599 (owner: Andrew Bogott)
[01:37:35] <wikibugs>	 (CR) Dzahn: Prometheus Squid exporter, specify proxy port (1 comment) [puppet] - https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) (owner: Ayounsi)
[01:46:25] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/398a8379f919b36c3c30162c6ac61d37db0f3c5790eecdd4b618010ab98ee51e/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:49:27] <logmsgbot>	 !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[01:50:00] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[02:02:31] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:06:25] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:06:50] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1012.eqiad.wmnet with reason: host reimage
[02:08:17] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db1240 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:10:15] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1012.eqiad.wmnet with reason: host reimage
[02:30:02] <wikibugs>	 (PS2) Andrew Bogott: squid_exporter: make http_proxy optional [puppet] - https://gerrit.wikimedia.org/r/1113599
[02:30:02] <wikibugs>	 (PS1) Andrew Bogott: Update nic names for cloudceph1012/bullseye [puppet] - https://gerrit.wikimedia.org/r/1113601
[02:30:54] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Update nic names for cloudceph1012/bullseye [puppet] - https://gerrit.wikimedia.org/r/1113601 (owner: Andrew Bogott)
[02:35:39] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[02:38:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10487171 (phaultfinder)
[03:22:28] <jinxer-wm>	 FIRING: SystemdUnitCrashLoop: logstash.service crashloop on elastic1088:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:27:28] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1074:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[03:32:28] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1074:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:05:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:21:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:24:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10487273 (phaultfinder)
[05:53:17] <icinga-wm>	 RECOVERY - Host ripe-atlas-eqsin is UP: PING WARNING - Packet loss = 60%, RTA = 30.84 ms
[05:59:41] <icinga-wm>	 PROBLEM - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[06:41:30] <wikibugs>	 (PS1) Marostegui: db2189: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1113718 (https://phabricator.wikimedia.org/T383709)
[06:41:41] <marostegui>	 !log Powering off db2189 for onsite maintenance T383709
[06:41:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:46] <stashbot>	 T383709: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709
[06:42:11] <wikibugs>	 (CR) Marostegui: [C:+2] db2189: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1113718 (https://phabricator.wikimedia.org/T383709) (owner: Marostegui)
[06:42:16] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2189.codfw.wmnet with reason: Onsite work
[06:42:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2189 T383709', diff saved to https://phabricator.wikimedia.org/P72237 and previous config saved to /var/cache/conftool/dbconfig/20250123-064241-marostegui.json
[06:50:03] <wikibugs>	 (CR) Marostegui: site.pp, db2134.yaml: db2134 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113482 (https://phabricator.wikimedia.org/T384476) (owner: Federico Ceratto)
[06:55:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es1021.eqiad.wmnet
[06:56:37] <wikibugs>	 (PS1) Marostegui: es1021: Remove [puppet] - https://gerrit.wikimedia.org/r/1113719 (https://phabricator.wikimedia.org/T384418)
[06:58:28] <wikibugs>	 (CR) Marostegui: [C:+2] es1021: Remove [puppet] - https://gerrit.wikimedia.org/r/1113719 (https://phabricator.wikimedia.org/T384418) (owner: Marostegui)
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T0700)
[07:00:05] <jouncebot>	 marostegui and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T0700).
[07:01:58] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[07:08:11] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[07:08:32] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[07:08:32] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:08:33] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1021.eqiad.wmnet
[07:09:31] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1021.eqiad.wmnet - https://phabricator.wikimedia.org/T384418#10487690 (Marostegui) a:Marostegui→None
[07:09:41] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1021.eqiad.wmnet - https://phabricator.wikimedia.org/T384418#10487694 (Marostegui) This is ready for #dc-ops
[07:13:32] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove es1022 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1113721 (https://phabricator.wikimedia.org/T384566)
[07:14:17] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Remove es1022 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1113721 (https://phabricator.wikimedia.org/T384566) (owner: Marostegui)
[07:15:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es1022 from dbctl T384566', diff saved to https://phabricator.wikimedia.org/P72239 and previous config saved to /var/cache/conftool/dbconfig/20250123-071529-root.json
[07:15:34] <stashbot>	 T384566: decommission es1022.eqiad.wmnet - https://phabricator.wikimedia.org/T384566
[07:16:56] <wikibugs>	 (PS1) Marostegui: es1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1113722 (https://phabricator.wikimedia.org/T384566)
[07:17:16] <wikibugs>	 (CR) CI reject: [V:-1] es1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1113722 (https://phabricator.wikimedia.org/T384566) (owner: Marostegui)
[07:17:37] <wikibugs>	 (PS2) Marostegui: es1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1113722 (https://phabricator.wikimedia.org/T384566)
[07:28:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
[07:29:15] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[07:29:16] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10487743 (ops-monitoring-bot) Draining ganeti2032.codfw.wmnet of running VMs
[07:35:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc1 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P72240 and previous config saved to /var/cache/conftool/dbconfig/20250123-073557-marostegui.json
[07:36:27] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2011.codfw.wmnet with reason: Kernel reboot
[07:36:47] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1011.eqiad.wmnet with reason: Kernel reboot
[07:39:43] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1113445 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[07:44:37] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Extend comment [puppet] - https://gerrit.wikimedia.org/r/1113487 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:47:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc1 after kernel reboots', diff saved to https://phabricator.wikimedia.org/P72241 and previous config saved to /var/cache/conftool/dbconfig/20250123-074759-marostegui.json
[07:48:02] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm, `0.3.4` is the "old" version with coredns `1.8.7`." [deployment-charts] - https://gerrit.wikimedia.org/r/1113453 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[07:49:35] <wikibugs>	 (PS1) Muehlenhoff: Add stub secrets for new master_bookworm roles [labs/private] - https://gerrit.wikimedia.org/r/1113740 (https://phabricator.wikimedia.org/T381565)
[07:50:03] <wikibugs>	 (PS2) Muehlenhoff: Add stub secrets for new master_bookworm roles [labs/private] - https://gerrit.wikimedia.org/r/1113740 (https://phabricator.wikimedia.org/T381565)
[07:55:37] <wikibugs>	 (PS2) DCausse: cirrus: drop cirrus_saneitize_jobs periodic job (2/2) [puppet] - https://gerrit.wikimedia.org/r/1113461
[07:55:37] <wikibugs>	 (PS1) DCausse: cirrus: drop cirrus_saneitize_jobs periodic job (1/2) [puppet] - https://gerrit.wikimedia.org/r/1113741
[08:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T0800). nyaa~
[08:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:29] <icinga-wm>	 PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 13739MiB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[08:00:35] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] Add stub secrets for new master_bookworm roles [labs/private] - https://gerrit.wikimedia.org/r/1113740 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:03:21] <moritzm>	 !log installing glibc updates on bullseye
[08:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:31] <wikibugs>	 (PS2) DCausse: wdqs: bump to 0.3.154 and enable event utilities APIs [deployment-charts] - https://gerrit.wikimedia.org/r/1113466 (https://phabricator.wikimedia.org/T374919)
[08:19:24] <wikibugs>	 (PS1) AikoChou: ml-services: update reference-quality docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1113742 (https://phabricator.wikimedia.org/T384172)
[08:21:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:56] <wikibugs>	 (PS3) DCausse: wdqs: bump to 0.3.154 and enable event utilities APIs (1/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113466 (https://phabricator.wikimedia.org/T374919)
[08:21:56] <wikibugs>	 (PS1) DCausse: wdqs: bump to 0.3.154 and enable event utilities APIs (2/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113743 (https://phabricator.wikimedia.org/T374919)
[08:21:58] <wikibugs>	 (PS1) DCausse: wdqs: bump to 0.3.154 and enable event utilities APIs (Step 3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113744 (https://phabricator.wikimedia.org/T374919)
[08:22:45] <wikibugs>	 (PS2) DCausse: wdqs: bump to 0.3.154 and enable event utilities APIs (3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113744 (https://phabricator.wikimedia.org/T374919)
[08:25:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc2 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P72242 and previous config saved to /var/cache/conftool/dbconfig/20250123-082545-marostegui.json
[08:26:01] <wikibugs>	 (PS1) DCausse: wdqs: cleanup unused settings [deployment-charts] - https://gerrit.wikimedia.org/r/1113745 (https://phabricator.wikimedia.org/T374919)
[08:26:05] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1012.eqiad.wmnet with reason: Kernel reboot
[08:26:56] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2012.codfw.wmnet with reason: Kernel reboot
[08:28:03] <wikibugs>	 (PS9) Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565)
[08:33:24] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:35:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc2 after kernel reboots', diff saved to https://phabricator.wikimedia.org/P72244 and previous config saved to /var/cache/conftool/dbconfig/20250123-083524-marostegui.json
[08:46:05] <wikibugs>	 (CR) DCausse: [C:+2] wdqs: bump to 0.3.154 and enable event utilities APIs (1/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113466 (https://phabricator.wikimedia.org/T374919) (owner: DCausse)
[08:47:24] <wikibugs>	 (Merged) jenkins-bot: wdqs: bump to 0.3.154 and enable event utilities APIs (1/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113466 (https://phabricator.wikimedia.org/T374919) (owner: DCausse)
[08:48:09] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[08:49:20] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[08:50:06] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] Add BGP data collection from network devices over GNMI [puppet] - https://gerrit.wikimedia.org/r/1113449 (https://phabricator.wikimedia.org/T369384) (owner: Cathal Mooney)
[08:52:13] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] librenms: Ensure the cache/data directory belongs to librenms [puppet] - https://gerrit.wikimedia.org/r/1113587 (https://phabricator.wikimedia.org/T384440) (owner: Andrea Denisse)
[08:52:43] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[08:52:57] <wikibugs>	 (PS1) Muehlenhoff: profile::maps::osm_master: Make tilerator_pass optional [puppet] - https://gerrit.wikimedia.org/r/1113746 (https://phabricator.wikimedia.org/T381565)
[08:53:33] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[08:57:27] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[08:57:59] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:00:15] <jouncebot>	 brennen and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T0900).
[09:09:32] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM, just the nit about the naming (feel free to ignore)" [alerts] - https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961) (owner: FNegri)
[09:17:20] <wikibugs>	 (CR) JMeybohm: [C:+2] Pin coredns version on all clustes to 0.3.4 [deployment-charts] - https://gerrit.wikimedia.org/r/1113453 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[09:17:50] <wikibugs>	 (CR) DCausse: [C:+2] wdqs: bump to 0.3.154 and enable event utilities APIs (2/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113743 (https://phabricator.wikimedia.org/T374919) (owner: DCausse)
[09:20:03] <wikibugs>	 (CR) Federico Ceratto: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1113722 (https://phabricator.wikimedia.org/T384566) (owner: Marostegui)
[09:21:15] <wikibugs>	 (Merged) jenkins-bot: Pin coredns version on all clustes to 0.3.4 [deployment-charts] - https://gerrit.wikimedia.org/r/1113453 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[09:21:54] <wikibugs>	 (Merged) jenkins-bot: wdqs: bump to 0.3.154 and enable event utilities APIs (2/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113743 (https://phabricator.wikimedia.org/T374919) (owner: DCausse)
[09:22:19] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[09:22:41] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:26:17] <wikibugs>	 (CR) Marostegui: [C:+2] es1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1113722 (https://phabricator.wikimedia.org/T384566) (owner: Marostegui)
[09:27:06] <wikibugs>	 (CR) Jelto: [C:+1] "looks good to me" [deployment-charts] - https://gerrit.wikimedia.org/r/1113454 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[09:30:38] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp
[09:32:41] <wikibugs>	 (CR) Jelto: [C:+1] "🍿" [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[09:33:25] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1113746 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:35:15] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_ulsfo
[09:36:25] <wikibugs>	 (PS2) Federico Ceratto: site.pp, db2134.yaml: db2134 [puppet] - https://gerrit.wikimedia.org/r/1113482 (https://phabricator.wikimedia.org/T384476)
[09:36:41] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (206752s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:37:40] <wikibugs>	 (PS1) DCausse: cirrus: cleanup unused settings [mediawiki-config] - https://gerrit.wikimedia.org/r/1113750 (https://phabricator.wikimedia.org/T374702)
[09:39:52] <wikibugs>	 (PS1) Vgutierrez: Revert^2 "hiera: Issue unified cert with pki.goog on acmechief-test" [puppet] - https://gerrit.wikimedia.org/r/1113751
[09:40:03] <wikibugs>	 (PS1) JMeybohm: Update cert-manager to 1.16.3 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1113752 (https://phabricator.wikimedia.org/T341984)
[09:43:00] <wikibugs>	 (CR) Vgutierrez: [C:+2] Revert^2 "hiera: Issue unified cert with pki.goog on acmechief-test" [puppet] - https://gerrit.wikimedia.org/r/1113751 (owner: Vgutierrez)
[09:45:06] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10487931 (MoritzMuehlenhoff) a:MoritzMuehlenhoff 0.14.1 is out, I'll import and upgrade
[09:45:32] <wikibugs>	 (CR) Marostegui: [C:+1] "Looks good, remember to merge this AFTER the script has run" [puppet] - https://gerrit.wikimedia.org/r/1113482 (https://phabricator.wikimedia.org/T384476) (owner: Federico Ceratto)
[09:45:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
[09:49:29] <wikibugs>	 (CR) DCausse: [C:+2] cirrus-streaming-updater: index wikitech [deployment-charts] - https://gerrit.wikimedia.org/r/1113462 (owner: DCausse)
[09:50:46] <wikibugs>	 (Merged) jenkins-bot: cirrus-streaming-updater: index wikitech [deployment-charts] - https://gerrit.wikimedia.org/r/1113462 (owner: DCausse)
[09:51:01] <wikibugs>	 (PS2) Btullis: Raise the weight of all analytics mariadb replica srv records [dns] - https://gerrit.wikimedia.org/r/1113505 (https://phabricator.wikimedia.org/T382947)
[09:51:19] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[09:51:37] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:53:27] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp
[09:53:50] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp
[09:54:27] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_ulsfo
[09:55:00] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/1113473 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[09:55:32] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2032.codfw.wmnet with reason: remove from cluster for reimage
[09:55:38] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10487951 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=93df70a9-c65f-4aaf-8a3d-5ab698636ed0) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[09:57:50] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch ganeti2032 to nftables [puppet] - https://gerrit.wikimedia.org/r/1113465 (owner: Muehlenhoff)
[10:01:47] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_magru
[10:01:55] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Raise the weight of all analytics mariadb replica srv records [dns] - 10https://gerrit.wikimedia.org/r/1113505 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis)
[10:02:19] <logmsgbot>	 !log btullis@dns1004 START - running authdns-update
[10:04:10] <logmsgbot>	 !log btullis@dns1004 END - running authdns-update
[10:05:13] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[10:05:22] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:08:48] <moritzm>	 !log installing routinator security updates
[10:08:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:03] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[10:14:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2032.codfw.wmnet with OS bookworm
[10:14:57] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10488006 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2032.codfw.wmnet with OS bookworm
[10:15:37] <wikibugs>	 (03PS2) 10JMeybohm: Update cert-manager to 1.16.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113752 (https://phabricator.wikimedia.org/T341984)
[10:16:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:18:12] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp
[10:19:58] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_magru
[10:22:00] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:24:11] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:24:33] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:26:36] <logmsgbot>	 !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ms-be2075.codfw.wmnet with reason: hardware broken awaiting vendor action
[10:26:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:26:47] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10488032 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=62b3cb8f-dcae-4290-af1d-2a50d3785cb2) set by mvernon@cumin2002 for 7 days, 0:00:00 on 1 host(s) and t...
[10:32:21] <wikibugs>	 06SRE, 10SRE-Access-Requests: Add kemayo to the deployment group - https://phabricator.wikimedia.org/T384493#10488038 (10jcrespo) 05Open→03Resolved a:05jcrespo→03CDanis
[10:32:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2032.codfw.wmnet with reason: host reimage
[10:32:58] <wikibugs>	 (03PS6) 10Jcrespo: admin: Deploy WMDE privatedata policy change to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/1113420 (https://phabricator.wikimedia.org/T381824)
[10:35:00] <wikibugs>	 (03CR) 10Jelto: "Looks mostly good, I left some comments about the redundant name suffixes." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[10:36:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2032.codfw.wmnet with reason: host reimage
[10:39:00] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113463 (owner: 10DCausse)
[10:39:03] <logmsgbot>	 jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[10:39:37] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113750 (https://phabricator.wikimedia.org/T374702) (owner: 10DCausse)
[10:41:10] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] "I saw no one objecting to both the patch and the docs, so merging." [puppet] - 10https://gerrit.wikimedia.org/r/1113420 (https://phabricator.wikimedia.org/T381824) (owner: 10Jcrespo)
[10:43:39] <wikibugs>	 (03PS1) 10Urbanecm: Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113760 (https://phabricator.wikimedia.org/T384254)
[10:43:50] <wikibugs>	 (03PS1) 10Urbanecm: Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113761 (https://phabricator.wikimedia.org/T384254)
[10:44:34] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.01.11 - 2025.01.31), 13Patch-For-Review: Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10488103 (10jcrespo) 05Open→03Resolved This is now applied.
[10:46:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[10:46:23] <urbanecm>	 jouncebot: nowandnext
[10:46:23] <jouncebot>	 For the next 0 hour(s) and 13 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T0900)
[10:46:23] <jouncebot>	 In 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1100)
[10:46:38] <urbanecm>	 I'm going to deploy a fix for a train blocker
[10:46:50] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113760 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm)
[10:46:54] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113761 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm)
[10:46:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:49:09] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10488137 (10jcrespo) @Neslihan_Turan_WMDE This is still blocked on you providing an email and your developer (Gerrit/IDM/LDAP) account id.
[10:50:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113761 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm)
[10:50:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113760 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm)
[10:50:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: "FYI this is causing PuppetConstantChange alerts on k8s hosts due to" [puppet] - 10https://gerrit.wikimedia.org/r/1112782 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm)
[10:51:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:51:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:52:17] <godog>	 jayme: ^ FYI https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112782/comments/428c5bfb_d0cab2ae
[10:56:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2032.codfw.wmnet with OS bookworm
[10:56:22] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10488187 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2032.codfw.wmnet with OS bookworm completed: - ganeti203...
[10:57:52] <jynus>	 !log pausing media backups on eqiad for maintenance T383902
[10:57:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:56] <stashbot>	 T383902: Upgrade backup source or mediabackup database host os to Debian bookworm or decommission them - https://phabricator.wikimedia.org/T383902
[11:00:14] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1100)
[11:04:00] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1204.eqiad.wmnet with reason: os upgrade
[11:04:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
[11:04:45] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1205.eqiad.wmnet with reason: os upgrade
[11:06:41] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1204.eqiad.wmnet with OS bookworm
[11:08:06] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_eqsin
[11:09:19] <wikibugs>	 (03PS2) 10FNegri: wmcs: update kernel alerts [alerts] - 10https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961)
[11:09:25] <wikibugs>	 (03Merged) 10jenkins-bot: Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113760 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm)
[11:09:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113761 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm)
[11:09:48] <wikibugs>	 (03CR) 10FNegri: wmcs: update kernel alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[11:12:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
[11:13:00] <wikibugs>	 (03CR) 10Urbanecm: [V:03+2 C:03+2] Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113761 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm)
[11:13:06] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113761 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm)
[11:13:14] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-wmde: remove extra network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109926 (https://phabricator.wikimedia.org/T380613) (owner: 10Brouberol)
[11:13:21] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: DRY extra volume mounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113198 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol)
[11:14:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2032.codfw.wmnet to cluster codfw and group B
[11:14:12] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1113761|Remove GEInfoboxTemplatesTest (T384254)]], [[gerrit:1113760|Remove GEInfoboxTemplatesTest (T384254)]]
[11:14:16] <stashbot>	 T384254: Beta cluster log spam: MediaWiki\Extension\CommunityConfiguration\Access\MediaWikiConfigReader was unable to find GEInfoboxTemplatesTest in community configuration, returning configuration from the fallback config - https://phabricator.wikimedia.org/T384254
[11:14:49] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2032.codfw.wmnet to cluster codfw and group B
[11:17:38] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[11:18:14] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[11:19:43] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1113761|Remove GEInfoboxTemplatesTest (T384254)]], [[gerrit:1113760|Remove GEInfoboxTemplatesTest (T384254)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:19:48] <stashbot>	 T384254: Beta cluster log spam: MediaWiki\Extension\CommunityConfiguration\Access\MediaWikiConfigReader was unable to find GEInfoboxTemplatesTest in community configuration, returning configuration from the fallback config - https://phabricator.wikimedia.org/T384254
[11:23:30] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1204.eqiad.wmnet with reason: host reimage
[11:25:23] <wikibugs>	 (03PS1) 10Vgutierrez: secret: Add dummy pki.goog private key [labs/private] - 10https://gerrit.wikimedia.org/r/1113764
[11:25:41] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+2 C:03+2] secret: Add dummy pki.goog private key [labs/private] - 10https://gerrit.wikimedia.org/r/1113764 (owner: 10Vgutierrez)
[11:25:49] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10488255 (10MoritzMuehlenhoff)
[11:26:40] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1204.eqiad.wmnet with reason: host reimage
[11:28:21] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10488267 (10fgiunchedi) I'm assuming you meant "this won't be too hard", anyways the simplest solution off the top of my head would be to have a map network...
[11:28:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet
[11:28:48] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_eqsin
[11:28:51] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10488268 (10ops-monitoring-bot) Draining ganeti2022.codfw.wmnet of running VMs
[11:29:07] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113765
[11:30:23] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Deploy pki.goog account on acmechief hosts [puppet] - 10https://gerrit.wikimedia.org/r/1113766 (https://phabricator.wikimedia.org/T384195)
[11:31:22] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113766 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[11:31:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2136 T384479', diff saved to https://phabricator.wikimedia.org/P72247 and previous config saved to /var/cache/conftool/dbconfig/20250123-113157-fceratto.json
[11:32:02] <stashbot>	 T384479: decommission db2136.codfw.wmnet - https://phabricator.wikimedia.org/T384479
[11:33:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
[11:33:49] <wikibugs>	 (03CR) 10Zoe: [C:03+1] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113765 (owner: 10PipelineBot)
[11:34:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet
[11:34:12] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10488279 (10ops-monitoring-bot) Draining ganeti2022.codfw.wmnet of running VMs
[11:34:42] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_codfw
[11:35:02] <logmsgbot>	 !log urbanecm@deploy2002 Sync cancelled.
[11:35:45] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1113761|Remove GEInfoboxTemplatesTest (T384254)]], [[gerrit:1113760|Remove GEInfoboxTemplatesTest (T384254)]]
[11:35:49] <stashbot>	 T384254: Beta cluster log spam: MediaWiki\Extension\CommunityConfiguration\Access\MediaWikiConfigReader was unable to find GEInfoboxTemplatesTest in community configuration, returning configuration from the fallback config - https://phabricator.wikimedia.org/T384254
[11:37:40] <vgutierrez>	 !log upload acme-chief 0.38 to apt.wm.org (bookworm-wikimedia)
[11:37:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:26] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Issue unified cert using pki.goog [puppet] - 10https://gerrit.wikimedia.org/r/1113768 (https://phabricator.wikimedia.org/T384195)
[11:47:02] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[11:47:24] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[11:47:37] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[11:48:22] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[11:48:58] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113766 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[11:49:42] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1204.eqiad.wmnet with OS bookworm
[11:51:02] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10488310 (10Neslihan_Turan_WMDE) Hi, yesterday a problem about my Wikitech account has been fixed. I think now we should be able to proceed. My WMDE email address is neslihan.turan@wiki...
[11:51:09] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113761|Remove GEInfoboxTemplatesTest (T384254)]], [[gerrit:1113760|Remove GEInfoboxTemplatesTest (T384254)]] (duration: 15m 23s)
[11:51:13] <stashbot>	 T384254: Beta cluster log spam: MediaWiki\Extension\CommunityConfiguration\Access\MediaWikiConfigReader was unable to find GEInfoboxTemplatesTest in community configuration, returning configuration from the fallback config - https://phabricator.wikimedia.org/T384254
[11:51:16] <urbanecm>	 finally
[11:51:22] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113768 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[11:51:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10488314 (10kamila)
[11:52:40] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_codfw
[11:53:50] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_drmrs
[11:54:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ganeti2022 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113769
[11:59:12] <wikibugs>	 (03PS1) 10Kamila Součková: wikikube: rename parse100[1-6] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113770 (https://phabricator.wikimedia.org/T365571)
[12:00:19] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2134.codfw.wmnet
[12:05:14] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.dns.netbox
[12:05:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:09:49] <federico3>	 there's a pending change in DNS for wmf6779 https://phabricator.wikimedia.org/P72248
[12:11:38] <wikibugs>	 (03PS3) 10Btullis: dumps: Configure snapshot1012 with the dumps trait [puppet] - 10https://gerrit.wikimedia.org/r/1113475 (https://phabricator.wikimedia.org/T382947)
[12:12:54] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2134.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002"
[12:13:25] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2134.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002"
[12:13:25] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:13:25] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2134.codfw.wmnet
[12:13:36] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_drmrs
[12:14:07] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] site.pp, db2134.yaml: db2134 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113482 (https://phabricator.wikimedia.org/T384476) (owner: 10Federico Ceratto)
[12:14:22] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "decommission script ran, merging." [puppet] - 10https://gerrit.wikimedia.org/r/1113482 (https://phabricator.wikimedia.org/T384476) (owner: 10Federico Ceratto)
[12:14:25] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] site.pp, db2134.yaml: db2134 [puppet] - 10https://gerrit.wikimedia.org/r/1113482 (https://phabricator.wikimedia.org/T384476) (owner: 10Federico Ceratto)
[12:14:37] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add BGP data collection from network devices over GNMI [puppet] - 10https://gerrit.wikimedia.org/r/1113449 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney)
[12:15:59] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113765 (owner: 10PipelineBot)
[12:16:18] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113768 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[12:16:49] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] hiera: Deploy pki.goog account on acmechief hosts [puppet] - 10https://gerrit.wikimedia.org/r/1113766 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[12:17:10] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113765 (owner: 10PipelineBot)
[12:17:12] <federico3>	 !log Removing db2134 from zarcillo T384476
[12:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:17] <stashbot>	 T384476: decommission db2134.codfw.wmnet - https://phabricator.wikimedia.org/T384476
[12:18:38] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2134.codfw.wmnet - https://phabricator.wikimedia.org/T384476#10488500 (10FCeratto-WMF) 05In progress→03Open
[12:19:25] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2134.codfw.wmnet - https://phabricator.wikimedia.org/T384476#10488505 (10FCeratto-WMF)
[12:19:45] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:19:51] <wikibugs>	 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2134.codfw.wmnet - https://phabricator.wikimedia.org/T384476#10488511 (10FCeratto-WMF) Ready for DC ops to decommission
[12:19:51] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:20:35] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:20:41] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:21:38] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[12:21:51] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[12:23:46] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[12:23:54] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[12:24:21] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: DBRecordCache: handle default section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113788 (https://phabricator.wikimedia.org/T382947)
[12:25:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] DBRecordCache: handle default section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113788 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto)
[12:26:13] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10488524 (10cmooney) >>! In T384345#10488267, @fgiunchedi wrote: > I'm assuming you meant "this won't be too hard", anyways the simplest solution off the top...
[12:28:39] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10488529 (10cmooney)
[12:28:40] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488530 (10cmooney)
[12:28:46] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: DBRecordCache: handle default section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113788 (https://phabricator.wikimedia.org/T382947)
[12:31:12] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10488538 (10jcrespo) Thank you.  @KFrancis you have their provided email above: neslihan.turan@wikimedia.de  @Neslihan_Turan_WMDE Please note the uid identifier associated with that em...
[12:31:51] <marostegui>	 !log Deploy schema change on s8 codfw with replication dbmaint T384592
[12:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:56] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[12:34:55] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] DBRecordCache: handle default section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113788 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto)
[12:36:44] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[12:37:02] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[12:37:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T384592)', diff saved to https://phabricator.wikimedia.org/P72249 and previous config saved to /var/cache/conftool/dbconfig/20250123-123708-marostegui.json
[12:37:13] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[12:39:14] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1113770 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková)
[12:41:16] <topranks>	 !log restarting gnmic.service on netflow1002 
[12:41:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:31] <jinxer-wm>	 FIRING: [2x] Emergency syslog message: Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[12:46:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:49:31] <jinxer-wm>	 RESOLVED: [2x] Emergency syslog message: Device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[12:51:21] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on netflow1002.eqiad.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd
[12:51:28] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488586 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a6b392ba-8b36-4fa0-8d3d-10c8b2d2eb48) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and th...
[12:51:59] <wikibugs>	 (CR) Brouberol: [C:+1] dumps: Configure snapshot1012 with the dumps trait (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113475 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1300)
[13:02:20] <wikibugs>	 (PS1) Ladsgroup: file: Add caller to write queries [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113799 (https://phabricator.wikimedia.org/T384481)
[13:02:36] <Amir1>	 jouncebot: nowandnext
[13:02:36] <jouncebot>	 For the next 0 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1300)
[13:02:36] <jouncebot>	 In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1400)
[13:02:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72250 and previous config saved to /var/cache/conftool/dbconfig/20250123-130253-root.json
[13:03:04] <Amir1>	 is there any deployment for Mobileapps/RESTBase/Wikifeeds happening?
[13:03:10] <wikibugs>	 (CR) Ladsgroup: [C:+2] file: Add caller to write queries [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113799 (https://phabricator.wikimedia.org/T384481) (owner: Ladsgroup)
[13:03:14] <wikibugs>	 (PS2) Muehlenhoff: sre.hosts.reimage: Add link to the help text for move-vlan [cookbooks] - https://gerrit.wikimedia.org/r/1112171
[13:04:14] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s2 #page on db1222 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: ptwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:04:23] <marostegui>	 ^ taking it
[13:04:28] <jynus>	 thanks
[13:04:47] <wikibugs>	 (PS1) JMeybohm: Pin cert-manager version on all clustes to 1.10.6 [deployment-charts] - https://gerrit.wikimedia.org/r/1113800 (https://phabricator.wikimedia.org/T341984)
[13:04:49] <wikibugs>	 (PS1) JMeybohm: Update cert-manager to 1.16.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1113801
[13:04:50] <Emperor>	 !incidents
[13:04:51] <sirenbot>	 5626 (UNACKED)  db1222 (paged)/MariaDB Replica SQL: s2 (paged)
[13:04:51] <sirenbot>	 5625 (RESOLVED)  ProbeDown sre (185.15.59.225 ip4 text:80 probes/service http_text_ip4 esams)
[13:04:56] <Emperor>	 !ack 5626
[13:04:56] <sirenbot>	 5626 (ACKED)  db1222 (paged)/MariaDB Replica SQL: s2 (paged)
[13:05:02] <Emperor>	 marostegui: thanks <3
[13:05:30] <effie>	 marostegui: tx 
[13:05:33] <Emperor>	 (should I resolve the p.age for that?)
[13:05:33] <marostegui>	 This is eqiad master, I will fix it to restart replication and schedule a master switch
[13:05:38] <marostegui>	 Emperor: please go ahead yes
[13:05:43] <Emperor>	 !resolve 5626
[13:05:43] <sirenbot>	 5626 (RESOLVED)  db1222 (paged)/MariaDB Replica SQL: s2 (paged)
[13:05:46] <marostegui>	 the recovery should be arriving in a bit
[13:06:14] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse[1001-1006].eqiad.wmnet
[13:06:15] <wikibugs>	 (CR) Kamila Součková: [C:+2] wikikube: rename parse100[1-6] to wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1113770 (https://phabricator.wikimedia.org/T365571) (owner: Kamila Součková)
[13:06:36] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1113802 (https://phabricator.wikimedia.org/T384597)
[13:07:14] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s2 #page on db1222 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:07:48] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Index
[13:08:03] <jynus>	 there is lag on another s2 host: db1182
[13:08:12] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10488665 (Jhancock.wm) a:Jhancock.wm
[13:08:19] <marostegui>	 all of them are lagging jynus, as it was the intermediate master
[13:08:25] <marostegui>	 should recover soon
[13:08:27] <jynus>	 I get it now
[13:08:35] <marostegui>	 I downtimed it for 1h though
[13:08:39] <wikibugs>	 ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383862#10488671 (Jhancock.wm) a:Jhancock.wm
[13:08:40] <marostegui>	 So it doesn't bother oncall
[13:08:52] <marostegui>	 https://phabricator.wikimedia.org/T384597 task for the switchover
[13:09:04] <Amir1>	 I'm around if you need me for anything
[13:09:15] <marostegui>	 nah it is all good Amir1 
[13:09:26] <jynus>	 I am worried that prometheus didn't get that lag
[13:09:34] <jynus>	 only icinga, so there is a regression there
[13:09:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1162 T384597', diff saved to https://phabricator.wikimedia.org/P72251 and previous config saved to /var/cache/conftool/dbconfig/20250123-130937-marostegui.json
[13:09:42] <stashbot>	 T384597: Switchover s2 master (db1222 -> db1162) - https://phabricator.wikimedia.org/T384597
[13:09:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse[1001-1006].eqiad.wmnet
[13:10:18] <dcausse>	 jouncebot: nowandnext
[13:10:18] <jouncebot>	 For the next 0 hour(s) and 49 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1300)
[13:10:19] <jouncebot>	 In 0 hour(s) and 49 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1400)
[13:10:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1001 to wikikube-worker1142
[13:10:56] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:11:16] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10488675 (Jhancock.wm)
[13:11:26] <wikibugs>	 (CR) CI reject: [V:-1] sre.hosts.reimage: Add link to the help text for move-vlan [cookbooks] - https://gerrit.wikimedia.org/r/1112171 (owner: Muehlenhoff)
[13:11:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:12:18] <wikibugs>	 (CR) DCausse: [C:+2] wdqs: bump to 0.3.154 and enable event utilities APIs (3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113744 (https://phabricator.wikimedia.org/T374919) (owner: DCausse)
[13:13:18] <marostegui>	 lag in eqiad s2 all good now
[13:13:24] <marostegui>	 Currently rebooting the candidate master
[13:13:33] <marostegui>	 To get it ready to become a dc master soonish
[13:13:59] <wikibugs>	 (Merged) jenkins-bot: wdqs: bump to 0.3.154 and enable event utilities APIs (3/3) [deployment-charts] - https://gerrit.wikimedia.org/r/1113744 (https://phabricator.wikimedia.org/T374919) (owner: DCausse)
[13:14:24] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[13:14:34] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1001 to wikikube-worker1142 - kamila@cumin1002"
[13:14:44] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1002 to wikikube-worker1143
[13:14:48] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device fasw2-c1a-eqiad
[13:14:48] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-c1a-eqiad
[13:14:49] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[13:14:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1001 to wikikube-worker1142 - kamila@cumin1002"
[13:14:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:14:50] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1142
[13:15:05] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:15:31] <wikibugs>	 (PS2) Elukey: drivers.py: add container_limits to the Docker driver [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1113477
[13:15:41] <wikibugs>	 (CR) CI reject: [V:-1] file: Add caller to write queries [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113799 (https://phabricator.wikimedia.org/T384481) (owner: Ladsgroup)
[13:15:46] <wikibugs>	 (CR) Elukey: drivers.py: add container_limits to the Docker driver (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1113477 (owner: Elukey)
[13:16:00] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1142
[13:16:25] <wikibugs>	 (CR) Elukey: [C:+1] profile::maps::osm_master: Make tilerator_pass optional [puppet] - https://gerrit.wikimedia.org/r/1113746 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:16:39] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1001 to wikikube-worker1142
[13:17:04] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1162.eqiad.wmnet with reason: Index rebuild
[13:17:15] <wikibugs>	 (CR) Elukey: [C:+1] benthos: add nocookies and tls session metadata [puppet] - https://gerrit.wikimedia.org/r/1112248 (https://phabricator.wikimedia.org/T383900) (owner: Filippo Giunchedi)
[13:17:52] <wikibugs>	 (PS1) Marostegui: Revert "db2166: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1113804
[13:17:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72252 and previous config saved to /var/cache/conftool/dbconfig/20250123-131758-root.json
[13:18:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72253 and previous config saved to /var/cache/conftool/dbconfig/20250123-131805-root.json
[13:18:26] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2166: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1113804 (owner: Marostegui)
[13:18:40] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1002 to wikikube-worker1143 - kamila@cumin1002"
[13:18:54] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1003 to wikikube-worker1144
[13:18:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1002 to wikikube-worker1143 - kamila@cumin1002"
[13:18:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:18:59] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1143
[13:19:14] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:19:48] <wikibugs>	 (CR) Elukey: [C:+1] Update istio to 1.24.2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1113507 (https://phabricator.wikimedia.org/T373526) (owner: JMeybohm)
[13:20:04] <wikibugs>	 (Merged) jenkins-bot: file: Add caller to write queries [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113799 (https://phabricator.wikimedia.org/T384481) (owner: Ladsgroup)
[13:20:07] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] dumps: Configure snapshot1012 with the dumps trait [puppet] - https://gerrit.wikimedia.org/r/1113475 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[13:20:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1143
[13:20:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1002 to wikikube-worker1143
[13:20:59] <wikibugs>	 (PS1) Elukey: mapnik: skip copying mapnik files to /usr/local [docker-images/production-images] - https://gerrit.wikimedia.org/r/1113805 (https://phabricator.wikimedia.org/T384285)
[13:21:33] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1113799|file: Add caller to write queries (T384481)]]
[13:21:37] <stashbot>	 T384481: Set new file tables to write both in production - https://phabricator.wikimedia.org/T384481
[13:21:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:23:24] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1003 to wikikube-worker1144 - kamila@cumin1002"
[13:23:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1004 to wikikube-worker1145
[13:23:41] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1003 to wikikube-worker1144 - kamila@cumin1002"
[13:23:41] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:23:42] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1144
[13:23:56] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:24:35] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1113799|file: Add caller to write queries (T384481)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:24:39] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[13:24:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10488719 (phaultfinder)
[13:24:53] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1144
[13:25:31] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1003 to wikikube-worker1144
[13:26:25] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10488724 (Neslihan_Turan_WMDE) Yes, that's me @jcrespo
[13:27:18] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10488725 (jcrespo)
[13:28:15] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1004 to wikikube-worker1145 - kamila@cumin1002"
[13:28:39] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1005 to wikikube-worker1146
[13:28:43] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1004 to wikikube-worker1145 - kamila@cumin1002"
[13:28:43] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:28:43] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1145
[13:28:59] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:29:30] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10488746 (jcrespo) Thank you, now only waiting on NDA to be filled in and we can apply the privilege change.  I am sorry to hear you had problems with Wikitech, apparently the migrat...
[13:29:33] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow1002.eqiad.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd
[13:29:38] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488748 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f0f61f83-b1f7-48c8-9e4a-2e436917a7d3) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th...
[13:30:03] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1145
[13:30:42] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1004 to wikikube-worker1145
[13:31:17] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113799|file: Add caller to write queries (T384481)]] (duration: 09m 43s)
[13:31:21] <stashbot>	 T384481: Set new file tables to write both in production - https://phabricator.wikimedia.org/T384481
[13:31:50] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_esams
[13:33:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72255 and previous config saved to /var/cache/conftool/dbconfig/20250123-133304-root.json
[13:33:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72256 and previous config saved to /var/cache/conftool/dbconfig/20250123-133311-root.json
[13:36:44] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1113507 (https://phabricator.wikimedia.org/T373526) (owner: JMeybohm)
[13:37:19] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:37:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1005 to wikikube-worker1146 - kamila@cumin1002"
[13:38:27] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1005 to wikikube-worker1146 - kamila@cumin1002"
[13:38:27] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:38:28] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1146
[13:38:57] <Lucas_WMDE>	 FYI dcausse and others, I will not be able to do the backport+config window today, sorry
[13:39:23] * TheresNoTime can do!
[13:39:23] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:40:02] <wikibugs>	 (CR) Andrew Bogott: [C:+2] squid_exporter: make http_proxy optional [puppet] - https://gerrit.wikimedia.org/r/1113599 (owner: Andrew Bogott)
[13:41:03] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1006 to wikikube-worker1147
[13:41:16] <godog>	 !log bounce mtail on centrallog2002 - high system cpu usage and perf top reports native_queued_spin_lock_slowpath
[13:41:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:24] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:41:31] <wikibugs>	 (PS1) Federico Ceratto: instances.yaml, db2136.yaml, site.pp: Remove db2136 [puppet] - https://gerrit.wikimedia.org/r/1113807 (https://phabricator.wikimedia.org/T384479)
[13:42:00] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1146
[13:42:03] <wikibugs>	 (CR) Andrew Bogott: [C:+2] deployment-prep hiera: remove uses of .eqiad.wmflabs tld [puppet] - https://gerrit.wikimedia.org/r/1113468 (https://phabricator.wikimedia.org/T380679) (owner: Andrew Bogott)
[13:42:39] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1005 to wikikube-worker1146
[13:43:22] <wikibugs>	 (PS2) AikoChou: ml-services: update reference-quality docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1113742 (https://phabricator.wikimedia.org/T384172)
[13:44:05] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Deploy pki.goog account on acmechief hosts [puppet] - https://gerrit.wikimedia.org/r/1113766 (https://phabricator.wikimedia.org/T384195) (owner: Vgutierrez)
[13:46:32] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1239.eqiad.wmnet with reason: reimage
[13:47:10] <wikibugs>	 (CR) Muehlenhoff: [C:+2] profile::maps::osm_master: Make tilerator_pass optional [puppet] - https://gerrit.wikimedia.org/r/1113746 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:48:23] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Make maps-test2001 a bookworm maps master node [puppet] - https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:48:32] <wikibugs>	 (PS10) Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565)
[13:48:40] <wikibugs>	 (CR) Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:48:44] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:48:49] <wikibugs>	 (CR) Marostegui: [C:+1] "Remember that once this is merged, you'll have to go to any cumin host and commit the change." [puppet] - https://gerrit.wikimedia.org/r/1113807 (https://phabricator.wikimedia.org/T384479) (owner: Federico Ceratto)
[13:49:31] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_esams
[13:49:56] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_eqiad
[13:50:44] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1006 to wikikube-worker1147 - kamila@cumin1002"
[13:50:54] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1239.eqiad.wmnet with OS bookworm
[13:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:52:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1006 to wikikube-worker1147 - kamila@cumin1002"
[13:52:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:52:16] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1147
[13:52:51] <marostegui>	 federico3: There is a change from db2140 waiting to be merged in dbctl
[13:53:23] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[13:53:36] <marostegui>	 federico3: I assume it is yours, for the decomm of db2140?
[13:53:40] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1113805 (https://phabricator.wikimedia.org/T384285) (owner: Elukey)
[13:53:46] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[13:54:24] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1147
[13:55:03] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1006 to wikikube-worker1147
[13:56:14] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] Enroll 1% of client sessions in PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113567 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[13:56:23] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] Enroll 0.1% of client sessions in PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113566 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[13:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:56:50] <federico3>	 interesting, I suppose it could be a timeout due to the cookbook waiting for confirmation... maybe?
[13:56:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2140 T384480', diff saved to https://phabricator.wikimedia.org/P72257 and previous config saved to /var/cache/conftool/dbconfig/20250123-135655-fceratto.json
[13:56:59] <stashbot>	 T384480: decommission db2140.codfw.wmnet - https://phabricator.wikimedia.org/T384480
[13:57:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72258 and previous config saved to /var/cache/conftool/dbconfig/20250123-135704-root.json
[13:57:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72259 and previous config saved to /var/cache/conftool/dbconfig/20250123-135704-root.json
[13:57:30] <marostegui>	 federico3: which cookbook?
[13:57:55] <federico3>	 depooling db2140
[13:58:01] <marostegui>	 federico3: My guess is that you did dbctl instance db2140 depool but didn't issue the dbctl config commit -m "blablabl" to commit the change
[13:58:07] <wikibugs>	 (PS1) Muehlenhoff: osm_master: Provide a dummy variable for tilerator on bookworm roles [puppet] - https://gerrit.wikimedia.org/r/1113810 (https://phabricator.wikimedia.org/T381565)
[13:58:09] <marostegui>	 So the change is pending to be committed 
[13:58:24] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[13:58:34] <marostegui>	 federico3: If a change isn't committed, it will block all the other pending changes
[13:58:46] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[13:58:49] <wikibugs>	 (CR) JMeybohm: [C:+2] "Oops. I'll clean that up manually - it's only staging-codfw hosts that have files in there. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1112782 (https://phabricator.wikimedia.org/T365687) (owner: JMeybohm)
[13:59:40] <federico3>	 I started the depooling cookbook, then it asked me for final confirmation "Enter y or yes to confirm:" and I was checking with you 
[14:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1400).
[14:00:05] <jouncebot>	 dcausse: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:15] <marostegui>	 federico3: yeah, I guess it went above the threshold of the alert
[14:00:24] <marostegui>	 and that's why it fired
[14:00:44] <TheresNoTime>	 o/ can deploy
[14:00:52] <dcausse>	 o/
[14:01:05] <wikibugs>	 (CR) Jelto: "thanks for adding the new team! One comment on-line regarding the different receivers." [puppet] - https://gerrit.wikimedia.org/r/1113594 (owner: Dzahn)
[14:01:41] <wikibugs>	 (CR) Kevin Bazira: [C:+1] ml-services: update reference-quality docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1113742 (https://phabricator.wikimedia.org/T384172) (owner: AikoChou)
[14:03:30] <TheresNoTime>	 dcausse: `Change '1113462', project 'operations/deployment-charts', branch 'master' not found in any deployed wikiversion. Deployed wikiversions: ['1.44.0-wmf.12', '1.44.0-wmf.13']` — issue with the "depends-on" ?
[14:03:53] <dcausse>	 TheresNoTime: looking
[14:05:14] <TheresNoTime>	 https://phabricator.wikimedia.org/P72260 for ref (also bug with the 'N' selection, heh)
[14:05:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool db1165', diff saved to https://phabricator.wikimedia.org/P72261 and previous config saved to /var/cache/conftool/dbconfig/20250123-140524-marostegui.json
[14:05:29] <dcausse>	 TheresNoTime: did not know that scap would complain if Depends-On was on a non-MW repo
[14:05:33] <dcausse>	 will remove
[14:05:46] <TheresNoTime>	 ack, haven't seen that before personally!
[14:06:30] <wikibugs>	 (03PS2) 10DCausse: cirrus: stop writing to wikitech index from the MW JobQueue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113463
[14:06:30] <wikibugs>	 (03PS2) 10DCausse: cirrus: cleanup unused settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113750 (https://phabricator.wikimedia.org/T374702)
[14:06:36] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[14:06:42] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[14:06:44] <dcausse>	 TheresNoTime: uploaded new ones, and please feel free to deploy both of them at once
[14:06:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T384592)', diff saved to https://phabricator.wikimedia.org/P72262 and previous config saved to /var/cache/conftool/dbconfig/20250123-140649-marostegui.json
[14:06:53] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[14:07:25] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_eqiad
[14:07:38] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1239.eqiad.wmnet with reason: host reimage
[14:07:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113463 (owner: 10DCausse)
[14:07:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113750 (https://phabricator.wikimedia.org/T374702) (owner: 10DCausse)
[14:08:34] <wikibugs>	 (03CR) 10Klausman: [C:03+1] slo_template: update SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1113212 (owner: 10BCornwall)
[14:08:37] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: stop writing to wikitech index from the MW JobQueue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113463 (owner: 10DCausse)
[14:08:40] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: cleanup unused settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113750 (https://phabricator.wikimedia.org/T374702) (owner: 10DCausse)
[14:09:11] <logmsgbot>	 !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1113463|cirrus: stop writing to wikitech index from the MW JobQueue]], [[gerrit:1113750|cirrus: cleanup unused settings (T374702)]]
[14:09:15] <stashbot>	 T374702: Cleanup: Remove deprecated weighted tag methods - https://phabricator.wikimedia.org/T374702
[14:09:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T384592)', diff saved to https://phabricator.wikimedia.org/P72263 and previous config saved to /var/cache/conftool/dbconfig/20250123-140957-marostegui.json
[14:10:34] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1239.eqiad.wmnet with reason: host reimage
[14:12:07] <wikibugs>	 (03CR) 10Dzahn: alertmanager: add missing route for sre-collab-releng receiver (2 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn)
[14:12:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72264 and previous config saved to /var/cache/conftool/dbconfig/20250123-141209-root.json
[14:12:58] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532)
[14:12:59] <dcausse>	 TheresNoTime: when deploying there might be a few warnings in the log "Received {$jobType} job with {$updateGroup} updates for an unwritable cluster $cluster." these are expected and can be ignored
[14:13:26] <TheresNoTime>	 ack
[14:13:48] <logmsgbot>	 !log samtar@deploy2002 dcausse, samtar: Backport for [[gerrit:1113463|cirrus: stop writing to wikitech index from the MW JobQueue]], [[gerrit:1113750|cirrus: cleanup unused settings (T374702)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:13:49] <dcausse>	 but I suspect there won't be that many, it would only be pending writes from wikitech, which I suspect are not that many
[14:14:08] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108428 (https://phabricator.wikimedia.org/T351452) (owner: 10Muehlenhoff)
[14:14:08] <TheresNoTime>	 dcausse: anything you need to test further? ^
[14:14:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli)
[14:14:12] <dcausse>	 TheresNoTime: this can't be tested on test servers
[14:14:13] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove obsolete puppetmaster::standalone role [puppet] - 10https://gerrit.wikimedia.org/r/1108428 (https://phabricator.wikimedia.org/T351452)
[14:14:19] <logmsgbot>	 !log samtar@deploy2002 dcausse, samtar: Continuing with sync
[14:16:04] <wikibugs>	 (03PS1) 10Cathal Mooney: Revert "Add BGP data collection from network devices over GNMI" This reverts commit a8bc5da977f0de2aa87e0060b40df3197240189c. [puppet] - 10https://gerrit.wikimedia.org/r/1113812
[14:16:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "Add BGP data collection from network devices over GNMI" This reverts commit a8bc5da977f0de2aa87e0060b40df3197240189c. [puppet] - 10https://gerrit.wikimedia.org/r/1113812 (owner: 10Cathal Mooney)
[14:18:34] <wikibugs>	 (03CR) 10FNegri: [C:03+2] prometheus-node-kernel-panic: remove "absent" lines [puppet] - 10https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[14:20:00] <wikibugs>	 (03PS2) 10Cathal Mooney: Revert "Add BGP data collection from network devices over GNMI" [puppet] - 10https://gerrit.wikimedia.org/r/1113812
[14:20:27] <wikibugs>	 (03CR) 10FNegri: [C:03+2] prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[14:20:30] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Revert "Add BGP data collection from network devices over GNMI" [puppet] - 10https://gerrit.wikimedia.org/r/1113812 (owner: 10Cathal Mooney)
[14:20:36] <wikibugs>	 (03CR) 10FNegri: prometheus-node-kernel-panic: remove "absent" lines [puppet] - 10https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[14:20:48] <wikibugs>	 (03CR) 10FNegri: [C:03+2] prometheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[14:21:11] <logmsgbot>	 !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113463|cirrus: stop writing to wikitech index from the MW JobQueue]], [[gerrit:1113750|cirrus: cleanup unused settings (T374702)]] (duration: 12m 00s)
[14:21:16] <stashbot>	 T374702: Cleanup: Remove deprecated weighted tag methods - https://phabricator.wikimedia.org/T374702
[14:21:23] <TheresNoTime>	 dcausse: live :)
[14:21:29] <dcausse>	 TheresNoTime: thanks! :)
[14:21:49] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1142.eqiad.wmnet wikikube-worker1143.eqiad.wmnet wikikube-worker1144.eqiad.wmnet wikikube-worker1145.eqiad.wmnet wikikube-worker1146.eqiad.wmnet wikikube-worker1147.eqiad.wmnet on all recursors
[14:21:52] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1142.eqiad.wmnet wikikube-worker1143.eqiad.wmnet wikikube-worker1144.eqiad.wmnet wikikube-worker1145.eqiad.wmnet wikikube-worker1146.eqiad.wmnet wikikube-worker1147.eqiad.wmnet on all recursors
[14:21:59] <wikibugs>	 (03PS2) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532)
[14:22:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:23:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli)
[14:24:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10488868 (10phaultfinder)
[14:25:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P72266 and previous config saved to /var/cache/conftool/dbconfig/20250123-142504-marostegui.json
[14:26:07] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] mapnik: skip copying mapnik files to /usr/local [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113805 (https://phabricator.wikimedia.org/T384285) (owner: 10Elukey)
[14:26:19] <TheresNoTime>	 !log UTC afternoon backport window done
[14:26:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:27] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "this looks good and should add the missing team and receiver, minor concerns in-line." [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn)
[14:26:53] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108428 (https://phabricator.wikimedia.org/T351452) (owner: 10Muehlenhoff)
[14:27:01] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1142.eqiad.wmnet with OS bookworm
[14:27:04] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1142
[14:27:04] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1142
[14:27:14] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1143.eqiad.wmnet with OS bookworm
[14:27:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72267 and previous config saved to /var/cache/conftool/dbconfig/20250123-142715-root.json
[14:27:17] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1143
[14:27:17] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1143
[14:27:24] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1144.eqiad.wmnet with OS bookworm
[14:27:28] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1144
[14:27:28] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1144
[14:27:31] <wikibugs>	 (03CR) 10Elukey: [C:03+1] osm_master: Provide a dummy variable for tilerator on bookworm roles [puppet] - 10https://gerrit.wikimedia.org/r/1113810 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[14:27:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1145.eqiad.wmnet with OS bookworm
[14:27:36] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1145
[14:27:37] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1145
[14:27:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:27:49] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1146.eqiad.wmnet with OS bookworm
[14:27:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1146
[14:27:53] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1146
[14:28:03] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1147.eqiad.wmnet with OS bookworm
[14:28:06] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1147
[14:28:06] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1147
[14:28:44] <wikibugs>	 (03PS1) 10FNegri: base::cloud_production: fix dep name [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961)
[14:28:48] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS6460
[14:28:48] <icinga-wm>	 Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:28:48] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS6
[14:28:48] <icinga-wm>	 6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:28:58] <wikibugs>	 (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[14:30:49] <wikibugs>	 (03PS2) 10JMeybohm: Update cert-manager to 1.16.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113801
[14:31:16] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113742 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou)
[14:31:44] <wikibugs>	 (03PS2) 10FNegri: base::cloud_production: fix dep name [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961)
[14:31:51] <wikibugs>	 (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[14:32:23] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113742 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou)
[14:33:22] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1239.eqiad.wmnet with OS bookworm
[14:33:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:36:06] <wikibugs>	 (03PS3) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532)
[14:36:16] <sukhe>	 dhinus: https://puppetboard.wikimedia.org/failures
[14:36:34] <wikibugs>	 (03CR) 10Raymond Ndibe: [wmcs::kubeadm::core] remove kubeadm-flags.env (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245) (owner: 10Raymond Ndibe)
[14:36:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete puppetmaster::standalone role [puppet] - 10https://gerrit.wikimedia.org/r/1108428 (https://phabricator.wikimedia.org/T351452) (owner: 10Muehlenhoff)
[14:36:52] <sukhe>	 >  Could not find declared class prometheus::node_kernel_panic
[14:36:56] <sukhe>	 I think this might be related to the recent change
[14:37:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli)
[14:37:57] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:32] <wikibugs>	 (03PS3) 10FNegri: prometheus::node_kernel_messages: fix "absent" params [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961)
[14:39:06] <vgutierrez>	 !log updating acme-chief on acmechief1002
[14:39:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:19] <wikibugs>	 (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[14:40:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P72269 and previous config saved to /var/cache/conftool/dbconfig/20250123-144011-marostegui.json
[14:41:16] <wikibugs>	 (03PS4) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532)
[14:41:47] <wikibugs>	 (03CR) 10Raymond Ndibe: [wmcs::kubeadm::core] remove kubeadm-flags.env (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245) (owner: 10Raymond Ndibe)
[14:42:09] <wikibugs>	 (03CR) 10Raymond Ndibe: [wmcs::kubeadm::core] remove kubeadm-flags.env (2 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245) (owner: 10Raymond Ndibe)
[14:42:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete WMCS Puppet 5 master classes no longer used/needed [puppet] - 10https://gerrit.wikimedia.org/r/1108430 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[14:42:23] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488927 (10cmooney) So I rolled-back the patch to collect the BGP metrics.  The config puppet produced worked fine in magru and esams, but for some reason in eqiad stats...
[14:42:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli)
[14:42:53] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1144.eqiad.wmnet with reason: host reimage
[14:42:54] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1142.eqiad.wmnet with reason: host reimage
[14:43:01] <wikibugs>	 (03PS4) 10FNegri: prometheus::node_kernel_messages: fix timer params [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961)
[14:43:06] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1143.eqiad.wmnet with reason: host reimage
[14:43:08] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove one additional obsolete Puppet 5 for Cloud VPS class [puppet] - 10https://gerrit.wikimedia.org/r/1108431 (https://phabricator.wikimedia.org/T365798)
[14:43:19] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1145.eqiad.wmnet with reason: host reimage
[14:43:25] <wikibugs>	 (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[14:43:37] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1146.eqiad.wmnet with reason: host reimage
[14:43:40] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1147.eqiad.wmnet with reason: host reimage
[14:43:55] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Issue unified cert using pki.goog [puppet] - 10https://gerrit.wikimedia.org/r/1113768 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez)
[14:44:14] <dhinus>	 sukhe: yep on it
[14:44:29] <sukhe>	 <3
[14:44:31] <dhinus>	 I pushed a change without checking PCC first :/
[14:45:16] <cdanis>	 most of us have been there 😅
[14:45:30] <sukhe>	 dhinus: all good, not the first time, not the last, been there done that :)
[14:45:37] <dhinus>	 :D
[14:46:29] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1144.eqiad.wmnet with reason: host reimage
[14:46:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
[14:47:07] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108431 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[14:48:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:48:47] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] prometheus::node_kernel_messages: fix timer params [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[14:50:00] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1145.eqiad.wmnet with reason: host reimage
[14:50:07] <wikibugs>	 (03CR) 10FNegri: [C:03+2] prometheus::node_kernel_messages: fix timer params [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[14:51:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:52:00] <wikibugs>	 (03PS2) 10JMeybohm: Update coredns to 1.11.3 / coredns helm chart 1.37.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113454 (https://phabricator.wikimedia.org/T341984)
[14:52:00] <wikibugs>	 (03PS11) 10JMeybohm: Update staging-codfw to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984)
[14:52:00] <wikibugs>	 (03PS2) 10JMeybohm: Create a copy of the wikikube istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113473 (https://phabricator.wikimedia.org/T341984)
[14:52:01] <wikibugs>	 (03PS3) 10JMeybohm: Update wikikube istio 1.24.2 config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113474 (https://phabricator.wikimedia.org/T341984)
[14:53:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: demonstration - bking@cumin2002 - T380752
[14:53:15] <stashbot>	 T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752
[14:53:19] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1146.eqiad.wmnet with reason: host reimage
[14:53:34] <wikibugs>	 (03PS5) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532)
[14:54:47] <wikibugs>	 (03PS3) 10DCausse: eventstreams: add wikidata & commons RDF update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921)
[14:55:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T384592)', diff saved to https://phabricator.wikimedia.org/P72270 and previous config saved to /var/cache/conftool/dbconfig/20250123-145518-marostegui.json
[14:55:23] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[14:55:33] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[14:55:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T384592)', diff saved to https://phabricator.wikimedia.org/P72271 and previous config saved to /var/cache/conftool/dbconfig/20250123-145540-marostegui.json
[14:56:07] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1143.eqiad.wmnet with reason: host reimage
[14:58:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, we don't have a specified way for multiple teams at the moment, though what you did seems fine to me" [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn)
[14:58:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:59:12] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1147.eqiad.wmnet with reason: host reimage
[14:59:18] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10489014 (10Jhancock.wm) dell update. it's been escalated to the level 3 helpdesk. might not hear back from them until monday.
[15:00:39] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10489035 (10MatthewVernon) Thanks for the update!
[15:01:23] <wikibugs>	 (03PS3) 10FNegri: prometheus-node-kernel-panic: remove "absent" lines [puppet] - 10https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961)
[15:02:57] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:03:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T384592)', diff saved to https://phabricator.wikimedia.org/P72272 and previous config saved to /var/cache/conftool/dbconfig/20250123-150351-marostegui.json
[15:03:56] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[15:04:23] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1142.eqiad.wmnet with reason: host reimage
[15:05:56] <wikibugs>	 (03PS1) 10Vgutierrez: haproxy,hiera: Deploy unified-goog TLS material [puppet] - 10https://gerrit.wikimedia.org/r/1113818 (https://phabricator.wikimedia.org/T384606)
[15:06:02] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1144.eqiad.wmnet with OS bookworm
[15:09:52] <wikibugs>	 (03CR) 10Raymond Ndibe: "Tested on puppet server. file was removed from toolsbeta worker nfs 9 node. Also confirmed that a pod can still be scheduled (so doesn't a" [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245) (owner: 10Raymond Ndibe)
[15:11:00] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1145.eqiad.wmnet with OS bookworm
[15:11:01] <wikibugs>	 (03CR) 10Raymond Ndibe: "`" [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245) (owner: 10Raymond Ndibe)
[15:12:57] <wikibugs>	 (03PS2) 10Vgutierrez: haproxy,hiera: Deploy unified-goog TLS material [puppet] - 10https://gerrit.wikimedia.org/r/1113818 (https://phabricator.wikimedia.org/T384606)
[15:13:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1146.eqiad.wmnet with OS bookworm
[15:13:22] <wikibugs>	 (03CR) 10FNegri: [C:03+2] wmcs: update kernel alerts [alerts] - 10https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[15:13:45] <jinxer-wm>	 RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:14:33] <wikibugs>	 (03Merged) 10jenkins-bot: wmcs: update kernel alerts [alerts] - 10https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[15:15:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1143.eqiad.wmnet with OS bookworm
[15:18:12] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1147.eqiad.wmnet with OS bookworm
[15:18:13] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113818 (https://phabricator.wikimedia.org/T384606) (owner: 10Vgutierrez)
[15:18:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P72273 and previous config saved to /var/cache/conftool/dbconfig/20250123-151858-marostegui.json
[15:19:40] <brennen>	 jouncebot nowandnext
[15:19:40] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 40 minute(s)
[15:19:40] <jouncebot>	 In 0 hour(s) and 40 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1600)
[15:19:53] <wikibugs>	 (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[15:21:29] <brennen>	 !log 1.44.0-wmf.13 train (T382364): unblocked, rolling to group1
[15:21:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:33] <stashbot>	 T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364
[15:22:23] <wikibugs>	 (CR) AOkoth: "I think the batch/v1 is pretty stable: https://kubernetes.io/docs/reference/using-api/deprecation-guide/ from reading this. We can test as" [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[15:23:03] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Effie!" [alerts] - https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: Effie Mouzeli)
[15:23:10] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.44.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113821 (https://phabricator.wikimedia.org/T382364)
[15:23:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113821 (https://phabricator.wikimedia.org/T382364) (owner: TrainBranchBot)
[15:23:22] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1142.eqiad.wmnet with OS bookworm
[15:23:58] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.44.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113821 (https://phabricator.wikimedia.org/T382364) (owner: TrainBranchBot)
[15:24:08] <wikibugs>	 (CR) Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[15:24:37] <wikibugs>	 (CR) Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[15:27:57] <wikibugs>	 (CR) David Caro: [toolforge] persist target logs in /var/log/pods in journald (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[15:27:59] <dcausse>	 jouncebot: nowandnext
[15:27:59] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 32 minute(s)
[15:27:59] <jouncebot>	 In 0 hour(s) and 32 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1600)
[15:29:13] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10489148 (kamila)
[15:30:43] <wikibugs>	 ops-codfw, SRE, DC-Ops, decommission-hardware: decommission db2134.codfw.wmnet - https://phabricator.wikimedia.org/T384476#10489162 (Jhancock.wm) Open→Resolved a:FCeratto-WMF→Jhancock.wm
[15:31:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1142-1147].eqiad.wmnet
[15:31:27] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[1142-1147].eqiad.wmnet
[15:31:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10489182 (phaultfinder)
[15:31:40] <wikibugs>	 (CR) Ssingh: [C:+1] "Looks good, nicely done!" [puppet] - https://gerrit.wikimedia.org/r/1113818 (https://phabricator.wikimedia.org/T384606) (owner: Vgutierrez)
[15:34:03] <wikibugs>	 (CR) Vgutierrez: [C:+2] haproxy,hiera: Deploy unified-goog TLS material [puppet] - https://gerrit.wikimedia.org/r/1113818 (https://phabricator.wikimedia.org/T384606) (owner: Vgutierrez)
[15:34:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P72274 and previous config saved to /var/cache/conftool/dbconfig/20250123-153405-marostegui.json
[15:34:52] <wikibugs>	 ops-codfw, SRE, Cassandra, DC-Ops: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820#10489194 (Jhancock.wm) Open→Resolved a:Jhancock.wm not seeing any new errors on this machine. gonna close this ticket for now, but if it errors again, feel free to reopen or start...
[15:35:06] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1142-1147].eqiad.wmnet
[15:35:06] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[1142-1147].eqiad.wmnet
[15:35:08] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.13  refs T382364
[15:35:16] <stashbot>	 T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364
[15:36:12] <wikibugs>	 (CR) Federico Ceratto: [C:+1] instances.yaml, db2136.yaml, site.pp: Remove db2136 [puppet] - https://gerrit.wikimedia.org/r/1113807 (https://phabricator.wikimedia.org/T384479) (owner: Federico Ceratto)
[15:36:14] <wikibugs>	 (CR) Federico Ceratto: [C:+2] instances.yaml, db2136.yaml, site.pp: Remove db2136 [puppet] - https://gerrit.wikimedia.org/r/1113807 (https://phabricator.wikimedia.org/T384479) (owner: Federico Ceratto)
[15:36:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1142-1147].eqiad.wmnet
[15:36:26] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[1142-1147].eqiad.wmnet
[15:37:07] <wikibugs>	 (CR) Fabfur: [C:+1] haproxy,hiera: Deploy unified-goog TLS material [puppet] - https://gerrit.wikimedia.org/r/1113818 (https://phabricator.wikimedia.org/T384606) (owner: Vgutierrez)
[15:37:17] <wikibugs>	 (PS6) Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532)
[15:37:38] <wikibugs>	 (CR) Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release (3 comments) [alerts] - https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: Effie Mouzeli)
[15:38:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware, serviceops: decommission  mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10489207 (VRiley-WMF)
[15:38:30] <wikibugs>	 (CR) CI reject: [V:-1] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: Effie Mouzeli)
[15:40:20] <wikibugs>	 (CR) Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[15:40:47] <wikibugs>	 (CR) David Caro: [C:+1] prometheus-node-kernel-panic: remove "absent" lines [puppet] - https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961) (owner: FNegri)
[15:41:05] <wikibugs>	 (CR) FNegri: [C:+2] prometheus-node-kernel-panic: remove "absent" lines [puppet] - https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961) (owner: FNegri)
[15:42:32] <wikibugs>	 (PS7) Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532)
[15:43:47] <wikibugs>	 (CR) CI reject: [V:-1] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: Effie Mouzeli)
[15:48:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2189
[15:48:24] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[15:48:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2189
[15:48:30] <wikibugs>	 (CR) Scott French: "Thanks, Effie!" [alerts] - https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: Effie Mouzeli)
[15:48:46] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[15:50:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Removing db2136 T384479', diff saved to https://phabricator.wikimedia.org/P72276 and previous config saved to /var/cache/conftool/dbconfig/20250123-155016-fceratto.json
[15:50:21] <stashbot>	 T384479: decommission db2136.codfw.wmnet - https://phabricator.wikimedia.org/T384479
[15:50:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T384592)', diff saved to https://phabricator.wikimedia.org/P72277 and previous config saved to /var/cache/conftool/dbconfig/20250123-155023-marostegui.json
[15:50:28] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[15:50:35] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489374 (Jhancock.wm) @Marostegui db2189 is moved, updated, and pinging!
[15:50:39] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[15:50:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T384592)', diff saved to https://phabricator.wikimedia.org/P72278 and previous config saved to /var/cache/conftool/dbconfig/20250123-155045-marostegui.json
[15:50:48] <volans>	 federico3: FYI the above alert is what we get if there are changes in dbctl not committed for some time
[15:50:54] <volans>	 it will recover now ofc
[15:50:56] <wikibugs>	 (PS1) Jgiannelos: kartotherian: Bump image to latest [deployment-charts] - https://gerrit.wikimedia.org/r/1113824
[15:51:15] <wikibugs>	 (PS8) Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532)
[15:51:15] <volans>	 not sure if you had already the chance to see it in action, hence why I'm mentioning it :)
[15:51:34] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489378 (Jhancock.wm)
[15:52:17] <wikibugs>	 (CR) Jgiannelos: [C:+2] kartotherian: Bump image to latest [deployment-charts] - https://gerrit.wikimedia.org/r/1113824 (owner: Jgiannelos)
[15:53:15] <federico3>	 volans: I'm aware - I was looking at dbctl config diff but how is it getting changes from the CR?
[15:53:23] <wikibugs>	 (Merged) jenkins-bot: kartotherian: Bump image to latest [deployment-charts] - https://gerrit.wikimedia.org/r/1113824 (owner: Jgiannelos)
[15:53:24] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[15:53:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10489385 (fnegri)
[15:53:46] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[15:53:48] <volans>	 when you run puppet-merge
[15:55:24] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply
[15:56:07] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply
[15:56:18] <wikibugs>	 (CR) Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release (4 comments) [alerts] - https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: Effie Mouzeli)
[15:59:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T384592)', diff saved to https://phabricator.wikimedia.org/P72279 and previous config saved to /var/cache/conftool/dbconfig/20250123-155910-marostegui.json
[15:59:15] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[15:59:16] <wikibugs>	 (CR) Scott French: [C:+1] "Nice, thank you!" [alerts] - https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: Effie Mouzeli)
[16:00:04] <jouncebot>	 brennen and jeena: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1600).
[16:05:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:06:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10489448 (fnegri) This is firing again today.
[16:07:18] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489449 (Marostegui) >>! In T383709#10489374, @Jhancock.wm wrote: > @Marostegui db2189 is moved, updated, and...
[16:07:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72280 and previous config saved to /var/cache/conftool/dbconfig/20250123-160730-root.json
[16:09:23] <jinxer-wm>	 FIRING: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:10:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware, serviceops: decommission  mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10489466 (VRiley-WMF)
[16:11:16] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove one additional obsolete Puppet 5 for Cloud VPS class [puppet] - https://gerrit.wikimedia.org/r/1108431 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[16:12:19] <jinxer-wm>	 RESOLVED: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:14:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P72281 and previous config saved to /var/cache/conftool/dbconfig/20250123-161417-marostegui.json
[16:15:10] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:16:42] <wikibugs>	 ops-esams, ops-magru, SRE, DC-Ops, Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10489498 (RobH) a:RobH→Vgutierrez @Vgutierrez,  >>! In T382026#10489496, @RobH wrote: >> Good afternoon Dear >> The infrastructure team installed a Blanking Panel...
[16:16:55] <jinxer-wm>	 RESOLVED: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:18:57] <wikibugs>	 ops-esams, ops-magru, SRE, DC-Ops, Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10489523 (ssingh) a:Vgutierrez→BCornwall
[16:19:39] <wikibugs>	 (PS1) Brouberol: airflow: disable the wmf_auto_restart services along with airflow [puppet] - https://gerrit.wikimedia.org/r/1113833
[16:19:51] <wikibugs>	 (PS1) Subramanya Sastry: For Parsoid calls, treat preprocessing as starting in SOL state [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464)
[16:20:16] <wikibugs>	 (CR) DCausse: [C:+2] "PS3 only bumps from 0.10.0 (broken) to 0.11.0" [deployment-charts] - https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) (owner: DCausse)
[16:21:17] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: Subramanya Sastry)
[16:21:18] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1113833 (owner: Brouberol)
[16:21:20] <wikibugs>	 (PS2) Brouberol: airflow: disable the wmf_auto_restart services along with airflow [puppet] - https://gerrit.wikimedia.org/r/1113833
[16:21:40] <wikibugs>	 (Merged) jenkins-bot: eventstreams: add wikidata & commons RDF update streams [deployment-charts] - https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) (owner: DCausse)
[16:21:43] <wikibugs>	 (CR) Muehlenhoff: [C:+2] osm_master: Provide a dummy variable for tilerator on bookworm roles [puppet] - https://gerrit.wikimedia.org/r/1113810 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[16:22:09] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4859/co" [puppet] - https://gerrit.wikimedia.org/r/1113833 (owner: Brouberol)
[16:22:18] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[16:22:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72282 and previous config saved to /var/cache/conftool/dbconfig/20250123-162235-root.json
[16:22:42] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2022.codfw.wmnet with reason: remove from cluster for reimage
[16:22:48] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10489533 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=46a6b03e-0964-494b-92f3-40af6ca3beb9) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[16:22:53] <wikibugs>	 (CR) Brouberol: [V:+1 C:+2] airflow: disable the wmf_auto_restart services along with airflow [puppet] - https://gerrit.wikimedia.org/r/1113833 (owner: Brouberol)
[16:23:35] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch ganeti2022 to nftables [puppet] - https://gerrit.wikimedia.org/r/1113769 (owner: Muehlenhoff)
[16:23:36] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[16:24:03] <wikibugs>	 (CR) David Caro: [toolforge] persist target logs in /var/log/pods in journald (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[16:25:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1162 with weight 0 T384597', diff saved to https://phabricator.wikimedia.org/P72283 and previous config saved to /var/cache/conftool/dbconfig/20250123-162552-root.json
[16:25:57] <stashbot>	 T384597: Switchover s2 master (db1222 -> db1162) - https://phabricator.wikimedia.org/T384597
[16:26:05] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T384597
[16:26:38] <wikibugs>	 (PS2) Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1113802 (https://phabricator.wikimedia.org/T384597)
[16:26:54] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:27:11] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1113802 (https://phabricator.wikimedia.org/T384597) (owner: Gerrit maintenance bot)
[16:28:10] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:28:58] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[16:29:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P72284 and previous config saved to /var/cache/conftool/dbconfig/20250123-162924-marostegui.json
[16:29:52] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[16:31:59] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[16:33:10] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:33:11] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[16:33:24] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: demonstration - bking@cumin2002 - T380752
[16:33:28] <stashbot>	 T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752
[16:33:50] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[16:34:41] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[16:35:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware, serviceops: decommission  mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10489643 (VRiley-WMF) Open→Resolved
[16:36:54] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:36:55] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489651 (Jhancock.wm) @JMeybohm, what do you think of this schedule for getting these moved? wikikube-worker2...
[16:37:11] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[16:37:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T383638#10489656 (VRiley-WMF) a:VRiley-WMF
[16:37:59] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[16:38:07] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T383638#10489657 (VRiley-WMF) Open→Resolved Loose power cable.
[16:39:26] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[16:40:24] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[16:41:05] <wikibugs>	 (PS1) Elukey: services: bump kartotherian's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1113842 (https://phabricator.wikimedia.org/T384530)
[16:42:03] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489669 (JMeybohm) >>! In T383709#10489651, @Jhancock.wm wrote: > @JMeybohm, what do you think of this schedu...
[16:42:51] <marostegui>	 !log Starting s2 eqiad failover from db1222 to db1162 - T384597
[16:42:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:55] <stashbot>	 T384597: Switchover s2 master (db1222 -> db1162) - https://phabricator.wikimedia.org/T384597
[16:43:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1162 to s2 primary T384597', diff saved to https://phabricator.wikimedia.org/P72285 and previous config saved to /var/cache/conftool/dbconfig/20250123-164322-root.json
[16:44:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1222 T384597', diff saved to https://phabricator.wikimedia.org/P72286 and previous config saved to /var/cache/conftool/dbconfig/20250123-164415-marostegui.json
[16:44:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T384592)', diff saved to https://phabricator.wikimedia.org/P72287 and previous config saved to /var/cache/conftool/dbconfig/20250123-164431-marostegui.json
[16:44:43] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[16:44:46] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[16:44:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T384592)', diff saved to https://phabricator.wikimedia.org/P72288 and previous config saved to /var/cache/conftool/dbconfig/20250123-164453-marostegui.json
[16:45:42] <wikibugs>	 (CR) CI reject: [V:-1] For Parsoid calls, treat preprocessing as starting in SOL state [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: Subramanya Sastry)
[16:47:05] <wikibugs>	 (PS1) Marostegui: db1222: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1113844 (https://phabricator.wikimedia.org/T382842)
[16:47:29] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1222.eqiad.wmnet with reason: Index rebuild
[16:47:37] <wikibugs>	 (CR) Marostegui: [C:+2] db1222: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1113844 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[16:47:57] <wikibugs>	 (CR) Subramanya Sastry: "recheck" [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: Subramanya Sastry)
[16:51:22] <wikibugs>	 (PS1) CDanis: chart-renderer: new release (now w/ ECS) [deployment-charts] - https://gerrit.wikimedia.org/r/1113846 (https://phabricator.wikimedia.org/T383748)
[16:51:41] <wikibugs>	 (PS1) Marostegui: rebuild_tables.sh: STOP and START SLAVE [software] - https://gerrit.wikimedia.org/r/1113847 (https://phabricator.wikimedia.org/T382842)
[16:53:00] <wikibugs>	 (CR) Elukey: [C:+2] services: bump kartotherian's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1113842 (https://phabricator.wikimedia.org/T384530) (owner: Elukey)
[16:53:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T384592)', diff saved to https://phabricator.wikimedia.org/P72289 and previous config saved to /var/cache/conftool/dbconfig/20250123-165309-marostegui.json
[16:53:14] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[16:53:56] <wikibugs>	 (CR) Marostegui: "FYI" [software] - https://gerrit.wikimedia.org/r/1113847 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[16:53:58] <wikibugs>	 (CR) Marostegui: [C:+2] rebuild_tables.sh: STOP and START SLAVE [software] - https://gerrit.wikimedia.org/r/1113847 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[16:54:31] <wikibugs>	 (Merged) jenkins-bot: rebuild_tables.sh: STOP and START SLAVE [software] - https://gerrit.wikimedia.org/r/1113847 (https://phabricator.wikimedia.org/T382842) (owner: Marostegui)
[16:54:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:58:26] <wikibugs>	 (PS11) Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565)
[16:59:30] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[16:59:34] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[16:59:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:00:05] <jouncebot>	 jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:15] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[17:05:31] <wikibugs>	 (CR) CDanis: [C:-2] "to be deployed only after 1.44.0-wmf.13 is live on group2" [deployment-charts] - https://gerrit.wikimedia.org/r/1113846 (https://phabricator.wikimedia.org/T383748) (owner: CDanis)
[17:08:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P72290 and previous config saved to /var/cache/conftool/dbconfig/20250123-170816-marostegui.json
[17:09:12] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10489797 (fnegri) Moving it out of #wmcs-hardware and back to #cloud-services-team because otherwise @phaultfinder keeps on creating new tasks for this alert.
[17:09:42] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10489802 (fnegri)
[17:15:33] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489809 (Eevans) cassandra-dev2001 can be moved at your leisure (no coordination is needed).
[17:15:40] <wikibugs>	 (CR) Andrea Denisse: [C:+2] librenms: Ensure the cache/data directory belongs to librenms [puppet] - https://gerrit.wikimedia.org/r/1113587 (https://phabricator.wikimedia.org/T384440) (owner: Andrea Denisse)
[17:16:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10489811 (phaultfinder)
[17:18:33] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489812 (Jhancock.wm)
[17:20:10] <papaul>	 !log power down cassandra-dev2001 for maintenance 
[17:20:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:40] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10489846 (elukey) ` >>> pprint(r.request("get", "/redfish/v1/Chassis/HA-RAID.0.StorageEnclosure.1/Drives/Disk.Bay.7").json()) {'@odata...
[17:23:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P72292 and previous config saved to /var/cache/conftool/dbconfig/20250123-172323-marostegui.json
[17:26:32] <wikibugs>	 (PS12) JMeybohm: Update staging-codfw to k8s 1.31 [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984)
[17:26:32] <wikibugs>	 (PS3) JMeybohm: Create a copy of the wikikube istio config [deployment-charts] - https://gerrit.wikimedia.org/r/1113473 (https://phabricator.wikimedia.org/T341984)
[17:26:33] <wikibugs>	 (PS4) JMeybohm: Update wikikube istio 1.24.2 config [deployment-charts] - https://gerrit.wikimedia.org/r/1113474 (https://phabricator.wikimedia.org/T341984)
[17:28:58] <wikibugs>	 (PS1) Federico Ceratto: instances.yaml: Remove db2140 [puppet] - https://gerrit.wikimedia.org/r/1113849 (https://phabricator.wikimedia.org/T384480)
[17:33:07] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cassandra-dev2001
[17:33:12] <wikibugs>	 (CR) JMeybohm: Update staging-codfw to k8s 1.31 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[17:33:15] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cassandra-dev2001
[17:33:18] <wikibugs>	 (PS1) Andrea Denisse: librenms: Fix path to the cache/data directory [puppet] - https://gerrit.wikimedia.org/r/1113850 (https://phabricator.wikimedia.org/T384440)
[17:38:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T384592)', diff saved to https://phabricator.wikimedia.org/P72293 and previous config saved to /var/cache/conftool/dbconfig/20250123-173830-marostegui.json
[17:38:35] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[17:38:46] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[17:38:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T384592)', diff saved to https://phabricator.wikimedia.org/P72294 and previous config saved to /var/cache/conftool/dbconfig/20250123-173852-marostegui.json
[17:46:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T384592)', diff saved to https://phabricator.wikimedia.org/P72295 and previous config saved to /var/cache/conftool/dbconfig/20250123-174641-marostegui.json
[17:46:46] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[17:49:42] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:00:05] <jouncebot>	 bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1800).
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1800)
[18:01:06] <bd808>	 Nothing for me to ship today jouncebot. Thanks for the reminder though. You are a good bot. :)
[18:01:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P72296 and previous config saved to /var/cache/conftool/dbconfig/20250123-180148-marostegui.json
[18:05:42] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:05:56] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:08:38] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 6.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:08:48] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1142-1147].eqiad.wmnet
[18:09:31] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1142-1147].eqiad.wmnet
[18:12:36] <wikibugs>	 (PS1) Cathal Mooney: Add gnmic collection for network POPs [puppet] - https://gerrit.wikimedia.org/r/1113853 (https://phabricator.wikimedia.org/T384345)
[18:15:02] <wikibugs>	 (CR) Cathal Mooney: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1113853 (https://phabricator.wikimedia.org/T384345) (owner: Cathal Mooney)
[18:16:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P72297 and previous config saved to /var/cache/conftool/dbconfig/20250123-181655-marostegui.json
[18:17:20] <wikibugs>	 (PS2) Cathal Mooney: Add gnmic collection for network POPs [puppet] - https://gerrit.wikimedia.org/r/1113853 (https://phabricator.wikimedia.org/T384345)
[18:19:04] <wikibugs>	 (PS1) Andrew Bogott: dsh: remove librenms group entirely [puppet] - https://gerrit.wikimedia.org/r/1113855 (https://phabricator.wikimedia.org/T380679)
[18:19:37] <wikibugs>	 (CR) Andrew Bogott: [C:+2] dsh: remove librenms group entirely [puppet] - https://gerrit.wikimedia.org/r/1113855 (https://phabricator.wikimedia.org/T380679) (owner: Andrew Bogott)
[18:20:07] <wikibugs>	 (CR) Cathal Mooney: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1113853 (https://phabricator.wikimedia.org/T384345) (owner: Cathal Mooney)
[18:25:57] <wikibugs>	 (PS1) CDanis: tracing: lowercase headers before processing them [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113856 (https://phabricator.wikimedia.org/T384629)
[18:27:14] <wikibugs>	 (CR) JHathaway: [C:+1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/1113853 (https://phabricator.wikimedia.org/T384345) (owner: Cathal Mooney)
[18:31:06] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113856 (https://phabricator.wikimedia.org/T384629) (owner: CDanis)
[18:31:48] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Add gnmic collection for network POPs [puppet] - https://gerrit.wikimedia.org/r/1113853 (https://phabricator.wikimedia.org/T384345) (owner: Cathal Mooney)
[18:32:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T384592)', diff saved to https://phabricator.wikimedia.org/P72298 and previous config saved to /var/cache/conftool/dbconfig/20250123-183202-marostegui.json
[18:32:07] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[18:32:18] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[18:37:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:39:21] <wikibugs>	 (CR) CDanis: [C:+1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/1112248 (https://phabricator.wikimedia.org/T383900) (owner: Filippo Giunchedi)
[18:42:30] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:42:30] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1081460 (owner: Cwhite)
[18:43:36] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device cr2-eqord
[18:43:51] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqord
[18:44:21] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device cr2-eqdfw
[18:44:41] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqdfw
[18:49:11] <wikibugs>	 (CR) Dzahn: alertmanager: add missing route for sre-collab-releng receiver (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113594 (owner: Dzahn)
[18:51:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:53:04] <wikibugs>	 (PS3) Dzahn: alertmanager: add missing route for sre-collab-releng receiver [puppet] - https://gerrit.wikimedia.org/r/1113594
[18:53:34] <wikibugs>	 SRE, Infrastructure-Foundations, observability, SRE Observability (FY2024/2025-Q3): LibreNMS changes on every puppet run since upgrade to 24.12 - https://phabricator.wikimedia.org/T384440#10490214 (andrea.denisse)
[18:53:55] <wikibugs>	 SRE, Infrastructure-Foundations, observability, SRE Observability (FY2024/2025-Q3): LibreNMS changes on every puppet run since upgrade to 24.12 - https://phabricator.wikimedia.org/T384440#10490215 (andrea.denisse) Open→Resolved
[18:53:58] <wikibugs>	 (CR) Dzahn: alertmanager: add missing route for sre-collab-releng receiver (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113594 (owner: Dzahn)
[19:00:05] <jouncebot>	 brennen and jeena: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1900)
[19:01:04] <brennen>	 hello.
[19:02:53] <wikibugs>	 (CR) Majavah: [C:+1] C:netbox: Allow NDA group to access Netbox. [puppet] - https://gerrit.wikimedia.org/r/1070563 (https://phabricator.wikimedia.org/T373702) (owner: Slyngshede)
[19:04:52] <taavi>	 !log fix my netbox account T373702
[19:04:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:56] <stashbot>	 T373702: Unable to log in to Netbox - https://phabricator.wikimedia.org/T373702
[19:08:12] <wikibugs>	 (PS2) Raymond Ndibe: [wmcs::kubeadm::core] remove kubeadm-flags.env [puppet] - https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T374193)
[19:11:26] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[19:14:01] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[19:14:42] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[19:15:52] <brennen>	 !log 1.44.0-wmf.13 train (T382364): no current blockers, logs relatively clean, rolling to all wikis.
[19:15:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:56] <stashbot>	 T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364
[19:16:10] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.44.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113862 (https://phabricator.wikimedia.org/T382364)
[19:16:12] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group2 to 1.44.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113862 (https://phabricator.wikimedia.org/T382364) (owner: TrainBranchBot)
[19:16:22] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:16:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10490318 (phaultfinder)
[19:16:57] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.44.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1113862 (https://phabricator.wikimedia.org/T382364) (owner: TrainBranchBot)
[19:18:02] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[19:18:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T384592)', diff saved to https://phabricator.wikimedia.org/P72299 and previous config saved to /var/cache/conftool/dbconfig/20250123-191808-marostegui.json
[19:18:13] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[19:18:18] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  clou~dgw1004 - vriley@cumin1002"
[19:18:23] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  clou~dgw1004 - vriley@cumin1002"
[19:18:23] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:19:16] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:19:29] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:19:34] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:21:20] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:21:31] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:22:37] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:22:48] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:24:51] <wikibugs>	 (CR) BCornwall: [V:+2 C:+2] slo_template: update SLO dates to current window [grafana-grizzly] - https://gerrit.wikimedia.org/r/1113212 (owner: BCornwall)
[19:25:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T384592)', diff saved to https://phabricator.wikimedia.org/P72300 and previous config saved to /var/cache/conftool/dbconfig/20250123-192517-marostegui.json
[19:25:22] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[19:26:59] <wikibugs>	 ops-eqiad, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T384645 (phaultfinder) NEW
[19:33:10] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.13  refs T382364
[19:33:15] <stashbot>	 T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364
[19:37:25] <wikibugs>	 ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10490386 (Jhancock.wm)
[19:38:09] <cdanis>	 brennen: things looking good?
[19:40:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P72301 and previous config saved to /var/cache/conftool/dbconfig/20250123-194024-marostegui.json
[19:41:21] <brennen>	 cdanis: yeah, pretty chill
[19:41:37] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10490388 (cmooney) Open→Resolved a:cmooney This is working now  {F58260515 width=700}
[19:43:31] <cdanis>	 brennen: cool, any objections to me sneaking in my backport right now?
[19:43:57] <brennen>	 go right ahead
[19:44:17] <wikibugs>	 (PS2) Jforrester: tracing: lowercase headers before processing them [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113856 (https://phabricator.wikimedia.org/T384629) (owner: CDanis)
[19:44:22] <brennen>	 there're one or two little things i'm keeping an eye on, but nothing throwing a high rate of errors.
[19:44:37] <wikibugs>	 (CR) Jforrester: "(Re-cherry-picked merely to inject the -x hash attribution.)" [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113856 (https://phabricator.wikimedia.org/T384629) (owner: CDanis)
[19:46:53] <cdanis>	 James_F: was that just a description edit?
[19:47:02] <cdanis>	 cccccbefecbvtkvnknlggcfdhgheuguuvihrueultngv
[19:47:06] <cdanis>	 sigh
[19:47:12] <James_F>	 cdanis: Yup, and hello to your YubiKey too.
[19:47:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cdanis@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113856 (https://phabricator.wikimedia.org/T384629) (owner: CDanis)
[19:55:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P72302 and previous config saved to /var/cache/conftool/dbconfig/20250123-195531-marostegui.json
[20:05:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:07:13] <wikibugs>	 (Merged) jenkins-bot: tracing: lowercase headers before processing them [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113856 (https://phabricator.wikimedia.org/T384629) (owner: CDanis)
[20:07:28] <logmsgbot>	 !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1113856|tracing: lowercase headers before processing them (T384629)]]
[20:07:33] <stashbot>	 T384629: Mediawiki OTel exports broken as of wmf.12 release - https://phabricator.wikimedia.org/T384629
[20:10:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T384592)', diff saved to https://phabricator.wikimedia.org/P72303 and previous config saved to /var/cache/conftool/dbconfig/20250123-201038-marostegui.json
[20:10:43] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[20:10:54] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[20:11:45] <logmsgbot>	 !log cdanis@deploy2002 cdanis: Backport for [[gerrit:1113856|tracing: lowercase headers before processing them (T384629)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:12:21] <logmsgbot>	 !log cdanis@deploy2002 cdanis: Continuing with sync
[20:19:53] <logmsgbot>	 !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113856|tracing: lowercase headers before processing them (T384629)]] (duration: 12m 25s)
[20:20:02] <cdanis>	 hooray
[20:27:20] <wikibugs>	 (PS1) Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225)
[20:29:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10490485 (phaultfinder)
[20:33:10] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:42:39] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2151.codfw.wmnet with reason: Maintenance
[20:42:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T384592)', diff saved to https://phabricator.wikimedia.org/P72304 and previous config saved to /var/cache/conftool/dbconfig/20250123-204245-marostegui.json
[20:42:50] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[20:46:38] <wikibugs>	 (PS2) CDanis: chart-renderer: new release (now w/ ECS) [deployment-charts] - https://gerrit.wikimedia.org/r/1113846 (https://phabricator.wikimedia.org/T383748)
[20:57:35] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10490551 (KFrancis) I'm processing the NDA now.  I'll confirm when it's complete.  Thanks!
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T2100).
[21:00:05] <jouncebot>	 cscott and cwhite: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:19] <cscott>	 i'm here!
[21:00:35] <cjming>	 o/
[21:00:39] <cjming>	 i can deploy
[21:00:54] <cjming>	 unless cscott you'd like to self-deploy?
[21:01:52] <cscott>	 no, i appreciate the help
[21:02:05] <cscott>	 i'd rather you deploy, i'm very rusty
[21:02:11] <cjming>	 np!
[21:03:13] <cjming>	 cscott: should this be rebased on 1.44.0-wmf.13 
[21:03:24] <cjming>	 ?
[21:03:43] <cscott>	 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1113834 is on the wmf.13 branch I think?
[21:03:52] <cscott>	 maybe i put the wrong patch on the deploy calendar?
[21:04:49] <cjming>	 i think that's right - i just usually try to rebase patches before scap backporting them - i think it's safe to rebase on top of wmf.13
[21:05:28] <cscott>	 yeah, should be safe to rebase.  we just cherry-picked this this morning, but i guess maybe other patches landed on wmf.13 since then.
[21:05:38] <wikibugs>	 (PS2) Subramanya Sastry: For Parsoid calls, treat preprocessing as starting in SOL state [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464)
[21:06:12] <cwhite>	 o/
[21:06:57] <cscott>	 cjming: yeah looks like some changes to /libs/telemetry landed on .13 but they should be completely independent of our parser patch.
[21:07:23] <cjming>	 cscott: 18 mins for CI and i see it failing
[21:07:45] <wikibugs>	 (PS1) Cathal Mooney: Force VFUK learnt routes over alternate transit in drmrs [homer/public] - https://gerrit.wikimedia.org/r/1113874
[21:08:23] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Force VFUK learnt routes over alternate transit in drmrs [homer/public] - https://gerrit.wikimedia.org/r/1113874 (owner: Cathal Mooney)
[21:08:57] <wikibugs>	 (Merged) jenkins-bot: Force VFUK learnt routes over alternate transit in drmrs [homer/public] - https://gerrit.wikimedia.org/r/1113874 (owner: Cathal Mooney)
[21:10:09] <cjming>	 cscott: this is just the rebase
[21:12:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253#10490568 (cmooney) Open→Resolved Gonna close this one for now, the balance is better with the changes we added and we can review as time goes on.
[21:13:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T384592)', diff saved to https://phabricator.wikimedia.org/P72305 and previous config saved to /var/cache/conftool/dbconfig/20250123-211306-marostegui.json
[21:13:11] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[21:14:17] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7001.magru.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd
[21:14:22] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490586 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7b39f587-684b-42ab-a96c-cf552c03a29d) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th...
[21:15:00] <cjming>	 cscott: not sure how to proceed - presumably rebase will not pass CI - are you ok with me deploying the next patch in the queue while things get sorted out with your patch?
[21:16:23] <cscott>	 it appears to be a transient failure in API tests in CI.  go ahead and deploy the next patch in the queue, i'll see if I can kick CI.
[21:16:47] <cjming>	 cool thanks
[21:16:58] <cjming>	 hi cwhite: i'll do your config patch now
[21:17:10] <cscott>	 ^ failure appears to have nothing to do with our patch, just a race condition. https://www.irccloud.com/pastebin/wYoKpq5X/
[21:17:43] <cwhite>	 Thank you!
[21:18:15] <wikibugs>	 (PS6) Krinkle: Profiler: centralize metrics send to a function [mediawiki-config] - https://gerrit.wikimedia.org/r/1081460 (owner: Cwhite)
[21:18:33] <wikibugs>	 (CR) CI reject: [V:-1] For Parsoid calls, treat preprocessing as starting in SOL state [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: Subramanya Sastry)
[21:18:57] <wikibugs>	 (CR) C. Scott Ananian: "recheck" [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: Subramanya Sastry)
[21:19:38] <cscott>	 on a /completely/ unrelated note, where can i formally submit a request that we have an OKR for reducing/eliminating spurious CI failures?
[21:19:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1081460 (owner: Cwhite)
[21:20:23] <cjming>	 ++ to that OKR
[21:20:30] <wikibugs>	 (Merged) jenkins-bot: Profiler: centralize metrics send to a function [mediawiki-config] - https://gerrit.wikimedia.org/r/1081460 (owner: Cwhite)
[21:20:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:20:46] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1081460|Profiler: centralize metrics send to a function]]
[21:22:30] <cscott>	 recheck looks successful, fingers crossed selenium doesn't crap out
[21:24:25] <cjming>	 cwhite: on test servers if verifiable - lmk if/when to sync
[21:25:01] <cwhite>	 checking
[21:25:08] <logmsgbot>	 !log cjming@deploy2002 cwhite, cjming: Backport for [[gerrit:1081460|Profiler: centralize metrics send to a function]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:28:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P72306 and previous config saved to /var/cache/conftool/dbconfig/20250123-212813-marostegui.json
[21:29:12] <cwhite>	 cjming: looks good to me, please feel free to continue
[21:29:26] <cjming>	 coo
[21:29:28] <cjming>	 cool
[21:29:32] <logmsgbot>	 !log cjming@deploy2002 cwhite, cjming: Continuing with sync
[21:33:09] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Manage fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10490638 (cmooney)
[21:35:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:36:14] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081460|Profiler: centralize metrics send to a function]] (duration: 15m 28s)
[21:36:17] <cjming>	 cwhite: should be live :)
[21:37:02] <cjming>	 cscott: looking good - it'll be another 18+ mins to merge it
[21:37:11] <cwhite>	 Thank you!
[21:37:34] <cscott>	 cjming: it's in post-build script now
[21:38:24] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490654 (cmooney) Fwiw I thought I saw a potential optimisation to allow us to go back to the "on change" style subscription.  gNMIc has a parameter that can be configu...
[21:38:28] <cscott>	 Finished: SUCCESS
[21:38:33] <cjming>	 yay!
[21:38:58] <cscott>	 (and boo for spurious CI failures and long CI times, but... sigh)
[21:39:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: Subramanya Sastry)
[21:43:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P72307 and previous config saved to /var/cache/conftool/dbconfig/20250123-214320-marostegui.json
[21:57:22] <wikibugs>	 (Merged) jenkins-bot: For Parsoid calls, treat preprocessing as starting in SOL state [core] (wmf/1.44.0-wmf.13) - https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: Subramanya Sastry)
[21:57:37] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1113834|For Parsoid calls, treat preprocessing as starting in SOL state (T382464)]]
[21:57:42] <stashbot>	 T382464: Parsoid's list parsing seems to ignore one leading newline in templates causing rendering differences - https://phabricator.wikimedia.org/T382464
[21:58:24] <cscott>	 cjming: looks like it merged. i'm here to test canaries.
[21:58:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T384592)', diff saved to https://phabricator.wikimedia.org/P72308 and previous config saved to /var/cache/conftool/dbconfig/20250123-215828-marostegui.json
[21:58:33] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2158.codfw.wmnet with reason: Maintenance
[21:58:33] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[21:58:49] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on db2187.codfw.wmnet with reason: Maintenance
[21:58:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T384592)', diff saved to https://phabricator.wikimedia.org/P72309 and previous config saved to /var/cache/conftool/dbconfig/20250123-215855-marostegui.json
[21:59:14] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on netflow1002.eqiad.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd
[21:59:21] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490691 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3f0feb1a-6c73-4906-bb5a-2df62eb7e156) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and th...
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T2200)
[22:01:08] <cscott>	 https://en.wikipedia.org/wiki/User:Cscott/T382464 is the smoke test for this patch
[22:01:22] <cjming>	 cscott: on test servers for verifying
[22:01:47] <cscott>	 testing
[22:02:06] <logmsgbot>	 !log cjming@deploy2002 ssastry, cjming: Backport for [[gerrit:1113834|For Parsoid calls, treat preprocessing as starting in SOL state (T382464)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:02:16] <cscott>	 cjming: looks good
[22:02:24] <cjming>	 great - syncing
[22:02:28] <logmsgbot>	 !log cjming@deploy2002 ssastry, cjming: Continuing with sync
[22:03:10] <subbu>	 cscott, confirmed .. it looks good.
[22:05:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:09:22] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113834|For Parsoid calls, treat preprocessing as starting in SOL state (T382464)]] (duration: 11m 45s)
[22:09:27] <stashbot>	 T382464: Parsoid's list parsing seems to ignore one leading newline in templates causing rendering differences - https://phabricator.wikimedia.org/T382464
[22:09:44] <cjming>	 cscott: should be live!
[22:10:24] <cjming>	 !log end of UTC late backport window
[22:10:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:10:54] <subbu>	 cjming, thanks!
[22:11:06] <cjming>	 subbu: yw!
[22:11:42] <icinga-wm>	 PROBLEM - Disk space on arclamp1001 is CRITICAL: DISK CRITICAL - free space: /srv 10476 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops
[22:11:48] <icinga-wm>	 PROBLEM - Disk space on arclamp2001 is CRITICAL: DISK CRITICAL - free space: /srv 10480 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops
[22:12:51] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device cloudsw2-d5-eqiad
[22:13:09] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw2-d5-eqiad
[22:20:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:27:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:29:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 20.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:30:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T384592)', diff saved to https://phabricator.wikimedia.org/P72310 and previous config saved to /var/cache/conftool/dbconfig/20250123-223057-marostegui.json
[22:31:02] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[22:34:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 19.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:46:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P72311 and previous config saved to /var/cache/conftool/dbconfig/20250123-224604-marostegui.json
[22:51:21] <brennen>	 cjming: thanks, as always, for handling backports
[22:51:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:01:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P72312 and previous config saved to /var/cache/conftool/dbconfig/20250123-230112-marostegui.json
[23:06:02] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490768 (cmooney) The current configuration we have requires us to enable [[ https://gnmic.openconfig.net/user_guide/caching/ | gnmic caching ]], as we group certain me...
[23:11:14] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490804 (cmooney) FWIW I used the config from P72314 in the most recent tests.  I'd tried to use some of the advice from [[ https://github.com/openconfig/gnmic/issues/4...
[23:12:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:16:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T384592)', diff saved to https://phabricator.wikimedia.org/P72315 and previous config saved to /var/cache/conftool/dbconfig/20250123-231619-marostegui.json
[23:16:25] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[23:16:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:16:35] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2169.codfw.wmnet with reason: Maintenance
[23:16:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T384592)', diff saved to https://phabricator.wikimedia.org/P72316 and previous config saved to /var/cache/conftool/dbconfig/20250123-231641-marostegui.json
[23:17:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:46:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T384592)', diff saved to https://phabricator.wikimedia.org/P72317 and previous config saved to /var/cache/conftool/dbconfig/20250123-234653-marostegui.json
[23:46:59] <stashbot>	 T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[23:48:50] <wikibugs>	 ops-esams, ops-magru, SRE, DC-Ops, Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10490884 (BCornwall) Hi, @RobH, thanks for doing this!  At first glance, nothing's improved. The inlet temps are acceptable at ~20° yet the CPUs are still hitting ~90°. O...