[00:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:10:15] PROBLEM - MariaDB Replica Lag: s1 on db1240 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 603.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:10:31] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 619.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:15:28] (03CR) 10Dzahn: [C:04-1] "I see! I think I missed those because I searched for gerrit.wikimedia.org. I looked again at the alertmanager yaml. You are right, there " [puppet] - 10https://gerrit.wikimedia.org/r/1113163 (owner: 10Jelto) [00:21:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10487049 (10phaultfinder) [00:35:30] (03PS1) 10Dzahn: alertmanager: add missing route for sre-collab-releng receiver [puppet] - 10https://gerrit.wikimedia.org/r/1113594 [00:36:27] (03PS1) 10Andrew Bogott: cephosd.cfg partman: reduce minimum partition sizes [puppet] - 10https://gerrit.wikimedia.org/r/1113595 (https://phabricator.wikimedia.org/T383817) [00:37:37] (03PS2) 10Andrew Bogott: cephosd.cfg partman: reduce minimum partition sizes [puppet] - 10https://gerrit.wikimedia.org/r/1113595 (https://phabricator.wikimedia.org/T383817) [00:38:04] (03PS2) 10Dzahn: alertmanager: add missing route for sre-collab-releng receiver [puppet] - 
10https://gerrit.wikimedia.org/r/1113594 [00:38:19] (03CR) 10Andrew Bogott: [C:03+2] cephosd.cfg partman: reduce minimum partition sizes [puppet] - 10https://gerrit.wikimedia.org/r/1113595 (https://phabricator.wikimedia.org/T383817) (owner: 10Andrew Bogott) [00:38:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1113596 [00:38:34] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1113596 (owner: 10TrainBranchBot) [00:41:49] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [00:42:13] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [00:42:18] (03CR) 10Dzahn: [C:04-1] "also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1113594 as an alternative suggestion to fix this - which would keep informin" [puppet] - 10https://gerrit.wikimedia.org/r/1113163 (owner: 10Jelto) [00:44:06] !log removing 1 file for legal compliance [00:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:35] !log removing 2 files for legal compliance [00:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:57:30] (03PS1) 10Andrew Bogott: cephosd.cfg partman: reduce minimum partition sizes, again [puppet] - 10https://gerrit.wikimedia.org/r/1113597 (https://phabricator.wikimedia.org/T383817) [00:58:10] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1113596 (owner: 10TrainBranchBot) [00:58:11] (03CR) 10Andrew Bogott: [C:03+2] cephosd.cfg partman: reduce minimum partition sizes, again [puppet] - 10https://gerrit.wikimedia.org/r/1113597 (https://phabricator.wikimedia.org/T383817) (owner: 10Andrew Bogott) [00:59:46] !log 
andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [01:00:23] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [01:06:11] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [01:06:38] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [01:08:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1113598 [01:08:22] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1113598 (owner: 10TrainBranchBot) [01:21:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:22:27] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:22:55] (03PS1) 10Andrew Bogott: squid_exporter: make http_proxy optional [puppet] - 10https://gerrit.wikimedia.org/r/1113599 [01:23:06] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1012.eqiad.wmnet with reason: host reimage [01:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10487134 (10phaultfinder) [01:26:29] (03CR) 10Andrew Bogott: "No big deal here, I just noticed this because puppet started failing last week on a deployment-prep VM. Why last week, no idea." 
[puppet] - 10https://gerrit.wikimedia.org/r/1113599 (owner: 10Andrew Bogott) [01:27:00] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1012.eqiad.wmnet with reason: host reimage [01:29:35] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1113598 (owner: 10TrainBranchBot) [01:30:43] (03CR) 10Eevans: "Of course!" [puppet] - 10https://gerrit.wikimedia.org/r/1113581 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [01:36:09] (03CR) 10Dzahn: [C:03+1] "back from 2020 https://gerrit.wikimedia.org/r/c/operations/puppet/+/579915" [puppet] - 10https://gerrit.wikimedia.org/r/1113599 (owner: 10Andrew Bogott) [01:37:35] (03CR) 10Dzahn: Prometheus Squid exporter, specify proxy port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [01:46:25] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/398a8379f919b36c3c30162c6ac61d37db0f3c5790eecdd4b618010ab98ee51e/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:49:27] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [01:50:00] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [02:02:31] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:06:25] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:06:50] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1012.eqiad.wmnet with reason: host reimage [02:08:17] RECOVERY - MariaDB Replica Lag: s1 on db1240 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:10:15] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1012.eqiad.wmnet with reason: host reimage [02:30:02] (03PS2) 10Andrew Bogott: squid_exporter: make http_proxy optional [puppet] - 10https://gerrit.wikimedia.org/r/1113599 [02:30:02] (03PS1) 10Andrew Bogott: Update nic names for cloudceph1012/bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1113601 [02:30:54] (03CR) 10Andrew Bogott: [C:03+2] Update nic names for cloudceph1012/bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1113601 (owner: 10Andrew Bogott) [02:35:39] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [02:38:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10487171 (10phaultfinder) [03:22:28] FIRING: SystemdUnitCrashLoop: logstash.service crashloop on elastic1088:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [03:27:28] FIRING: [4x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1074:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [03:32:28] RESOLVED: [4x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1074:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [04:05:31] FIRING: 
[2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:21:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10487273 (10phaultfinder) [05:53:17] RECOVERY - Host ripe-atlas-eqsin is UP: PING WARNING - Packet loss = 60%, RTA = 30.84 ms [05:59:41] PROBLEM - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [06:41:30] (03PS1) 10Marostegui: db2189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113718 (https://phabricator.wikimedia.org/T383709) [06:41:41] !log Powering off db2189 for onsite maintenance T383709 [06:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:46] T383709: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709 [06:42:11] (03CR) 10Marostegui: [C:03+2] db2189: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113718 (https://phabricator.wikimedia.org/T383709) (owner: 10Marostegui) [06:42:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2189.codfw.wmnet with reason: Onsite work [06:42:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2189 T383709', diff saved to https://phabricator.wikimedia.org/P72237 and previous config saved to /var/cache/conftool/dbconfig/20250123-064241-marostegui.json [06:50:03] (03CR) 10Marostegui: site.pp, db2134.yaml: db2134 (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1113482 (https://phabricator.wikimedia.org/T384476) (owner: 10Federico Ceratto) [06:55:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es1021.eqiad.wmnet [06:56:37] (03PS1) 10Marostegui: es1021: Remove [puppet] - 10https://gerrit.wikimedia.org/r/1113719 (https://phabricator.wikimedia.org/T384418) [06:58:28] (03CR) 10Marostegui: [C:03+2] es1021: Remove [puppet] - 10https://gerrit.wikimedia.org/r/1113719 (https://phabricator.wikimedia.org/T384418) (owner: 10Marostegui) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T0700) [07:00:05] marostegui and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T0700). [07:01:58] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [07:08:11] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [07:08:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [07:08:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:08:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1021.eqiad.wmnet [07:09:31] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1021.eqiad.wmnet - https://phabricator.wikimedia.org/T384418#10487690 (10Marostegui) a:05Marostegui→03None [07:09:41] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1021.eqiad.wmnet - 
https://phabricator.wikimedia.org/T384418#10487694 (10Marostegui) This is ready for #dc-ops [07:13:32] (03PS1) 10Marostegui: instances.yaml: Remove es1022 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1113721 (https://phabricator.wikimedia.org/T384566) [07:14:17] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es1022 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1113721 (https://phabricator.wikimedia.org/T384566) (owner: 10Marostegui) [07:15:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es1022 from dbctl T384566', diff saved to https://phabricator.wikimedia.org/P72239 and previous config saved to /var/cache/conftool/dbconfig/20250123-071529-root.json [07:15:34] T384566: decommission es1022.eqiad.wmnet - https://phabricator.wikimedia.org/T384566 [07:16:56] (03PS1) 10Marostegui: es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113722 (https://phabricator.wikimedia.org/T384566) [07:17:16] (03CR) 10CI reject: [V:04-1] es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113722 (https://phabricator.wikimedia.org/T384566) (owner: 10Marostegui) [07:17:37] (03PS2) 10Marostegui: es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113722 (https://phabricator.wikimedia.org/T384566) [07:28:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet [07:29:15] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [07:29:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10487743 (10ops-monitoring-bot) Draining ganeti2032.codfw.wmnet of running VMs [07:35:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc1 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P72240 and previous config saved to 
/var/cache/conftool/dbconfig/20250123-073557-marostegui.json [07:36:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2011.codfw.wmnet with reason: Kernel reboot [07:36:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1011.eqiad.wmnet with reason: Kernel reboot [07:39:43] (03CR) 10Jelto: [C:03+1] "lgtm" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113445 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [07:44:37] (03CR) 10Muehlenhoff: [C:03+2] Extend comment [puppet] - 10https://gerrit.wikimedia.org/r/1113487 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:47:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc1 after kernel reboots', diff saved to https://phabricator.wikimedia.org/P72241 and previous config saved to /var/cache/conftool/dbconfig/20250123-074759-marostegui.json [07:48:02] (03CR) 10Jelto: [C:03+1] "lgtm, `0.3.4` is the "old" version with coredns `1.8.7`." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113453 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [07:49:35] (03PS1) 10Muehlenhoff: Add stub secrets for new master_bookworm roles [labs/private] - 10https://gerrit.wikimedia.org/r/1113740 (https://phabricator.wikimedia.org/T381565) [07:50:03] (03PS2) 10Muehlenhoff: Add stub secrets for new master_bookworm roles [labs/private] - 10https://gerrit.wikimedia.org/r/1113740 (https://phabricator.wikimedia.org/T381565) [07:55:37] (03PS2) 10DCausse: cirrus: drop cirrus_saneitize_jobs periodic job (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/1113461 [07:55:37] (03PS1) 10DCausse: cirrus: drop cirrus_saneitize_jobs periodic job (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1113741 [08:00:04] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T0800). nyaa~ [08:00:04] No Gerrit patches in the queue for this window AFAICS. [08:00:29] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 13739MiB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [08:00:35] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add stub secrets for new master_bookworm roles [labs/private] - 10https://gerrit.wikimedia.org/r/1113740 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:03:21] !log installing glibc updates on bullseye [08:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:31] (03PS2) 10DCausse: wdqs: bump to 0.3.154 and enable event utilities APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113466 (https://phabricator.wikimedia.org/T374919) [08:19:24] (03PS1) 10AikoChou: ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113742 (https://phabricator.wikimedia.org/T384172) [08:21:26] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:21:56] (03PS3) 10DCausse: wdqs: bump to 0.3.154 and enable event utilities APIs (1/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113466 (https://phabricator.wikimedia.org/T374919) [08:21:56] (03PS1) 10DCausse: wdqs: bump 
to 0.3.154 and enable event utilities APIs (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113743 (https://phabricator.wikimedia.org/T374919) [08:21:58] (03PS1) 10DCausse: wdqs: bump to 0.3.154 and enable event utilities APIs (Step 3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113744 (https://phabricator.wikimedia.org/T374919) [08:22:45] (03PS2) 10DCausse: wdqs: bump to 0.3.154 and enable event utilities APIs (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113744 (https://phabricator.wikimedia.org/T374919) [08:25:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc2 for kernel reboots', diff saved to https://phabricator.wikimedia.org/P72242 and previous config saved to /var/cache/conftool/dbconfig/20250123-082545-marostegui.json [08:26:01] (03PS1) 10DCausse: wdqs: cleanup unused settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113745 (https://phabricator.wikimedia.org/T374919) [08:26:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1012.eqiad.wmnet with reason: Kernel reboot [08:26:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2012.codfw.wmnet with reason: Kernel reboot [08:28:03] (03PS9) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) [08:33:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:35:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc2 after kernel reboots', diff saved to https://phabricator.wikimedia.org/P72244 and previous config saved to /var/cache/conftool/dbconfig/20250123-083524-marostegui.json [08:46:05] (03CR) 10DCausse: [C:03+2] wdqs: bump to 0.3.154 and enable event utilities APIs (1/3) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1113466 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [08:47:24] (03Merged) 10jenkins-bot: wdqs: bump to 0.3.154 and enable event utilities APIs (1/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113466 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [08:48:09] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [08:49:20] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [08:50:06] (03CR) 10Filippo Giunchedi: [C:03+1] Add BGP data collection from network devices over GNMI [puppet] - 10https://gerrit.wikimedia.org/r/1113449 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [08:52:13] (03CR) 10Filippo Giunchedi: [C:03+1] librenms: Ensure the cache/data directory belongs to librenms [puppet] - 10https://gerrit.wikimedia.org/r/1113587 (https://phabricator.wikimedia.org/T384440) (owner: 10Andrea Denisse) [08:52:43] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [08:52:57] (03PS1) 10Muehlenhoff: profile::maps::osm_master: Make tilerator_pass optional [puppet] - 10https://gerrit.wikimedia.org/r/1113746 (https://phabricator.wikimedia.org/T381565) [08:53:33] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [08:57:27] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [08:57:59] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [09:00:15] brennen and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T0900). 
[09:09:32] (03CR) 10David Caro: [C:03+1] "LGTM, just the nit about the naming (feel free to ignore)" [alerts] - 10https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [09:17:20] (03CR) 10JMeybohm: [C:03+2] Pin coredns version on all clustes to 0.3.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113453 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:17:50] (03CR) 10DCausse: [C:03+2] wdqs: bump to 0.3.154 and enable event utilities APIs (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113743 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [09:20:03] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113722 (https://phabricator.wikimedia.org/T384566) (owner: 10Marostegui) [09:21:15] (03Merged) 10jenkins-bot: Pin coredns version on all clustes to 0.3.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113453 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:21:54] (03Merged) 10jenkins-bot: wdqs: bump to 0.3.154 and enable event utilities APIs (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113743 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [09:22:19] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [09:22:41] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [09:26:17] (03CR) 10Marostegui: [C:03+2] es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113722 (https://phabricator.wikimedia.org/T384566) (owner: 10Marostegui) [09:27:06] (03CR) 10Jelto: [C:03+1] "looks good to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113454 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:30:38] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp [09:32:41] (03CR) 
10Jelto: [C:03+1] "🍿" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:33:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113746 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:35:15] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_ulsfo [09:36:25] (03PS2) 10Federico Ceratto: site.pp, db2134.yaml: db2134 [puppet] - 10https://gerrit.wikimedia.org/r/1113482 (https://phabricator.wikimedia.org/T384476) [09:36:41] PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (206752s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [09:37:40] (03PS1) 10DCausse: cirrus: cleanup unused settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113750 (https://phabricator.wikimedia.org/T374702) [09:39:52] (03PS1) 10Vgutierrez: Revert^2 "hiera: Issue unified cert with pki.goog on acmechief-test" [puppet] - 10https://gerrit.wikimedia.org/r/1113751 [09:40:03] (03PS1) 10JMeybohm: Update cert-manager to 1.16.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113752 (https://phabricator.wikimedia.org/T341984) [09:43:00] (03CR) 10Vgutierrez: [C:03+2] Revert^2 "hiera: Issue unified cert with pki.goog on acmechief-test" [puppet] - 10https://gerrit.wikimedia.org/r/1113751 (owner: 10Vgutierrez) [09:45:06] 06SRE, 06Infrastructure-Foundations, 10netops: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10487931 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff 0.14.1 is out, I'll import and upgrade [09:45:32] (03CR) 10Marostegui: [C:03+1] "Looks good, remember to merge this AFTER the script has run" [puppet] - 10https://gerrit.wikimedia.org/r/1113482 (https://phabricator.wikimedia.org/T384476) 
(owner: 10Federico Ceratto) [09:45:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet [09:49:29] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: index wikitech [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113462 (owner: 10DCausse) [09:50:46] (03Merged) 10jenkins-bot: cirrus-streaming-updater: index wikitech [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113462 (owner: 10DCausse) [09:51:01] (03PS2) 10Btullis: Raise the weight of all analytics mariadb replica srv records [dns] - 10https://gerrit.wikimedia.org/r/1113505 (https://phabricator.wikimedia.org/T382947) [09:51:19] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [09:51:37] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:53:27] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp [09:53:50] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp [09:54:27] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_ulsfo [09:55:00] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113473 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:55:32] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2032.codfw.wmnet with reason: remove from cluster for reimage [09:55:38] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10487951 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=93df70a9-c65f-4aaf-8a3d-5ab698636ed0) set by jmm@cumin2002 for 1 day, 0:00:00 
on 1 host(... [09:57:50] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2032 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113465 (owner: 10Muehlenhoff) [10:01:47] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_magru [10:01:55] (03CR) 10Btullis: [C:03+2] Raise the weight of all analytics mariadb replica srv records [dns] - 10https://gerrit.wikimedia.org/r/1113505 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis) [10:02:19] !log btullis@dns1004 START - running authdns-update [10:04:10] !log btullis@dns1004 END - running authdns-update [10:05:13] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [10:05:22] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:08:48] !log installing routinator security updates [10:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:03] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [10:14:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2032.codfw.wmnet with OS bookworm [10:14:57] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10488006 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2032.codfw.wmnet with OS bookworm [10:15:37] (03PS2) 10JMeybohm: Update cert-manager to 1.16.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113752 (https://phabricator.wikimedia.org/T341984) [10:16:42] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[10:18:12] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp [10:19:58] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_magru [10:22:00] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:24:11] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:24:33] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:26:36] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ms-be2075.codfw.wmnet with reason: hardware broken awaiting vendor action [10:26:42] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:26:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10488032 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=62b3cb8f-dcae-4290-af1d-2a50d3785cb2) set by mvernon@cumin2002 for 7 days, 0:00:00 on 1 host(s) and t... 
[10:32:21] 06SRE, 10SRE-Access-Requests: Add kemayo to the deployment group - https://phabricator.wikimedia.org/T384493#10488038 (10jcrespo) 05Open→03Resolved a:05jcrespo→03CDanis [10:32:38] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2032.codfw.wmnet with reason: host reimage [10:32:58] (03PS6) 10Jcrespo: admin: Deploy WMDE privatedata policy change to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/1113420 (https://phabricator.wikimedia.org/T381824) [10:35:00] (03CR) 10Jelto: "Looks mostly good, I left some comments about the redundant name suffixes." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [10:36:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2032.codfw.wmnet with reason: host reimage [10:39:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113463 (owner: 10DCausse) [10:39:03] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [10:39:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113750 (https://phabricator.wikimedia.org/T374702) (owner: 10DCausse) [10:41:10] (03CR) 10Jcrespo: [C:03+2] "I saw noone objecting to both the patch and the docs, so merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/1113420 (https://phabricator.wikimedia.org/T381824) (owner: 10Jcrespo) [10:43:39] (03PS1) 10Urbanecm: Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113760 (https://phabricator.wikimedia.org/T384254) [10:43:50] (03PS1) 10Urbanecm: Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113761 (https://phabricator.wikimedia.org/T384254) [10:44:34] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.01.11 - 2025.01.31), 13Patch-For-Review: Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10488103 (10jcrespo) 05Open→03Resolved This is now applied. [10:46:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [10:46:23] jouncebot: nowandnext [10:46:23] For the next 0 hour(s) and 13 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T0900) [10:46:23] In 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1100) [10:46:38] I'm going to deploy a fix for a train blocker [10:46:50] (03CR) 10Urbanecm: [C:03+2] Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113760 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm) [10:46:54] (03CR) 10Urbanecm: [C:03+2] Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113761 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm) [10:46:56] FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:49:09] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10488137 (10jcrespo) @Neslihan_Turan_WMDE This is still blocked on you providing an email and your developer (Gerrit/IDM/LDAP) account id. [10:50:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113761 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm) [10:50:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113760 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm) [10:50:52] (03CR) 10Filippo Giunchedi: "FYI this is causing PuppetConstantChange alerts on k8s hosts due to" [puppet] - 10https://gerrit.wikimedia.org/r/1112782 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [10:51:26] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:51:56] RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:52:17] jayme: ^ FYI https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112782/comments/428c5bfb_d0cab2ae [10:56:18] !log jmm@cumin2002 
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2032.codfw.wmnet with OS bookworm [10:56:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10488187 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2032.codfw.wmnet with OS bookworm completed: - ganeti203... [10:57:52] !log pausing media backups on eqiad for maintenance T383902 [10:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:56] T383902: Upgrade backup source or mediabackup database host os to Debian bookworm or decommission them - https://phabricator.wikimedia.org/T383902 [11:00:14] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1100) [11:04:00] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1204.eqiad.wmnet with reason: os upgrade [11:04:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet [11:04:45] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1205.eqiad.wmnet with reason: os upgrade [11:06:41] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1204.eqiad.wmnet with OS bookworm [11:08:06] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_eqsin [11:09:19] (03PS2) 10FNegri: wmcs: update kernel alerts [alerts] - 10https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961) [11:09:25] (03Merged) 10jenkins-bot: Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113760 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm) [11:09:26] (03CR) 10CI reject: [V:04-1] Remove GEInfoboxTemplatesTest 
[extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113761 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm) [11:09:48] (03CR) 10FNegri: wmcs: update kernel alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [11:12:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet [11:13:00] (03CR) 10Urbanecm: [V:03+2 C:03+2] Remove GEInfoboxTemplatesTest [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113761 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm) [11:13:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113761 (https://phabricator.wikimedia.org/T384254) (owner: 10Urbanecm) [11:13:14] (03CR) 10Brouberol: [C:03+2] airflow-wmde: remove extra network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109926 (https://phabricator.wikimedia.org/T380613) (owner: 10Brouberol) [11:13:21] (03CR) 10Brouberol: [C:03+2] airflow: DRY extra volume mounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113198 (https://phabricator.wikimedia.org/T380619) (owner: 10Brouberol) [11:14:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2032.codfw.wmnet to cluster codfw and group B [11:14:12] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1113761|Remove GEInfoboxTemplatesTest (T384254)]], [[gerrit:1113760|Remove GEInfoboxTemplatesTest (T384254)]] [11:14:16] T384254: Beta cluster log spam: MediaWiki\Extension\CommunityConfiguration\Access\MediaWikiConfigReader was unable to find GEInfoboxTemplatesTest in community configuration, returning configuration from the fallback config - https://phabricator.wikimedia.org/T384254 [11:14:49] !log jmm@cumin2002 
END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2032.codfw.wmnet to cluster codfw and group B [11:17:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [11:18:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [11:19:43] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1113761|Remove GEInfoboxTemplatesTest (T384254)]], [[gerrit:1113760|Remove GEInfoboxTemplatesTest (T384254)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:19:48] T384254: Beta cluster log spam: MediaWiki\Extension\CommunityConfiguration\Access\MediaWikiConfigReader was unable to find GEInfoboxTemplatesTest in community configuration, returning configuration from the fallback config - https://phabricator.wikimedia.org/T384254 [11:23:30] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1204.eqiad.wmnet with reason: host reimage [11:25:23] (03PS1) 10Vgutierrez: secret: Add dummy pki.goog private key [labs/private] - 10https://gerrit.wikimedia.org/r/1113764 [11:25:41] (03CR) 10Vgutierrez: [V:03+2 C:03+2] secret: Add dummy pki.goog private key [labs/private] - 10https://gerrit.wikimedia.org/r/1113764 (owner: 10Vgutierrez) [11:25:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10488255 (10MoritzMuehlenhoff) [11:26:40] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1204.eqiad.wmnet with reason: host reimage [11:28:21] 06SRE, 06Infrastructure-Foundations, 10netops: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10488267 (10fgiunchedi) I'm assuming you meant "this won't be too hard", anyways the simplest solution off the top of my head would be to have a map network... 
[11:28:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet [11:28:48] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_eqsin [11:28:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10488268 (10ops-monitoring-bot) Draining ganeti2022.codfw.wmnet of running VMs [11:29:07] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113765 [11:30:23] (03PS1) 10Vgutierrez: hiera: Deploy pki.goog account on acmechief hosts [puppet] - 10https://gerrit.wikimedia.org/r/1113766 (https://phabricator.wikimedia.org/T384195) [11:31:22] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113766 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [11:31:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2136 T384479', diff saved to https://phabricator.wikimedia.org/P72247 and previous config saved to /var/cache/conftool/dbconfig/20250123-113157-fceratto.json [11:32:02] T384479: decommission db2136.codfw.wmnet - https://phabricator.wikimedia.org/T384479 [11:33:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet [11:33:49] (03CR) 10Zoe: [C:03+1] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113765 (owner: 10PipelineBot) [11:34:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet [11:34:12] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10488279 (10ops-monitoring-bot) Draining ganeti2022.codfw.wmnet of running VMs [11:34:42] !log vgutierrez@cumin1002 START - Cookbook 
sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_codfw [11:35:02] !log urbanecm@deploy2002 Sync cancelled. [11:35:45] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1113761|Remove GEInfoboxTemplatesTest (T384254)]], [[gerrit:1113760|Remove GEInfoboxTemplatesTest (T384254)]] [11:35:49] T384254: Beta cluster log spam: MediaWiki\Extension\CommunityConfiguration\Access\MediaWikiConfigReader was unable to find GEInfoboxTemplatesTest in community configuration, returning configuration from the fallback config - https://phabricator.wikimedia.org/T384254 [11:37:40] !log upload acme-chief 0.38 to apt.wm.org (bookworm-wikimedia) [11:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:26] (03PS1) 10Vgutierrez: hiera: Issue unified cert using pki.goog [puppet] - 10https://gerrit.wikimedia.org/r/1113768 (https://phabricator.wikimedia.org/T384195) [11:47:02] (03CR) 10David Caro: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [11:47:24] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [11:47:37] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [11:48:22] (03CR) 10David Caro: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [11:48:58] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113766 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [11:49:42] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1204.eqiad.wmnet with OS bookworm [11:51:02] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - 
https://phabricator.wikimedia.org/T384017#10488310 (10Neslihan_Turan_WMDE) Hi, yesterday a problem about my Wikitech account has been fixed. I think now we should be able to proceed. My WMDE email adress is neslihan.turan@wiki... [11:51:09] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113761|Remove GEInfoboxTemplatesTest (T384254)]], [[gerrit:1113760|Remove GEInfoboxTemplatesTest (T384254)]] (duration: 15m 23s) [11:51:13] T384254: Beta cluster log spam: MediaWiki\Extension\CommunityConfiguration\Access\MediaWikiConfigReader was unable to find GEInfoboxTemplatesTest in community configuration, returning configuration from the fallback config - https://phabricator.wikimedia.org/T384254 [11:51:16] finally [11:51:22] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113768 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [11:51:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10488314 (10kamila) [11:52:40] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_codfw [11:53:50] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_drmrs [11:54:39] (03PS1) 10Muehlenhoff: Switch ganeti2022 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113769 [11:59:12] (03PS1) 10Kamila Součková: wikikube: rename parse100[1-6] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113770 (https://phabricator.wikimedia.org/T365571) [12:00:19] !log fceratto@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2134.codfw.wmnet [12:05:14] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [12:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:49] there's a pending change in DNS for wmf6779 https://phabricator.wikimedia.org/P72248 [12:11:38] (03PS3) 10Btullis: dumps: Configure snapshot1012 with the dumps trait [puppet] - 10https://gerrit.wikimedia.org/r/1113475 (https://phabricator.wikimedia.org/T382947) [12:12:54] !log fceratto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2134.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002" [12:13:25] !log fceratto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2134.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1002" [12:13:25] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:13:25] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2134.codfw.wmnet [12:13:36] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_drmrs [12:14:07] (03CR) 10Federico Ceratto: [C:03+1] site.pp, db2134.yaml: db2134 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113482 (https://phabricator.wikimedia.org/T384476) (owner: 10Federico Ceratto) [12:14:22] (03CR) 10Federico Ceratto: [C:03+1] "decommission script ran, merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/1113482 (https://phabricator.wikimedia.org/T384476) (owner: 10Federico Ceratto) [12:14:25] (03CR) 10Federico Ceratto: [C:03+2] site.pp, db2134.yaml: db2134 [puppet] - 10https://gerrit.wikimedia.org/r/1113482 (https://phabricator.wikimedia.org/T384476) (owner: 10Federico Ceratto) [12:14:37] (03CR) 10Cathal Mooney: [C:03+2] Add BGP data collection from network devices over GNMI [puppet] - 10https://gerrit.wikimedia.org/r/1113449 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [12:15:59] (03CR) 10Hnowlan: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113765 (owner: 10PipelineBot) [12:16:18] (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113768 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [12:16:49] (03CR) 10Fabfur: [C:03+1] hiera: Deploy pki.goog account on acmechief hosts [puppet] - 10https://gerrit.wikimedia.org/r/1113766 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [12:17:10] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113765 (owner: 10PipelineBot) [12:17:12] !log Removing db2134 from zarcillo T384476 [12:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:17] T384476: decommission db2134.codfw.wmnet - https://phabricator.wikimedia.org/T384476 [12:18:38] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2134.codfw.wmnet - https://phabricator.wikimedia.org/T384476#10488500 (10FCeratto-WMF) 05In progress→03Open [12:19:25] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2134.codfw.wmnet - https://phabricator.wikimedia.org/T384476#10488505 (10FCeratto-WMF) [12:19:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:19:51] 10ops-codfw, 06DC-Ops, 
10decommission-hardware: decommission db2134.codfw.wmnet - https://phabricator.wikimedia.org/T384476#10488511 (10FCeratto-WMF) Ready for DC ops to decommission [12:19:51] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:41] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:21:38] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:21:51] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:23:46] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:23:54] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:24:21] (03PS1) 10Giuseppe Lavagetto: DBRecordCache: handle default section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113788 (https://phabricator.wikimedia.org/T382947) [12:25:00] (03CR) 10CI reject: [V:04-1] DBRecordCache: handle default section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113788 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [12:26:13] 06SRE, 06Infrastructure-Foundations, 10netops: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10488524 (10cmooney) >>! In T384345#10488267, @fgiunchedi wrote: > I'm assuming you meant "this won't be too hard", anyways the simplest solution off the top... 
[12:28:39] 06SRE, 06Infrastructure-Foundations, 10netops: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10488529 (10cmooney) [12:28:40] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488530 (10cmooney) [12:28:46] (03PS2) 10Giuseppe Lavagetto: DBRecordCache: handle default section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113788 (https://phabricator.wikimedia.org/T382947) [12:31:12] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10488538 (10jcrespo) Thank you. @KFrancis you have their provided email above: neslihan.turan@wikimedia.de @Neslihan_Turan_WMDE Please note the uid identifier associated with that em... [12:31:51] !log Deploy schema change on s8 codfw with replication dbmaint T384592 [12:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:56] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [12:34:55] (03CR) 10Ladsgroup: [C:03+1] DBRecordCache: handle default section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113788 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [12:36:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:37:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:37:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T384592)', diff saved to https://phabricator.wikimedia.org/P72249 and previous config saved to /var/cache/conftool/dbconfig/20250123-123708-marostegui.json [12:37:13] T384592: 
Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [12:39:14] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1113770 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [12:41:16] !log restarting gnmic.service on netflow1002 [12:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:31] FIRING: [2x] Emergency syslog message: Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [12:46:42] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:49:31] RESOLVED: [2x] Emergency syslog message: Device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [12:51:21] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on netflow1002.eqiad.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [12:51:28] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488586 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a6b392ba-8b36-4fa0-8d3d-10c8b2d2eb48) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and th... 
[12:51:59] (03CR) 10Brouberol: [C:03+1] dumps: Configure snapshot1012 with the dumps trait (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113475 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1300) [13:02:20] (03PS1) 10Ladsgroup: file: Add caller to write queries [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113799 (https://phabricator.wikimedia.org/T384481) [13:02:36] jouncebot: nowandnext [13:02:36] For the next 0 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1300) [13:02:36] In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1400) [13:02:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72250 and previous config saved to /var/cache/conftool/dbconfig/20250123-130253-root.json [13:03:04] is there any deployment for Mobileapps/RESTBase/Wikifeeds happening? [13:03:10] (03CR) 10Ladsgroup: [C:03+2] file: Add caller to write queries [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113799 (https://phabricator.wikimedia.org/T384481) (owner: 10Ladsgroup) [13:03:14] (03PS2) 10Muehlenhoff: sre.hosts.reimage: Add link to the help text for move-vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/1112171 [13:04:14] PROBLEM - MariaDB Replica SQL: s2 #page on db1222 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: ptwiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:04:23] ^ taking it [13:04:28] thanks [13:04:47] (03PS1) 10JMeybohm: Pin cert-manager version on all clustes to 1.10.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113800 (https://phabricator.wikimedia.org/T341984) [13:04:49] (03PS1) 10JMeybohm: Update cert-manager to 1.16.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113801 [13:04:50] !incidents [13:04:51] 5626 (UNACKED) db1222 (paged)/MariaDB Replica SQL: s2 (paged) [13:04:51] 5625 (RESOLVED) ProbeDown sre (185.15.59.225 ip4 text:80 probes/service http_text_ip4 esams) [13:04:56] !ack 5626 [13:04:56] 5626 (ACKED) db1222 (paged)/MariaDB Replica SQL: s2 (paged) [13:05:02] marostegui: thanks <3 [13:05:30] marostegui: tx [13:05:33] (should I resolve the p.age for that?) [13:05:33] This is eqiad master, I will fix it to restart replication and schedule a master switch [13:05:38] Emperor: please go ahead yes [13:05:43] !resolve 5626 [13:05:43] 5626 (RESOLVED) db1222 (paged)/MariaDB Replica SQL: s2 (paged) [13:05:46] the recovery should be arriving in a bit [13:06:14] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host parse[1001-1006].eqiad.wmnet [13:06:15] (03CR) 10Kamila Součková: [C:03+2] wikikube: rename parse100[1-6] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113770 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [13:06:36] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1113802 (https://phabricator.wikimedia.org/T384597) [13:07:14] RECOVERY - MariaDB Replica SQL: s2 #page on db1222 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:07:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Index [13:08:03] 
there is lag on another s2 host: db1182 [13:08:12] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10488665 (10Jhancock.wm) a:03Jhancock.wm [13:08:19] all of them are lagging jynus, as it was the intermediate master [13:08:25] should recover soon [13:08:27] I get it now [13:08:35] I downtimed it for 1h though [13:08:39] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383862#10488671 (10Jhancock.wm) a:03Jhancock.wm [13:08:40] So it doesn't bother oncall [13:08:52] https://phabricator.wikimedia.org/T384597 task for the switchover [13:09:04] I'm around if you need me for anything [13:09:15] nah it is all good Amir1 [13:09:26] I am worried that prometheus didn't get that lag [13:09:34] only icinga, so there is a regression there [13:09:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1162 T384597', diff saved to https://phabricator.wikimedia.org/P72251 and previous config saved to /var/cache/conftool/dbconfig/20250123-130937-marostegui.json [13:09:42] T384597: Switchover s2 master (db1222 -> db1162) - https://phabricator.wikimedia.org/T384597 [13:09:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host parse[1001-1006].eqiad.wmnet [13:10:18] jouncebot: nowandnext [13:10:18] For the next 0 hour(s) and 49 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1300) [13:10:19] In 0 hour(s) and 49 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1400) [13:10:36] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1001 to wikikube-worker1142 [13:10:56] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:11:16] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: 
Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10488675 (10Jhancock.wm) [13:11:26] (03CR) 10CI reject: [V:04-1] sre.hosts.reimage: Add link to the help text for move-vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/1112171 (owner: 10Muehlenhoff) [13:11:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:12:18] (03CR) 10DCausse: [C:03+2] wdqs: bump to 0.3.154 and enable event utilities APIs (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113744 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [13:13:18] lag in eqiad s2 all good now [13:13:24] Currently rebooting the candidate master [13:13:33] To get it ready to become a dc master soonish [13:13:59] (03Merged) 10jenkins-bot: wdqs: bump to 0.3.154 and enable event utilities APIs (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113744 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [13:14:24] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [13:14:34] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1001 to wikikube-worker1142 - kamila@cumin1002" [13:14:44] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1002 to wikikube-worker1143 [13:14:48] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device fasw2-c1a-eqiad [13:14:48] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-c1a-eqiad [13:14:49] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [13:14:49] !log kamila@cumin1002 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1001 to wikikube-worker1142 - kamila@cumin1002" [13:14:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:14:50] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1142 [13:15:05] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:15:31] (03PS2) 10Elukey: drivers.py: add container_limits to the Docker driver [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1113477 [13:15:41] (03CR) 10CI reject: [V:04-1] file: Add caller to write queries [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113799 (https://phabricator.wikimedia.org/T384481) (owner: 10Ladsgroup) [13:15:46] (03CR) 10Elukey: drivers.py: add container_limits to the Docker driver (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1113477 (owner: 10Elukey) [13:16:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1142 [13:16:25] (03CR) 10Elukey: [C:03+1] profile::maps::osm_master: Make tilerator_pass optional [puppet] - 10https://gerrit.wikimedia.org/r/1113746 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:16:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1001 to wikikube-worker1142 [13:17:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1162.eqiad.wmnet with reason: Index rebuild [13:17:15] (03CR) 10Elukey: [C:03+1] benthos: add nocookies and tls session metadata [puppet] - 10https://gerrit.wikimedia.org/r/1112248 (https://phabricator.wikimedia.org/T383900) (owner: 10Filippo Giunchedi) [13:17:52] (03PS1) 10Marostegui: Revert "db2166: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1113804 [13:17:59] !log marostegui@cumin1002 
dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72252 and previous config saved to /var/cache/conftool/dbconfig/20250123-131758-root.json [13:18:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72253 and previous config saved to /var/cache/conftool/dbconfig/20250123-131805-root.json [13:18:26] (03CR) 10Marostegui: [C:03+2] Revert "db2166: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1113804 (owner: 10Marostegui) [13:18:40] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1002 to wikikube-worker1143 - kamila@cumin1002" [13:18:54] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1003 to wikikube-worker1144 [13:18:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1002 to wikikube-worker1143 - kamila@cumin1002" [13:18:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:18:59] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1143 [13:19:14] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:19:48] (03CR) 10Elukey: [C:03+1] Update istio to 1.24.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113507 (https://phabricator.wikimedia.org/T373526) (owner: 10JMeybohm) [13:20:04] (03Merged) 10jenkins-bot: file: Add caller to write queries [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113799 (https://phabricator.wikimedia.org/T384481) (owner: 10Ladsgroup) [13:20:07] (03CR) 10Btullis: [V:03+1 C:03+2] dumps: Configure snapshot1012 with the dumps trait [puppet] - 10https://gerrit.wikimedia.org/r/1113475 
(https://phabricator.wikimedia.org/T382947) (owner: 10Btullis) [13:20:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1143 [13:20:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1002 to wikikube-worker1143 [13:20:59] (03PS1) 10Elukey: mapnik: skip copying mapnik files to /usr/local [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113805 (https://phabricator.wikimedia.org/T384285) [13:21:33] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1113799|file: Add caller to write queries (T384481)]] [13:21:37] T384481: Set new file tables to write both in production - https://phabricator.wikimedia.org/T384481 [13:21:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:23:24] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1003 to wikikube-worker1144 - kamila@cumin1002" [13:23:36] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1004 to wikikube-worker1145 [13:23:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1003 to wikikube-worker1144 - kamila@cumin1002" [13:23:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:23:42] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1144 [13:23:56] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:24:35] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1113799|file: Add caller to write queries (T384481)]] 
synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:24:39] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [13:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10488719 (10phaultfinder) [13:24:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1144 [13:25:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1003 to wikikube-worker1144 [13:26:25] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10488724 (10Neslihan_Turan_WMDE) Yes, that's me @jcrespo [13:27:18] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10488725 (10jcrespo) [13:28:15] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1004 to wikikube-worker1145 - kamila@cumin1002" [13:28:39] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1005 to wikikube-worker1146 [13:28:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1004 to wikikube-worker1145 - kamila@cumin1002" [13:28:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:28:43] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1145 [13:28:59] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:29:30] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10488746 (10jcrespo) Thank you, now only waiting on NDA to be filled in and we can apply the privilege change. 
I am sorry to hear you had problems with Wikitech, apparently the migrat... [13:29:33] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow1002.eqiad.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [13:29:38] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488748 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f0f61f83-b1f7-48c8-9e4a-2e436917a7d3) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th... [13:30:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1145 [13:30:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1004 to wikikube-worker1145 [13:31:17] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113799|file: Add caller to write queries (T384481)]] (duration: 09m 43s) [13:31:21] T384481: Set new file tables to write both in production - https://phabricator.wikimedia.org/T384481 [13:31:50] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_esams [13:33:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72255 and previous config saved to /var/cache/conftool/dbconfig/20250123-133304-root.json [13:33:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72256 and previous config saved to /var/cache/conftool/dbconfig/20250123-133311-root.json [13:36:44] (03CR) 10Jelto: [C:03+1] "lgtm" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113507 (https://phabricator.wikimedia.org/T373526) (owner: 10JMeybohm) [13:37:19] FIRING: [2x] ProbeDown: Service 
ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:26] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1005 to wikikube-worker1146 - kamila@cumin1002" [13:38:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1005 to wikikube-worker1146 - kamila@cumin1002" [13:38:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:38:28] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1146 [13:38:57] FYI dcausse and others, I will not be able to do the backport+config window today, sorry [13:39:23] * TheresNoTime can do! 
[13:39:23] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:02] (03CR) 10Andrew Bogott: [C:03+2] squid_exporter: make http_proxy optional [puppet] - 10https://gerrit.wikimedia.org/r/1113599 (owner: 10Andrew Bogott) [13:41:03] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from parse1006 to wikikube-worker1147 [13:41:16] !log bounce mtail on centrallog2002 - high system cpu usage and perf top reports native_queued_spin_lock_slowpath [13:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:24] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:41:31] (03PS1) 10Federico Ceratto: instances.yaml, db2136.yaml, site.pp: Remove db2136 [puppet] - 10https://gerrit.wikimedia.org/r/1113807 (https://phabricator.wikimedia.org/T384479) [13:42:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1146 [13:42:03] (03CR) 10Andrew Bogott: [C:03+2] deployment-prep hiera: remove uses of .eqiad.wmflabs tld [puppet] - 10https://gerrit.wikimedia.org/r/1113468 (https://phabricator.wikimedia.org/T380679) (owner: 10Andrew Bogott) [13:42:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1005 to wikikube-worker1146 [13:43:22] (03PS2) 10AikoChou: ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113742 (https://phabricator.wikimedia.org/T384172) [13:44:05] (03CR) 10Vgutierrez: [C:03+2] hiera: Deploy pki.goog account on acmechief hosts [puppet] - 10https://gerrit.wikimedia.org/r/1113766 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [13:46:32] !log 
jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1239.eqiad.wmnet with reason: reimage [13:47:10] (03CR) 10Muehlenhoff: [C:03+2] profile::maps::osm_master: Make tilerator_pass optional [puppet] - 10https://gerrit.wikimedia.org/r/1113746 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:48:23] (03CR) 10Muehlenhoff: [C:03+2] Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:48:32] (03PS10) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) [13:48:40] (03CR) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:48:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:48:49] (03CR) 10Marostegui: [C:03+1] "Remember that once this is merged, you'll have to go to any cumin host and commit the change." 
[puppet] - 10https://gerrit.wikimedia.org/r/1113807 (https://phabricator.wikimedia.org/T384479) (owner: 10Federico Ceratto) [13:49:31] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_esams [13:49:56] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-restart-tcp-mss-clamper rolling restart_daemons on A:cp-text_eqiad [13:50:44] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1006 to wikikube-worker1147 - kamila@cumin1002" [13:50:54] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1239.eqiad.wmnet with OS bookworm [13:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:52:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming parse1006 to wikikube-worker1147 - kamila@cumin1002" [13:52:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:52:16] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1147 [13:52:51] federico3: There is a change from db2140 waiting to be merged in dbctl [13:53:23] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:53:36] federico3: I assume it is yours, for the decomm of db2140? 
[13:53:40] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113805 (https://phabricator.wikimedia.org/T384285) (owner: 10Elukey) [13:53:46] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:54:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1147 [13:55:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from parse1006 to wikikube-worker1147 [13:56:14] (03CR) 10Effie Mouzeli: [C:03+1] Enroll 1% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113567 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [13:56:23] (03CR) 10Effie Mouzeli: [C:03+1] Enroll 0.1% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113566 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [13:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:56:50] interesting, I suppose it could be a timeout due to the cookbook waiting for confirmation... maybe? 
[13:56:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool db2140 T384480', diff saved to https://phabricator.wikimedia.org/P72257 and previous config saved to /var/cache/conftool/dbconfig/20250123-135655-fceratto.json [13:56:59] T384480: decommission db2140.codfw.wmnet - https://phabricator.wikimedia.org/T384480 [13:57:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72258 and previous config saved to /var/cache/conftool/dbconfig/20250123-135704-root.json [13:57:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72259 and previous config saved to /var/cache/conftool/dbconfig/20250123-135704-root.json [13:57:30] federico3: which cookbook? [13:57:55] depooling db2140 [13:58:01] federico3: My guess is that you did dbctl instance db2140 depool but didn't issue the dbctl config commit -m "blablabl" to commit the change [13:58:07] (03PS1) 10Muehlenhoff: osm_master: Provide a dummy variable for tilerator on bookworm roles [puppet] - 10https://gerrit.wikimedia.org/r/1113810 (https://phabricator.wikimedia.org/T381565) [13:58:09] So the change is pending to be committed [13:58:24] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:58:34] federico3: If a change isn't committed, it will block all the other pending changes [13:58:46] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:58:49] (03CR) 10JMeybohm: [C:03+2] "Oops. I'll clean that up manually - it's only staging-codfw hosts that have files in there. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1112782 (https://phabricator.wikimedia.org/T365687) (owner: 10JMeybohm) [13:59:40] I started the depooling cookbook, then it asked me for final confirmation "Enter y or yes to confirm:" and I was checking with you [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1400). [14:00:05] dcausse: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:15] federico3: yeah, I guess it went above the threshold of the alert [14:00:24] and that's why it fired [14:00:44] o/ can deploy [14:00:52] o/ [14:01:05] (03CR) 10Jelto: "thanks for adding the new team! One comment on-line regarding the different receivers." [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn) [14:01:41] (03CR) 10Kevin Bazira: [C:03+1] ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113742 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou) [14:03:30] dcausse: `Change '1113462', project 'operations/deployment-charts', branch 'master' not found in any deployed wikiversion. Deployed wikiversions: ['1.44.0-wmf.12', '1.44.0-wmf.13']` — issue with the "depends-on" ? 
[14:03:53] TheresNoTime: looking [14:05:14] https://phabricator.wikimedia.org/P72260 for ref (also bug with the 'N' selection, heh) [14:05:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool db1165', diff saved to https://phabricator.wikimedia.org/P72261 and previous config saved to /var/cache/conftool/dbconfig/20250123-140524-marostegui.json [14:05:29] TheresNoTime: did not know that scap would complain if Depends-On was on a non MW repo [14:05:33] will remove [14:05:46] ack, haven't seen that before personally! [14:06:30] (03PS2) 10DCausse: cirrus: stop writing to wikitech index from the MW JobQueue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113463 [14:06:30] (03PS2) 10DCausse: cirrus: cleanup unused settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113750 (https://phabricator.wikimedia.org/T374702) [14:06:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1165.eqiad.wmnet with reason: Maintenance [14:06:42] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:06:44] TheresNoTime: uploaded new ones, and please feel free to deploy both of them at once [14:06:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T384592)', diff saved to https://phabricator.wikimedia.org/P72262 and previous config saved to /var/cache/conftool/dbconfig/20250123-140649-marostegui.json [14:06:53] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [14:07:25] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-restart-tcp-mss-clamper (exit_code=0) rolling restart_daemons on A:cp-text_eqiad [14:07:38] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1239.eqiad.wmnet with reason: host reimage [14:07:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113463 (owner: 10DCausse) [14:07:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113750 (https://phabricator.wikimedia.org/T374702) (owner: 10DCausse) [14:08:34] (03CR) 10Klausman: [C:03+1] slo_template: update SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1113212 (owner: 10BCornwall) [14:08:37] (03Merged) 10jenkins-bot: cirrus: stop writing to wikitech index from the MW JobQueue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113463 (owner: 10DCausse) [14:08:40] (03Merged) 10jenkins-bot: cirrus: cleanup unused settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113750 (https://phabricator.wikimedia.org/T374702) (owner: 10DCausse) [14:09:11] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1113463|cirrus: stop writing to wikitech index from the MW JobQueue]], [[gerrit:1113750|cirrus: cleanup unused settings (T374702)]] [14:09:15] T374702: Cleanup: Remove deprecated weighted tag methods - https://phabricator.wikimedia.org/T374702 [14:09:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T384592)', diff saved to https://phabricator.wikimedia.org/P72263 and previous config saved to /var/cache/conftool/dbconfig/20250123-140957-marostegui.json [14:10:34] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1239.eqiad.wmnet with reason: host reimage [14:12:07] (03CR) 10Dzahn: alertmanager: add missing route for sre-collab-releng receiver (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn) [14:12:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72264 and previous config saved to 
/var/cache/conftool/dbconfig/20250123-141209-root.json [14:12:58] (03PS1) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) [14:12:59] TheresNoTime: when deploying there might be few warnings in the log "Received {$jobType} job with {$updateGroup} updates for an unwritable cluster $cluster." these are expected and can be ignored [14:13:26] ack [14:13:48] !log samtar@deploy2002 dcausse, samtar: Backport for [[gerrit:1113463|cirrus: stop writing to wikitech index from the MW JobQueue]], [[gerrit:1113750|cirrus: cleanup unused settings (T374702)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:13:49] but I suspect there won't be that much, it would be only pending writes from wikitech which I suspect is not that many [14:14:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108428 (https://phabricator.wikimedia.org/T351452) (owner: 10Muehlenhoff) [14:14:08] dcausse: anything you need to test further? ^ [14:14:12] (03CR) 10CI reject: [V:04-1] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [14:14:12] TheresNoTime: this can't be tested on test servers [14:14:13] (03PS2) 10Muehlenhoff: Remove obsolete puppetmaster::standalone role [puppet] - 10https://gerrit.wikimedia.org/r/1108428 (https://phabricator.wikimedia.org/T351452) [14:14:19] !log samtar@deploy2002 dcausse, samtar: Continuing with sync [14:16:04] (03PS1) 10Cathal Mooney: Revert "Add BGP data collection from network devices over GNMI" This reverts commit a8bc5da977f0de2aa87e0060b40df3197240189c. 
[puppet] - 10https://gerrit.wikimedia.org/r/1113812 [14:16:26] (03CR) 10CI reject: [V:04-1] Revert "Add BGP data collection from network devices over GNMI" This reverts commit a8bc5da977f0de2aa87e0060b40df3197240189c. [puppet] - 10https://gerrit.wikimedia.org/r/1113812 (owner: 10Cathal Mooney) [14:18:34] (03CR) 10FNegri: [C:03+2] prometheus-node-kernel-panic: remove "absent" lines [puppet] - 10https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [14:20:00] (03PS2) 10Cathal Mooney: Revert "Add BGP data collection from network devices over GNMI" [puppet] - 10https://gerrit.wikimedia.org/r/1113812 [14:20:27] (03CR) 10FNegri: [C:03+2] prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [14:20:30] (03CR) 10Cathal Mooney: [C:03+2] Revert "Add BGP data collection from network devices over GNMI" [puppet] - 10https://gerrit.wikimedia.org/r/1113812 (owner: 10Cathal Mooney) [14:20:36] (03CR) 10FNegri: prometheus-node-kernel-panic: remove "absent" lines [puppet] - 10https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [14:20:48] (03CR) 10FNegri: [C:03+2] prometheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [14:21:11] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113463|cirrus: stop writing to wikitech index from the MW JobQueue]], [[gerrit:1113750|cirrus: cleanup unused settings (T374702)]] (duration: 12m 00s) [14:21:16] T374702: Cleanup: Remove deprecated weighted tag methods - https://phabricator.wikimedia.org/T374702 [14:21:23] dcausse: live :) [14:21:29] TheresNoTime: thanks! 
:) [14:21:49] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1142.eqiad.wmnet wikikube-worker1143.eqiad.wmnet wikikube-worker1144.eqiad.wmnet wikikube-worker1145.eqiad.wmnet wikikube-worker1146.eqiad.wmnet wikikube-worker1147.eqiad.wmnet on all recursors [14:21:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1142.eqiad.wmnet wikikube-worker1143.eqiad.wmnet wikikube-worker1144.eqiad.wmnet wikikube-worker1145.eqiad.wmnet wikikube-worker1146.eqiad.wmnet wikikube-worker1147.eqiad.wmnet on all recursors [14:21:59] (03PS2) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) [14:22:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:23:13] (03CR) 10CI reject: [V:04-1] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [14:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10488868 (10phaultfinder) [14:25:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P72266 and previous config saved to /var/cache/conftool/dbconfig/20250123-142504-marostegui.json [14:26:07] (03CR) 10Elukey: [V:03+2 C:03+2] mapnik: skip copying mapnik files to /usr/local [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113805 (https://phabricator.wikimedia.org/T384285) (owner: 10Elukey) [14:26:19] !log UTC afternoon backport window done [14:26:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:27] (03CR) 10Jelto: [C:03+1] "this looks good and should add the missing team and receiver, minor concerns in-line." [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn) [14:26:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108428 (https://phabricator.wikimedia.org/T351452) (owner: 10Muehlenhoff) [14:27:01] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1142.eqiad.wmnet with OS bookworm [14:27:04] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1142 [14:27:04] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1142 [14:27:14] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1143.eqiad.wmnet with OS bookworm [14:27:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72267 and previous config saved to /var/cache/conftool/dbconfig/20250123-142715-root.json [14:27:17] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1143 [14:27:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1143 [14:27:24] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1144.eqiad.wmnet with OS bookworm [14:27:28] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1144 [14:27:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1144 [14:27:31] (03CR) 10Elukey: [C:03+1] osm_master: Provide a dummy variable for tilerator on bookworm roles [puppet] - 10https://gerrit.wikimedia.org/r/1113810 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:27:33] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker1145.eqiad.wmnet with OS bookworm [14:27:36] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1145 [14:27:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1145 [14:27:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:27:49] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1146.eqiad.wmnet with OS bookworm [14:27:52] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1146 [14:27:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1146 [14:28:03] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1147.eqiad.wmnet with OS bookworm [14:28:06] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1147 [14:28:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1147 [14:28:44] (03PS1) 10FNegri: base::cloud_production: fix dep name [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) [14:28:48] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS6460 [14:28:48] Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:48] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS6 [14:28:48] 6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:58] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [14:30:49] (03PS2) 10JMeybohm: Update cert-manager to 1.16.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113801 [14:31:16] (03CR) 10AikoChou: [C:03+2] ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113742 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou) [14:31:44] (03PS2) 10FNegri: base::cloud_production: fix dep name [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) [14:31:51] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [14:32:23] (03Merged) 10jenkins-bot: ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113742 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou) [14:33:22] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1239.eqiad.wmnet with OS bookworm [14:33:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:36:06] (03PS3) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) [14:36:16] dhinus: https://puppetboard.wikimedia.org/failures [14:36:34] (03CR) 10Raymond Ndibe: [wmcs::kubeadm::core] remove kubeadm-flags.env (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245) (owner: 10Raymond Ndibe) [14:36:42] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete puppetmaster::standalone role [puppet] - 10https://gerrit.wikimedia.org/r/1108428 (https://phabricator.wikimedia.org/T351452) (owner: 10Muehlenhoff) [14:36:52] > Could not find declared class prometheus::node_kernel_panic [14:36:56] I think this might be related to the recent change [14:37:20] (03CR) 10CI reject: [V:04-1] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [14:37:57] FIRING: [3x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:32] (03PS3) 10FNegri: prometheus::node_kernel_messages: fix "absent" params [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) [14:39:06] !log updating acme-chief on acmechief1002 [14:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:19] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [14:40:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P72269 and previous config saved to /var/cache/conftool/dbconfig/20250123-144011-marostegui.json [14:41:16] (03PS4) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) [14:41:47] (03CR) 10Raymond Ndibe: [wmcs::kubeadm::core] remove kubeadm-flags.env (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245) (owner: 10Raymond Ndibe) [14:42:09] (03CR) 10Raymond Ndibe: [wmcs::kubeadm::core] remove kubeadm-flags.env (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245) (owner: 10Raymond Ndibe) [14:42:11] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete WMCS Puppet 5 master classes no longer used/needed [puppet] - 10https://gerrit.wikimedia.org/r/1108430 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:42:23] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488927 (10cmooney) So I rolled-back the patch to collect the BGP metrics. The config puppet produced worked fine in magru and esams, but for some reason in eqiad stats... 
[14:42:30] (03CR) 10CI reject: [V:04-1] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [14:42:53] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1144.eqiad.wmnet with reason: host reimage [14:42:54] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1142.eqiad.wmnet with reason: host reimage [14:43:01] (03PS4) 10FNegri: prometheus::node_kernel_messages: fix timer params [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) [14:43:06] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1143.eqiad.wmnet with reason: host reimage [14:43:08] (03PS2) 10Muehlenhoff: Remove one additional obsolete Puppet 5 for Cloud VPS class [puppet] - 10https://gerrit.wikimedia.org/r/1108431 (https://phabricator.wikimedia.org/T365798) [14:43:19] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1145.eqiad.wmnet with reason: host reimage [14:43:25] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [14:43:37] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1146.eqiad.wmnet with reason: host reimage [14:43:40] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1147.eqiad.wmnet with reason: host reimage [14:43:55] (03CR) 10Vgutierrez: [C:03+2] hiera: Issue unified cert using pki.goog [puppet] - 10https://gerrit.wikimedia.org/r/1113768 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [14:44:14] sukhe: yep on it [14:44:29] <3 [14:44:31] I pushed a change without checking PCC first :/ [14:45:16] most of us have been there 😅 [14:45:30] dhinus: all good, not the first time, not the last, 
been there done that :) [14:45:37] :D [14:46:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1144.eqiad.wmnet with reason: host reimage [14:46:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet [14:47:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108431 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:48:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:48:47] (03CR) 10Andrew Bogott: [C:03+1] prometheus::node_kernel_messages: fix timer params [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [14:50:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1145.eqiad.wmnet with reason: host reimage [14:50:07] (03CR) 10FNegri: [C:03+2] prometheus::node_kernel_messages: fix timer params [puppet] - 10https://gerrit.wikimedia.org/r/1113814 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [14:51:26] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:52:00] (03PS2) 10JMeybohm: Update coredns to 1.11.3 / coredns helm chart 1.37.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113454 (https://phabricator.wikimedia.org/T341984) [14:52:00] (03PS11) 10JMeybohm: Update staging-codfw to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) 
[14:52:00] (03PS2) 10JMeybohm: Create a copy of the wikikube istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113473 (https://phabricator.wikimedia.org/T341984) [14:52:01] (03PS3) 10JMeybohm: Update wikikube istio 1.24.2 config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113474 (https://phabricator.wikimedia.org/T341984) [14:53:09] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: demonstration - bking@cumin2002 - T380752 [14:53:15] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752 [14:53:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1146.eqiad.wmnet with reason: host reimage [14:53:34] (03PS5) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) [14:54:47] (03PS3) 10DCausse: eventstreams: add wikidata & commons RDF update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) [14:55:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T384592)', diff saved to https://phabricator.wikimedia.org/P72270 and previous config saved to /var/cache/conftool/dbconfig/20250123-145518-marostegui.json [14:55:23] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [14:55:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:55:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T384592)', diff saved to https://phabricator.wikimedia.org/P72271 and previous config saved to /var/cache/conftool/dbconfig/20250123-145540-marostegui.json [14:56:07] !log kamila@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1143.eqiad.wmnet with reason: host reimage [14:58:01] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, we don't have a specified way for multiple teams at the moment, though what you did seems fine to me" [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn) [14:58:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:59:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1147.eqiad.wmnet with reason: host reimage [14:59:18] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10489014 (10Jhancock.wm) dell update. it's been escalated to the level 3 helpdesk. might not hear back from them until monday. [15:00:39] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10489035 (10MatthewVernon) Thanks for the update! 
[15:01:23] (03PS3) 10FNegri: prometheus-node-kernel-panic: remove "absent" lines [puppet] - 10https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961) [15:02:57] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T384592)', diff saved to https://phabricator.wikimedia.org/P72272 and previous config saved to /var/cache/conftool/dbconfig/20250123-150351-marostegui.json [15:03:56] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [15:04:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1142.eqiad.wmnet with reason: host reimage [15:05:56] (03PS1) 10Vgutierrez: haproxy,hiera: Deploy unified-goog TLS material [puppet] - 10https://gerrit.wikimedia.org/r/1113818 (https://phabricator.wikimedia.org/T384606) [15:06:02] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1144.eqiad.wmnet with OS bookworm [15:09:52] (03CR) 10Raymond Ndibe: "Tested on puppet server. file was removed from toolsbeta worker nfs 9 node. 
Also confirmed that a pod can still be scheduled (so doesn't a" [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245) (owner: 10Raymond Ndibe) [15:11:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1145.eqiad.wmnet with OS bookworm [15:11:01] (03CR) 10Raymond Ndibe: "`" [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245) (owner: 10Raymond Ndibe) [15:12:57] (03PS2) 10Vgutierrez: haproxy,hiera: Deploy unified-goog TLS material [puppet] - 10https://gerrit.wikimedia.org/r/1113818 (https://phabricator.wikimedia.org/T384606) [15:13:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1146.eqiad.wmnet with OS bookworm [15:13:22] (03CR) 10FNegri: [C:03+2] wmcs: update kernel alerts [alerts] - 10https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [15:13:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:14:33] (03Merged) 10jenkins-bot: wmcs: update kernel alerts [alerts] - 10https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [15:15:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1143.eqiad.wmnet with OS bookworm [15:18:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1147.eqiad.wmnet with OS bookworm [15:18:13] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113818 (https://phabricator.wikimedia.org/T384606) (owner: 10Vgutierrez) [15:18:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db1168', diff saved to https://phabricator.wikimedia.org/P72273 and previous config saved to /var/cache/conftool/dbconfig/20250123-151858-marostegui.json [15:19:40] jouncebot nowandnext [15:19:40] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [15:19:40] In 0 hour(s) and 40 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1600) [15:19:53] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [15:21:29] !log 1.44.0-wmf.13 train (T382364): unblocked, rolling to group1 [15:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:33] T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364 [15:22:23] (03CR) 10AOkoth: "I think the batch/v1 is pretty stable: https://kubernetes.io/docs/reference/using-api/deprecation-guide/ from reading this. We can test as" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [15:23:03] (03CR) 10Scott French: [C:03+1] "Thanks, Effie!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [15:23:10] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113821 (https://phabricator.wikimedia.org/T382364) [15:23:11] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113821 (https://phabricator.wikimedia.org/T382364) (owner: 10TrainBranchBot) [15:23:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1142.eqiad.wmnet with OS bookworm [15:23:58] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113821 (https://phabricator.wikimedia.org/T382364) (owner: 10TrainBranchBot) [15:24:08] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [15:24:37] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [15:27:57] (03CR) 10David Caro: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [15:27:59] jouncebot: nowandnext [15:27:59] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [15:27:59] In 0 hour(s) and 32 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1600) [15:29:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10489148 (10kamila) [15:30:43] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: 
decommission db2134.codfw.wmnet - https://phabricator.wikimedia.org/T384476#10489162 (10Jhancock.wm) 05Open→03Resolved a:05FCeratto-WMF→03Jhancock.wm [15:31:26] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1142-1147].eqiad.wmnet [15:31:27] !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[1142-1147].eqiad.wmnet [15:31:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10489182 (10phaultfinder) [15:31:40] (03CR) 10Ssingh: [C:03+1] "Looks good, nicely done!" [puppet] - 10https://gerrit.wikimedia.org/r/1113818 (https://phabricator.wikimedia.org/T384606) (owner: 10Vgutierrez) [15:34:03] (03CR) 10Vgutierrez: [C:03+2] haproxy,hiera: Deploy unified-goog TLS material [puppet] - 10https://gerrit.wikimedia.org/r/1113818 (https://phabricator.wikimedia.org/T384606) (owner: 10Vgutierrez) [15:34:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P72274 and previous config saved to /var/cache/conftool/dbconfig/20250123-153405-marostegui.json [15:34:52] 10ops-codfw, 06SRE, 10Cassandra, 06DC-Ops: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820#10489194 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm not seeing any new errors on this machine. gonna close this ticket for now, but if it errors again, feel free to reopen or start... 
[15:35:06] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1142-1147].eqiad.wmnet [15:35:06] !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[1142-1147].eqiad.wmnet [15:35:08] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.13 refs T382364 [15:35:16] T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364 [15:36:12] (03CR) 10Federico Ceratto: [C:03+1] instances.yaml, db2136.yaml, site.pp: Remove db2136 [puppet] - 10https://gerrit.wikimedia.org/r/1113807 (https://phabricator.wikimedia.org/T384479) (owner: 10Federico Ceratto) [15:36:14] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml, db2136.yaml, site.pp: Remove db2136 [puppet] - 10https://gerrit.wikimedia.org/r/1113807 (https://phabricator.wikimedia.org/T384479) (owner: 10Federico Ceratto) [15:36:26] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1142-1147].eqiad.wmnet [15:36:26] !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[1142-1147].eqiad.wmnet [15:37:07] (03CR) 10Fabfur: [C:03+1] haproxy,hiera: Deploy unified-goog TLS material [puppet] - 10https://gerrit.wikimedia.org/r/1113818 (https://phabricator.wikimedia.org/T384606) (owner: 10Vgutierrez) [15:37:17] (03PS6) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) [15:37:38] (03CR) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [15:38:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10489207 (10VRiley-WMF) 
[15:38:30] (03CR) 10CI reject: [V:04-1] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [15:40:20] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [15:40:47] (03CR) 10David Caro: [C:03+1] prometheus-node-kernel-panic: remove "absent" lines [puppet] - 10https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [15:41:05] (03CR) 10FNegri: [C:03+2] prometheus-node-kernel-panic: remove "absent" lines [puppet] - 10https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [15:42:32] (03PS7) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) [15:43:47] (03CR) 10CI reject: [V:04-1] mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [15:48:14] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2189 [15:48:24] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:48:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2189 [15:48:30] (03CR) 10Scott French: "Thanks, Effie!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [15:48:46] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:50:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Removing db2136 T384479', diff saved to https://phabricator.wikimedia.org/P72276 and previous config saved to /var/cache/conftool/dbconfig/20250123-155016-fceratto.json [15:50:21] T384479: decommission db2136.codfw.wmnet - https://phabricator.wikimedia.org/T384479 [15:50:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T384592)', diff saved to https://phabricator.wikimedia.org/P72277 and previous config saved to /var/cache/conftool/dbconfig/20250123-155023-marostegui.json [15:50:28] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [15:50:35] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489374 (10Jhancock.wm) @Marostegui db2189 is moved, updated, and pinging! 
[15:50:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1173.eqiad.wmnet with reason: Maintenance [15:50:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T384592)', diff saved to https://phabricator.wikimedia.org/P72278 and previous config saved to /var/cache/conftool/dbconfig/20250123-155045-marostegui.json [15:50:48] federico3: FYI the above alert is what we get if there are changes in dbctl not committed for some time [15:50:54] it will recover now ofc [15:50:56] (03PS1) 10Jgiannelos: kartotherian: Bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113824 [15:51:15] (03PS8) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) [15:51:15] not sure if you had already the chance to see it in action, hence why I'm mentioning it :) [15:51:34] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489378 (10Jhancock.wm) [15:52:17] (03CR) 10Jgiannelos: [C:03+2] kartotherian: Bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113824 (owner: 10Jgiannelos) [15:53:15] volans: I'm aware - I was looking at dbctl config diff but how is it getting changes from the CR? 
[15:53:23] (03Merged) 10jenkins-bot: kartotherian: Bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113824 (owner: 10Jgiannelos) [15:53:24] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:53:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10489385 (10fnegri) [15:53:46] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:53:48] when you run puppet-merge [15:55:24] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [15:56:07] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [15:56:18] (03CR) 10Effie Mouzeli: mw-on-k8s: update PHPFPMTooBusy to alert per release (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [15:59:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T384592)', diff saved to https://phabricator.wikimedia.org/P72279 and previous config saved to /var/cache/conftool/dbconfig/20250123-155910-marostegui.json [15:59:15] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [15:59:16] (03CR) 10Scott French: [C:03+1] "Nice, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1113811 (https://phabricator.wikimedia.org/T384532) (owner: 10Effie Mouzeli) [16:00:04] brennen and jeena: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1600). 
[16:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:06:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10489448 (10fnegri) This is firing again today. [16:07:18] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489449 (10Marostegui) >>! In T383709#10489374, @Jhancock.wm wrote: > @Marostegui db2189 is moved, updated, and... [16:07:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72280 and previous config saved to /var/cache/conftool/dbconfig/20250123-160730-root.json [16:09:23] FIRING: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:10:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10489466 (10VRiley-WMF) [16:11:16] (03CR) 10Muehlenhoff: [C:03+2] Remove one additional obsolete Puppet 5 for Cloud VPS class [puppet] - 10https://gerrit.wikimedia.org/r/1108431 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [16:12:19] RESOLVED: ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P72281 and previous config saved to /var/cache/conftool/dbconfig/20250123-161417-marostegui.json [16:15:10] FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:16:42] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10489498 (10RobH) a:05RobH→03Vgutierrez @Vgutierrez, >>! In T382026#10489496, @RobH wrote: >> Good afternoon Dear >> The infrastructure team installed a Blanking Panel... 
[16:16:55] RESOLVED: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:18:57] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10489523 (10ssingh) a:05Vgutierrez→03BCornwall [16:19:39] (03PS1) 10Brouberol: airflow: disable the wmf_auto_restart services along with airflow [puppet] - 10https://gerrit.wikimedia.org/r/1113833 [16:19:51] (03PS1) 10Subramanya Sastry: For Parsoid calls, treat preprocessing as starting in SOL state [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) [16:20:16] (03CR) 10DCausse: [C:03+2] "PS3 only bumps from 0.10.0 (broken) to 0.11.0" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) (owner: 10DCausse) [16:21:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: 10Subramanya Sastry) [16:21:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113833 (owner: 10Brouberol) [16:21:20] (03PS2) 10Brouberol: airflow: disable the wmf_auto_restart services along with airflow [puppet] - 10https://gerrit.wikimedia.org/r/1113833 [16:21:40] (03Merged) 10jenkins-bot: eventstreams: add wikidata & commons RDF update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) (owner: 10DCausse) [16:21:43] (03CR) 10Muehlenhoff: [C:03+2] osm_master: Provide a dummy variable for tilerator on 
bookworm roles [puppet] - 10https://gerrit.wikimedia.org/r/1113810 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:22:09] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4859/co" [puppet] - 10https://gerrit.wikimedia.org/r/1113833 (owner: 10Brouberol) [16:22:18] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [16:22:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72282 and previous config saved to /var/cache/conftool/dbconfig/20250123-162235-root.json [16:22:42] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2022.codfw.wmnet with reason: remove from cluster for reimage [16:22:48] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10489533 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=46a6b03e-0964-494b-92f3-40af6ca3beb9) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... 
[16:22:53] (03CR) 10Brouberol: [V:03+1 C:03+2] airflow: disable the wmf_auto_restart services along with airflow [puppet] - 10https://gerrit.wikimedia.org/r/1113833 (owner: 10Brouberol) [16:23:35] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti2022 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113769 (owner: 10Muehlenhoff) [16:23:36] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [16:24:03] (03CR) 10David Caro: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [16:25:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1162 with weight 0 T384597', diff saved to https://phabricator.wikimedia.org/P72283 and previous config saved to /var/cache/conftool/dbconfig/20250123-162552-root.json [16:25:57] T384597: Switchover s2 master (db1222 -> db1162) - https://phabricator.wikimedia.org/T384597 [16:26:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T384597 [16:26:38] (03PS2) 10Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1113802 (https://phabricator.wikimedia.org/T384597) [16:26:54] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:27:11] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1113802 (https://phabricator.wikimedia.org/T384597) (owner: 10Gerrit maintenance bot) [16:28:10] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:28:58] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [16:29:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P72284 and previous config saved to /var/cache/conftool/dbconfig/20250123-162924-marostegui.json [16:29:52] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [16:31:59] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [16:33:10] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:33:11] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [16:33:24] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: demonstration - bking@cumin2002 - T380752 [16:33:28] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752 [16:33:50] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [16:34:41] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [16:35:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10489643 (10VRiley-WMF) 05Open→03Resolved [16:36:54] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:36:55] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489651 (10Jhancock.wm) @JMeybohm, what do you think of this schedule for getting these moved? wikikube-worker2... [16:37:11] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [16:37:57] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T383638#10489656 (10VRiley-WMF) a:03VRiley-WMF [16:37:59] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [16:38:07] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T383638#10489657 (10VRiley-WMF) 05Open→03Resolved Loose power cable. [16:39:26] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [16:40:24] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [16:41:05] (03PS1) 10Elukey: services: bump kartotherian's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113842 (https://phabricator.wikimedia.org/T384530) [16:42:03] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489669 (10JMeybohm) >>! In T383709#10489651, @Jhancock.wm wrote: > @JMeybohm, what do you think of this schedu... 
[16:42:51] !log Starting s2 eqiad failover from db1222 to db1162 - T384597 [16:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:55] T384597: Switchover s2 master (db1222 -> db1162) - https://phabricator.wikimedia.org/T384597 [16:43:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1162 to s2 primary T384597', diff saved to https://phabricator.wikimedia.org/P72285 and previous config saved to /var/cache/conftool/dbconfig/20250123-164322-root.json [16:44:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1222 T384597', diff saved to https://phabricator.wikimedia.org/P72286 and previous config saved to /var/cache/conftool/dbconfig/20250123-164415-marostegui.json [16:44:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T384592)', diff saved to https://phabricator.wikimedia.org/P72287 and previous config saved to /var/cache/conftool/dbconfig/20250123-164431-marostegui.json [16:44:43] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [16:44:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1180.eqiad.wmnet with reason: Maintenance [16:44:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T384592)', diff saved to https://phabricator.wikimedia.org/P72288 and previous config saved to /var/cache/conftool/dbconfig/20250123-164453-marostegui.json [16:45:42] (03CR) 10CI reject: [V:04-1] For Parsoid calls, treat preprocessing as starting in SOL state [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: 10Subramanya Sastry) [16:47:05] (03PS1) 10Marostegui: db1222: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113844 (https://phabricator.wikimedia.org/T382842) [16:47:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 
on db1222.eqiad.wmnet with reason: Index rebuild [16:47:37] (03CR) 10Marostegui: [C:03+2] db1222: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113844 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [16:47:57] (03CR) 10Subramanya Sastry: "recheck" [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: 10Subramanya Sastry) [16:51:22] (03PS1) 10CDanis: chart-renderer: new release (now w/ ECS) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113846 (https://phabricator.wikimedia.org/T383748) [16:51:41] (03PS1) 10Marostegui: rebuild_tables.sh: STOP and START SLAVE [software] - 10https://gerrit.wikimedia.org/r/1113847 (https://phabricator.wikimedia.org/T382842) [16:53:00] (03CR) 10Elukey: [C:03+2] services: bump kartotherian's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113842 (https://phabricator.wikimedia.org/T384530) (owner: 10Elukey) [16:53:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T384592)', diff saved to https://phabricator.wikimedia.org/P72289 and previous config saved to /var/cache/conftool/dbconfig/20250123-165309-marostegui.json [16:53:14] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [16:53:56] (03CR) 10Marostegui: "FYI" [software] - 10https://gerrit.wikimedia.org/r/1113847 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [16:53:58] (03CR) 10Marostegui: [C:03+2] rebuild_tables.sh: STOP and START SLAVE [software] - 10https://gerrit.wikimedia.org/r/1113847 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [16:54:31] (03Merged) 10jenkins-bot: rebuild_tables.sh: STOP and START SLAVE [software] - 10https://gerrit.wikimedia.org/r/1113847 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [16:54:42] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:58:26] (03PS11) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) [16:59:30] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:59:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:59:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:00:05] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:15] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [17:05:31] (03CR) 10CDanis: [C:04-2] "to be deployed only after 1.44.0-wmf.13 is live on group2" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113846 (https://phabricator.wikimedia.org/T383748) (owner: 10CDanis) [17:08:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P72290 and previous config saved to /var/cache/conftool/dbconfig/20250123-170816-marostegui.json [17:09:12] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10489797 (10fnegri) Moving it out of #wmcs-hardware and back to #cloud-services-team because otherwise @phaultfinder keeps on creating new tasks for this alert. [17:09:42] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10489802 (10fnegri) [17:15:33] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489809 (10Eevans) cassandra-dev2001 can be moved at your leisure (no coordination is needed). 
[17:15:40] (03CR) 10Andrea Denisse: [C:03+2] librenms: Ensure the cache/data directory belongs to librenms [puppet] - 10https://gerrit.wikimedia.org/r/1113587 (https://phabricator.wikimedia.org/T384440) (owner: 10Andrea Denisse) [17:16:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10489811 (10phaultfinder) [17:18:33] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10489812 (10Jhancock.wm) [17:20:10] !log power down cassandra-dev2001 for maintenance [17:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:40] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10489846 (10elukey) ` >>> pprint(r.request("get", "/redfish/v1/Chassis/HA-RAID.0.StorageEnclosure.1/Drives/Disk.Bay.7").json()) {'@odata... 
[17:23:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P72292 and previous config saved to /var/cache/conftool/dbconfig/20250123-172323-marostegui.json [17:26:32] (03PS12) 10JMeybohm: Update staging-codfw to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) [17:26:32] (03PS3) 10JMeybohm: Create a copy of the wikikube istio config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113473 (https://phabricator.wikimedia.org/T341984) [17:26:33] (03PS4) 10JMeybohm: Update wikikube istio 1.24.2 config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113474 (https://phabricator.wikimedia.org/T341984) [17:28:58] (03PS1) 10Federico Ceratto: instances.yaml: Remove db2140 [puppet] - 10https://gerrit.wikimedia.org/r/1113849 (https://phabricator.wikimedia.org/T384480) [17:33:07] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cassandra-dev2001 [17:33:12] (03CR) 10JMeybohm: Update staging-codfw to k8s 1.31 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:33:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cassandra-dev2001 [17:33:18] (03PS1) 10Andrea Denisse: librenms: Fix path to the cache/data directory [puppet] - 10https://gerrit.wikimedia.org/r/1113850 (https://phabricator.wikimedia.org/T384440) [17:38:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T384592)', diff saved to https://phabricator.wikimedia.org/P72293 and previous config saved to /var/cache/conftool/dbconfig/20250123-173830-marostegui.json [17:38:35] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [17:38:46] !log marostegui@cumin1002 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1187.eqiad.wmnet with reason: Maintenance [17:38:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T384592)', diff saved to https://phabricator.wikimedia.org/P72294 and previous config saved to /var/cache/conftool/dbconfig/20250123-173852-marostegui.json [17:46:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T384592)', diff saved to https://phabricator.wikimedia.org/P72295 and previous config saved to /var/cache/conftool/dbconfig/20250123-174641-marostegui.json [17:46:46] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [17:49:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:00:05] bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1800) [18:01:06] Nothing for me to ship today jouncebot. Thanks for the reminder though. You are a good bot. 
:) [18:01:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P72296 and previous config saved to /var/cache/conftool/dbconfig/20250123-180148-marostegui.json [18:05:42] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:05:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:08:38] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 6.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:08:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:09:29] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1142-1147].eqiad.wmnet [18:09:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1142-1147].eqiad.wmnet [18:12:36] (03PS1) 10Cathal Mooney: Add gnmic collection for network POPs [puppet] - 10https://gerrit.wikimedia.org/r/1113853 (https://phabricator.wikimedia.org/T384345) [18:15:02] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113853 (https://phabricator.wikimedia.org/T384345) (owner: 10Cathal Mooney) [18:16:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P72297 and previous config saved to /var/cache/conftool/dbconfig/20250123-181655-marostegui.json [18:17:20] (03PS2) 10Cathal Mooney: Add gnmic collection for network POPs [puppet] - 10https://gerrit.wikimedia.org/r/1113853 (https://phabricator.wikimedia.org/T384345) [18:19:04] (03PS1) 10Andrew 
Bogott: dsh: remove librenms group entirely [puppet] - 10https://gerrit.wikimedia.org/r/1113855 (https://phabricator.wikimedia.org/T380679) [18:19:37] (03CR) 10Andrew Bogott: [C:03+2] dsh: remove librenms group entirely [puppet] - 10https://gerrit.wikimedia.org/r/1113855 (https://phabricator.wikimedia.org/T380679) (owner: 10Andrew Bogott) [18:20:07] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113853 (https://phabricator.wikimedia.org/T384345) (owner: 10Cathal Mooney) [18:25:57] (03PS1) 10CDanis: tracing: lowercase headers before processing them [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113856 (https://phabricator.wikimedia.org/T384629) [18:27:14] (03CR) 10JHathaway: [C:03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1113853 (https://phabricator.wikimedia.org/T384345) (owner: 10Cathal Mooney) [18:31:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113856 (https://phabricator.wikimedia.org/T384629) (owner: 10CDanis) [18:31:48] (03CR) 10Cathal Mooney: [C:03+2] Add gnmic collection for network POPs [puppet] - 10https://gerrit.wikimedia.org/r/1113853 (https://phabricator.wikimedia.org/T384345) (owner: 10Cathal Mooney) [18:32:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T384592)', diff saved to https://phabricator.wikimedia.org/P72298 and previous config saved to /var/cache/conftool/dbconfig/20250123-183202-marostegui.json [18:32:07] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [18:32:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:37:30] FIRING: [2x] ProbeDown: Service wdqs1019:443 has failed probes 
(http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:39:21] (03CR) 10CDanis: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1112248 (https://phabricator.wikimedia.org/T383900) (owner: 10Filippo Giunchedi) [18:42:30] RESOLVED: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:42:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite) [18:43:36] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device cr2-eqord [18:43:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqord [18:44:21] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device cr2-eqdfw [18:44:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqdfw [18:49:11] (03CR) 10Dzahn: alertmanager: add missing route for sre-collab-releng receiver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn) [18:51:26] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:53:04] (03PS3) 10Dzahn: alertmanager: add missing route for 
sre-collab-releng receiver [puppet] - 10https://gerrit.wikimedia.org/r/1113594 [18:53:34] 06SRE, 06Infrastructure-Foundations, 10observability, 10SRE Observability (FY2024/2025-Q3): LibreNMS changes on every puppet run since upgrade to 24.12 - https://phabricator.wikimedia.org/T384440#10490214 (10andrea.denisse) [18:53:55] 06SRE, 06Infrastructure-Foundations, 10observability, 10SRE Observability (FY2024/2025-Q3): LibreNMS changes on every puppet run since upgrade to 24.12 - https://phabricator.wikimedia.org/T384440#10490215 (10andrea.denisse) 05Open→03Resolved [18:53:58] (03CR) 10Dzahn: alertmanager: add missing route for sre-collab-releng receiver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113594 (owner: 10Dzahn) [19:00:05] brennen and jeena: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T1900) [19:01:04] hello. [19:02:53] (03CR) 10Majavah: [C:03+1] C:netbox: Allow NDA group to access Netbox. [puppet] - 10https://gerrit.wikimedia.org/r/1070563 (https://phabricator.wikimedia.org/T373702) (owner: 10Slyngshede) [19:04:52] !log fix my netbox account T373702 [19:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:56] T373702: Unable to log in to Netbox - https://phabricator.wikimedia.org/T373702 [19:08:12] (03PS2) 10Raymond Ndibe: [wmcs::kubeadm::core] remove kubeadm-flags.env [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T374193) [19:11:26] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [19:14:01] !log vriley@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:14:42] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [19:15:52] !log 1.44.0-wmf.13 train (T382364): no current blockers, logs relatively clean, rolling to all wikis. 
[19:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:56] T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364 [19:16:10] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113862 (https://phabricator.wikimedia.org/T382364) [19:16:12] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113862 (https://phabricator.wikimedia.org/T382364) (owner: 10TrainBranchBot) [19:16:22] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:16:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10490318 (10phaultfinder) [19:16:57] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113862 (https://phabricator.wikimedia.org/T382364) (owner: 10TrainBranchBot) [19:18:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1231.eqiad.wmnet with reason: Maintenance [19:18:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T384592)', diff saved to https://phabricator.wikimedia.org/P72299 and previous config saved to /var/cache/conftool/dbconfig/20250123-191808-marostegui.json [19:18:13] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [19:18:18] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clou~dgw1004 - vriley@cumin1002" [19:18:23] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clou~dgw1004 - vriley@cumin1002" [19:18:23] !log vriley@cumin1002 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [19:19:16] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:19:29] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:19:34] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:21:20] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:21:31] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:22:37] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:22:48] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudgw1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:24:51] (03CR) 10BCornwall: [V:03+2 C:03+2] slo_template: update SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1113212 (owner: 10BCornwall) [19:25:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T384592)', diff saved to https://phabricator.wikimedia.org/P72300 and previous config saved to /var/cache/conftool/dbconfig/20250123-192517-marostegui.json [19:25:22] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [19:26:59] 10ops-eqiad, 06DC-Ops: 
PowerSupplyFailure - https://phabricator.wikimedia.org/T384645 (10phaultfinder) 03NEW [19:33:10] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.13 refs T382364 [19:33:15] T382364: 1.44.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T382364 [19:37:25] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10490386 (10Jhancock.wm) [19:38:09] brennen: things looking good? [19:40:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P72301 and previous config saved to /var/cache/conftool/dbconfig/20250123-194024-marostegui.json [19:41:21] cdanis: yeah, pretty chill [19:41:37] 06SRE, 06Infrastructure-Foundations, 10netops: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10490388 (10cmooney) 05Open→03Resolved a:03cmooney This is working now {F58260515 width=700} [19:43:31] brennen: cool, any objections to me sneaking in my backport right now? [19:43:57] go right ahead [19:44:17] (03PS2) 10Jforrester: tracing: lowercase headers before processing them [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113856 (https://phabricator.wikimedia.org/T384629) (owner: 10CDanis) [19:44:22] there're one or two little things i'm keeping an eye on, but nothing throwing a high rate of errors. [19:44:37] (03CR) 10Jforrester: "(Re-cherry-picked merely to inject the -x hash attribution.)" [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113856 (https://phabricator.wikimedia.org/T384629) (owner: 10CDanis) [19:46:53] James_F: was that just a description edit? [19:47:02] cccccbefecbvtkvnknlggcfdhgheuguuvihrueultngv [19:47:06] sigh [19:47:12] cdanis: Yup, and hello to your YubiKey too. 
[19:47:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113856 (https://phabricator.wikimedia.org/T384629) (owner: 10CDanis) [19:55:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P72302 and previous config saved to /var/cache/conftool/dbconfig/20250123-195531-marostegui.json [20:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:07:13] (03Merged) 10jenkins-bot: tracing: lowercase headers before processing them [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113856 (https://phabricator.wikimedia.org/T384629) (owner: 10CDanis) [20:07:28] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1113856|tracing: lowercase headers before processing them (T384629)]] [20:07:33] T384629: Mediawiki OTel exports broken as of wmf.12 release - https://phabricator.wikimedia.org/T384629 [20:10:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T384592)', diff saved to https://phabricator.wikimedia.org/P72303 and previous config saved to /var/cache/conftool/dbconfig/20250123-201038-marostegui.json [20:10:43] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [20:10:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [20:11:45] !log cdanis@deploy2002 cdanis: Backport for [[gerrit:1113856|tracing: lowercase headers before processing them (T384629)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) 
[20:12:21] !log cdanis@deploy2002 cdanis: Continuing with sync [20:19:53] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113856|tracing: lowercase headers before processing them (T384629)]] (duration: 12m 25s) [20:20:02] hooray [20:27:20] (03PS1) 10Raymond Ndibe: [toolforge::harbor] upgrade harbor v2.10.1 ---> v2.12.2 [puppet] - 10https://gerrit.wikimedia.org/r/1113871 (https://phabricator.wikimedia.org/T358225) [20:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10490485 (10phaultfinder) [20:33:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:42:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2151.codfw.wmnet with reason: Maintenance [20:42:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T384592)', diff saved to https://phabricator.wikimedia.org/P72304 and previous config saved to /var/cache/conftool/dbconfig/20250123-204245-marostegui.json [20:42:50] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [20:46:38] (03PS2) 10CDanis: chart-renderer: new release (now w/ ECS) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113846 (https://phabricator.wikimedia.org/T383748) [20:57:35] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Neslihan Turan - WMDE - https://phabricator.wikimedia.org/T384017#10490551 (10KFrancis) I'm processing the NDA now. I'll confirm when it's complete. Thanks! [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T2100). [21:00:05] cscott and cwhite: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:19] i'm here! [21:00:35] o/ [21:00:39] i can deploy [21:00:54] unless cscott you'd like to self-deploy? [21:01:52] no, i appreciate the help [21:02:05] i'd rather you deploy, i'm very rusty [21:02:11] np! [21:03:13] cscott: should this be rebased on 1.44.0-wmf.13 [21:03:24] ? [21:03:43] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1113834 is on the wmf.13 branch I think? [21:03:52] maybe i put the wrong patch on the deploy calendar? [21:04:49] i think that's right - i just usually try to rebase patches before scap backporting them - i think it's safe to rebase on top of wmf.13 [21:05:28] yeah, should be safe to rebase. we just cherry-picked this this morning, but i guess maybe other patches landed on wmf.13 since then. [21:05:38] (03PS2) 10Subramanya Sastry: For Parsoid calls, treat preprocessing as starting in SOL state [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) [21:06:12] o/ [21:06:57] cjming: yeah looks like some changes to /libs/telemetry landed on .13 but they should be completely independent of our parser patch. 
[21:07:23] cscott: 18 mins for CI and i see it failing [21:07:45] (03PS1) 10Cathal Mooney: Force VFUK learnt routes over alternate transit in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1113874 [21:08:23] (03CR) 10Cathal Mooney: [C:03+2] Force VFUK learnt routes over alternate transit in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1113874 (owner: 10Cathal Mooney) [21:08:57] (03Merged) 10jenkins-bot: Force VFUK learnt routes over alternate transit in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1113874 (owner: 10Cathal Mooney) [21:10:09] cscott: this is just the rebase [21:12:00] 06SRE, 06Infrastructure-Foundations, 10netops: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253#10490568 (10cmooney) 05Open→03Resolved Gonna close this one for now, the balance is better with the changes we added and we can review as time goes on. [21:13:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T384592)', diff saved to https://phabricator.wikimedia.org/P72305 and previous config saved to /var/cache/conftool/dbconfig/20250123-211306-marostegui.json [21:13:11] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [21:14:17] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7001.magru.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [21:14:22] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490586 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7b39f587-684b-42ab-a96c-cf552c03a29d) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th... [21:15:00] cscott: not sure how to proceed - presumably rebase will not pass CI - are you ok with me deploying the next patch in the queue while things get sorted out with your patch? 
[21:16:23] it appears to be a transient failure in API tests in CI. go ahead and deploy the next patch in the queue, i'll see if I can kick CI. [21:16:47] cool thanks [21:16:58] hi cwhite: i'll do your config patch now [21:17:10] ^ failure appears to have nothing to do with our patch, just a race condition. https://www.irccloud.com/pastebin/wYoKpq5X/ [21:17:43] Thank you! [21:18:15] (03PS6) 10Krinkle: Profiler: centralize metrics send to a function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite) [21:18:33] (03CR) 10CI reject: [V:04-1] For Parsoid calls, treat preprocessing as starting in SOL state [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: 10Subramanya Sastry) [21:18:57] (03CR) 10C. Scott Ananian: "recheck" [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: 10Subramanya Sastry) [21:19:38] on a /completely/ unrelated note, where can i formally submit a request that we have an OKR for reducing/eliminating spurious CI failures? 
[21:19:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite) [21:20:23] ++ to that OKR [21:20:30] (03Merged) 10jenkins-bot: Profiler: centralize metrics send to a function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite) [21:20:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:20:46] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1081460|Profiler: centralize metrics send to a function]] [21:22:30] recheck looks successful, fingers crossed selenium doesn't crap out [21:24:25] cwhite: on test servers if verifiable - lmk if/when to sync [21:25:01] checking [21:25:08] !log cjming@deploy2002 cwhite, cjming: Backport for [[gerrit:1081460|Profiler: centralize metrics send to a function]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:28:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P72306 and previous config saved to /var/cache/conftool/dbconfig/20250123-212813-marostegui.json [21:29:12] cjming: looks good to me, please feel free to continue [21:29:26] coo [21:29:28] cool [21:29:32] !log cjming@deploy2002 cwhite, cjming: Continuing with sync [21:33:09] 06SRE, 06Infrastructure-Foundations, 10netops: Manage fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10490638 (10cmooney) [21:35:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:36:14] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081460|Profiler: centralize metrics send to a function]] (duration: 15m 28s) [21:36:17] cwhite: should be live :) [21:37:02] cscott: looking good - it'll be another 18+ mins to merge it [21:37:11] Thank you! [21:37:34] cjming: it's in post-build script now [21:38:24] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490654 (10cmooney) Fwiw I thought I saw a potential optimisation to allow us to go back to the "on change" style subscription. gNMIc has a parameter that can be configu... [21:38:28] Finished: SUCCESS [21:38:33] yay! [21:38:58] (and boo for spurious CI failures and long CI times, but... sigh) [21:39:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: 10Subramanya Sastry) [21:43:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P72307 and previous config saved to /var/cache/conftool/dbconfig/20250123-214320-marostegui.json [21:57:22] (03Merged) 10jenkins-bot: For Parsoid calls, treat preprocessing as starting in SOL state [core] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113834 (https://phabricator.wikimedia.org/T382464) (owner: 10Subramanya Sastry) [21:57:37] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1113834|For Parsoid calls, treat preprocessing as starting in SOL state (T382464)]] [21:57:42] T382464: Parsoid's list parsing seems to ignore one leading newline in templates causing rendering differences - https://phabricator.wikimedia.org/T382464 [21:58:24] cjming: looks like it merged. i'm here to test canaries. 
[21:58:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T384592)', diff saved to https://phabricator.wikimedia.org/P72308 and previous config saved to /var/cache/conftool/dbconfig/20250123-215828-marostegui.json [21:58:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2158.codfw.wmnet with reason: Maintenance [21:58:33] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [21:58:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on db2187.codfw.wmnet with reason: Maintenance [21:58:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T384592)', diff saved to https://phabricator.wikimedia.org/P72309 and previous config saved to /var/cache/conftool/dbconfig/20250123-215855-marostegui.json [21:59:14] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on netflow1002.eqiad.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [21:59:21] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490691 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3f0feb1a-6c73-4906-bb5a-2df62eb7e156) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and th... 
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250123T2200) [22:01:08] https://en.wikipedia.org/wiki/User:Cscott/T382464 is the smoke test for this patch [22:01:22] cscott: on test servers for verifying [22:01:47] testing [22:02:06] !log cjming@deploy2002 ssastry, cjming: Backport for [[gerrit:1113834|For Parsoid calls, treat preprocessing as starting in SOL state (T382464)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:02:16] cjming: looks good [22:02:24] great - syncing [22:02:28] !log cjming@deploy2002 ssastry, cjming: Continuing with sync [22:03:10] cscott, confirmed .. it looks good. [22:05:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:09:22] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113834|For Parsoid calls, treat preprocessing as starting in SOL state (T382464)]] (duration: 11m 45s) [22:09:27] T382464: Parsoid's list parsing seems to ignore one leading newline in templates causing rendering differences - https://phabricator.wikimedia.org/T382464 [22:09:44] cscott: should be live! [22:10:24] !log end of UTC late backport window [22:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:10:54] cjming, thanks! [22:11:06] subbu: yw! 
[22:11:42] PROBLEM - Disk space on arclamp1001 is CRITICAL: DISK CRITICAL - free space: /srv 10476 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops [22:11:48] PROBLEM - Disk space on arclamp2001 is CRITICAL: DISK CRITICAL - free space: /srv 10480 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops [22:12:51] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device cloudsw2-d5-eqiad [22:13:09] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw2-d5-eqiad [22:20:42] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:27:42] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:29:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 20.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:30:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T384592)', diff saved to https://phabricator.wikimedia.org/P72310 and previous config saved to /var/cache/conftool/dbconfig/20250123-223057-marostegui.json 
[22:31:02] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [22:34:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 19.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:46:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P72311 and previous config saved to /var/cache/conftool/dbconfig/20250123-224604-marostegui.json [22:51:21] cjming: thanks, as always, for handling backports [22:51:26] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:01:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P72312 and previous config saved to /var/cache/conftool/dbconfig/20250123-230112-marostegui.json [23:06:02] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490768 (10cmooney) The current configuration we have requires us to enable [[ https://gnmic.openconfig.net/user_guide/caching/ | gnmic caching ]], as we group certain me... [23:11:14] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490804 (10cmooney) FWIW I used the config from P72314 in the most recent tests. I'd tried to use some of the advice from [[ https://github.com/openconfig/gnmic/issues/4... 
[23:12:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:16:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T384592)', diff saved to https://phabricator.wikimedia.org/P72315 and previous config saved to /var/cache/conftool/dbconfig/20250123-231619-marostegui.json [23:16:25] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [23:16:26] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:16:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2169.codfw.wmnet with reason: Maintenance [23:16:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T384592)', diff saved to https://phabricator.wikimedia.org/P72316 and previous config saved to /var/cache/conftool/dbconfig/20250123-231641-marostegui.json [23:17:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:46:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T384592)', diff saved to https://phabricator.wikimedia.org/P72317 and previous config saved to /var/cache/conftool/dbconfig/20250123-234653-marostegui.json [23:46:59] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [23:48:50] 
10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10490884 (10BCornwall) Hi, @RobH, thanks for doing this! At first glance, nothing's improved. The inlet temps are acceptable at ~20° yet the CPUs are still hitting ~90°. O...