[01:08:26] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:03:26] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:58:26] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:33:26] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:58:31] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:23:09] I'm briefly reimaging sretest1002 to test the bookworm reimage issue
[07:53:25] FYI I've updated bookworm's installer as 12.1 was released 4 days ago
[08:11:43] SRE-tools, Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (Volans) I can't reproduce this with bullseye, the reimage works fine with it. I tried to reproduce it with bookworm on `sretest1002` but I got an unrelated error...
[08:11:44] although it doesn't work...
I was trying to check the reimage issue, but apparently bookworm installations are broken: https://phabricator.wikimedia.org/T342345#9043516
[08:11:48] ^^^
[08:12:04] and I can't dig into that now
[08:59:24] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:03:26] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:58:26] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:24:17] SRE-tools, Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (Fabfur) The same error goes on lvs1016 now, but didn't on previous hosts. Confirm that we're not running custom kernel (AFAIK)...
[13:58:26] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:42:38] Puppet, netbox, Infrastructure-Foundations, SRE, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (ayounsi) For `vrrp_peer` instead of doing some costly/complicated query from Netbox, I'm wondering if we could/should do it...
[15:43:59] SRE-tools, netops, Infrastructure-Foundations, SRE: Add network devices fingerprints to known_hosts - https://phabricator.wikimedia.org/T327643 (ayounsi)
[15:44:04] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[15:44:08] Puppet, netbox, Infrastructure-Foundations, SRE, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (ayounsi)
[16:04:50] SRE-tools, Infrastructure-Foundations, Goal, cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (fnegri)
[16:07:07] SRE-tools, Infrastructure-Foundations, cloud-services-team: Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (fnegri)
[16:07:11] SRE-tools, Infrastructure-Foundations, Goal, cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (fnegri)
[16:09:18] SRE-tools, Infrastructure-Foundations, Goal, cloud-services-team (FY2023/2024-Q1): Decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (fnegri)
[16:16:15] netops, Infrastructure-Foundations, SRE, ops-codfw: Upgrade new codfw switches to Juniper recommended - https://phabricator.wikimedia.org/T341670 (cmooney)
[16:16:23] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[17:15:16] CAS-SSO, Infrastructure-Foundations, SRE, collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (thcipriani)
[17:27:35] SRE-tools, Infrastructure-Foundations, cloud-services-team (FY2023/2024-Q1): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (fnegri)
[17:29:11] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (FY2023/2024-Q1): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (fnegri)
[17:30:10] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review, cloud-services-team (FY2023/2024-Q1): [spicerack] support including {project} in SAL messages - https://phabricator.wikimedia.org/T341793 (fnegri)
[17:46:53] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (FY2023/2024-Q1): tcpircbot: enable logging to #wikimedia-cloud-feed - https://phabricator.wikimedia.org/T342666 (fnegri)
[17:58:26] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:10:13] (DiskSpace) firing: Disk space puppetmaster1001:9100:/ 5.94% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:59:25] (SystemdUnitFailed) firing: (2) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:00:13] (DiskSpace) resolved: Disk space puppetmaster1001:9100:/ 5.508% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace