[00:04:13] (DiskSpace) resolved: Disk space puppetmaster1001:9100:/ 5.439% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:42:57] (SystemdUnitFailed) firing: (2) ferm.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:47:08] FYI I'm re-imaging sretest1001 a couple of times to test a change in the reimage cookbook
[06:48:32] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm
[06:48:35] likewise, I'm also reimaging sretest1002 ATM
[06:48:50] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - **The rei...
[06:49:29] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm
[06:49:55] ack, lmk if you encounter any issue :)
[06:52:43] (SystemdUnitFailed) firing: (2) ferm.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:52:49] I've restarted ferm on aux-k8s-ctrl1001; it had failed to start after the reboot. It's a rare race that happens occasionally on initial boot: ferm fails to resolve a DNS record because networking isn't fully up yet
[06:54:11] we had previously patched ferm.service to Wants/After nss-lookup.target, but there's still a tiny racy window in ifupdown startup we could never really pin down. It only happens in like 1 out of 500 reboots, and the switch to nftables will make it moot going forward
[06:54:47] ack
[07:11:56] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - Downtimed...
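Note on the 06:52:49 and 06:54:11 messages: the race described there is a service trying to resolve a hostname before DNS is usable. According to the log, the mitigation actually applied was systemd unit ordering, not a retry. The Python sketch below only illustrates the failure mode with a bounded retry loop, as a different technique. `wait_for_dns`, its parameters, and the injected `resolver`/`sleep` hooks are hypothetical names for illustration and are not part of ferm or any tooling mentioned in the channel.

```python
import socket
import time


def wait_for_dns(hostname, attempts=5, delay=1.0,
                 resolver=socket.getaddrinfo, sleep=time.sleep):
    """Try to resolve `hostname` up to `attempts` times.

    Returns the 1-based attempt number that succeeded, or re-raises
    socket.gaierror if every attempt fails. `resolver` and `sleep` are
    injectable (hypothetical hooks) so the loop can be tested offline.
    """
    for attempt in range(1, attempts + 1):
        try:
            resolver(hostname, None)
            return attempt
        except socket.gaierror:
            if attempt == attempts:
                raise  # DNS never became usable within the budget
            sleep(delay)  # give networking a moment to come up


if __name__ == "__main__":
    # "localhost" normally resolves without any network access.
    print(wait_for_dns("localhost", attempts=3, delay=0.1))
```

A retry like this narrows the window but does not remove it, which matches the channel's point that the race was never fully pinned down.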
[07:22:48] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm
[07:51:20] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - Removed f...
[07:59:39] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm
[08:18:16] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - Removed f...
[08:28:06] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm
[08:43:33] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - Removed f...
[09:42:43] (SystemdUnitFailed) firing: (2) ferm.service Failed on aux-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:48:10] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm
[10:23:19] gotta love puppet, you kill it and it exits with 0 exit code
[10:23:24] (the agent ofc)
[10:23:58] kill -9 works though :D
[10:27:53] SRE-tools, Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - Removed f...
[10:42:43] (SystemdUnitFailed) firing: (2) ferm.service Failed on aux-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:01:19] SRE-tools, Infrastructure-Foundations, SRE, Traffic: cookbooks.sre.hosts.reimage should not fail if the first Puppet run failed and if the user was prompted - https://phabricator.wikimedia.org/T334880 (Volans) Open→Resolved The above patch has been merged and tested, it now will output:...
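Note on the 10:23:19 to 10:23:58 messages: one plausible reading is that the agent traps SIGTERM and shuts down cleanly, so a caller sees exit code 0, while SIGKILL cannot be trapped and so shows up as a signal death. The log does not confirm how the Puppet agent handles signals, so that is an assumption. The Python sketch below reproduces the effect with a stand-in child process on a POSIX system. `CHILD` and `kill_and_report` are hypothetical names and nothing here invokes Puppet.

```python
import signal
import subprocess
import sys

# Stand-in child (NOT the puppet agent): installs a SIGTERM handler that
# exits with status 0, then idles until it is signalled.
CHILD = r"""
import signal, sys, time
signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def kill_and_report(sig):
    """Start the stand-in child, send it `sig`, and return its return code.

    On POSIX, subprocess reports death-by-signal as a negative number
    (-N for signal N), and a normal exit as the exit status.
    """
    with subprocess.Popen([sys.executable, "-c", CHILD],
                          stdout=subprocess.PIPE, text=True) as proc:
        proc.stdout.readline()  # wait until the handler is installed
        proc.send_signal(sig)
        return proc.wait(timeout=10)


if __name__ == "__main__":
    print("SIGTERM ->", kill_and_report(signal.SIGTERM))
    print("SIGKILL ->", kill_and_report(signal.SIGKILL))
```

For a wrapper such as a reimage cookbook, the practical consequence is that a zero exit code alone does not prove the run completed; it may only mean the process was asked to stop and complied.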
[12:32:43] (SystemdUnitFailed) firing: (2) ferm.service Failed on aux-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:37:43] (SystemdUnitFailed) firing: (2) ferm.service Failed on aux-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:39:23] netops, Infrastructure-Foundations, SRE, cloud-services-team: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=744d6bf2-4472-4a4c-b0a2-ebf0e4e9d466) set by cmooney@cu...
[13:16:56] netops, Infrastructure-Foundations, SRE: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (cmooney) Upgraded to 22.2R3.15, which is now the recommended version for this platform, hoping it might make some difference, but the issue pers...
[13:20:10] netops, Infrastructure-Foundations, SRE, cloud-services-team: Join ARIN waiting list to request additional IPv4 resources. - https://phabricator.wikimedia.org/T288342 (cmooney) Open→Declined I'm going to close this task for now. We should have sufficient IPs from the RIPE waiting list fr...
[13:47:24] netops, Infrastructure-Foundations, SRE: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (cmooney) Looking further at the logs I honed in on this message: ` Mar 28 09:28:53 cloudsw1-b1-codfw sshd[11344]: subsystem request for netconf...
[14:37:13] (DiskSpace) firing: Disk space puppetmaster1001:9100:/ 5.94% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:34:51] netops, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (LSobanski)
[15:57:13] (DiskSpace) resolved: Disk space puppetmaster1001:9100:/ 5.208% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:37:43] (SystemdUnitFailed) firing: ferm.service Failed on aux-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:38:44] netops, Infrastructure-Foundations, SRE: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (cmooney) Double checking the only config that seems to be needed to allow Homer to commit is: ` system { services { netconf {...
[16:42:43] (SystemdUnitFailed) resolved: ferm.service Failed on aux-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed