[00:05:13] (DiskSpace) resolved: Disk space puppetmaster1001:9100:/ 4.741% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [01:23:46] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:24:33] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:45:59] (PuppetFailure) firing: Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:06:16] (NodeTextfileStale) firing: Stale textfile for build2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:07:15] (NodeTextfileStale) firing: (10) Stale textfile for cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:24:32] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:25:59] (PuppetFailure) firing: (2) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:28:46] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:48:46] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:49:33] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:20:59] (PuppetFailure) firing: (3) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:39:32] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:43:46] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:45:59] (PuppetFailure) firing: Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:50:59] (PuppetFailure) firing: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:20:59] (PuppetFailure) firing: (4) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:29:32] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:33:46] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:53:46] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:54:32] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:06:16] (NodeTextfileStale) firing: Stale 
textfile for build2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:07:16] (NodeTextfileStale) firing: (10) Stale textfile for cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:54:33] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:58:46] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:14:32] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:18:46] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:23:46] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[07:24:33] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:50:59] (PuppetFailure) firing: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:53:46] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:54:32] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:32] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:03:46] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:14:24] <_joe_> hi [08:14:41] <_joe_> all of puppet and debmonitor are broken because of a change that broke the PKI system [08:15:03] <_joe_> can someone from this team, responsible for the systems in question, please come and help trying to resolve it? 
[08:15:20] <_joe_> we're discussing matters in #sre, thanks. [08:19:51] CFSSL-PKI, Infrastructure-Foundations: PKI system is unable to serve new certificates to debmonitor / other systems, causing puppet failures across the fleet. - https://phabricator.wikimedia.org/T350111 (Joe) [08:20:49] CFSSL-PKI, Infrastructure-Foundations: PKI system is unable to serve new certificates to debmonitor / other systems, causing puppet failures across the fleet. - https://phabricator.wikimedia.org/T350111 (Joe) p:Triage→Unbreak! [08:20:59] (PuppetFailure) firing: (4) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:33:59] (PuppetFailure) firing: Puppet has failed on krb2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:38:46] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:39:32] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:43:47] (SystemdUnitFailed) firing: (45) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:01:12] (PuppetFailure) resolved: (2) Puppet has failed on cumin1001:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:01:20] (PuppetFailure) firing: (4) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:03:59] (PuppetFailure) resolved: Puppet has failed on krb2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:05:59] (PuppetFailure) resolved: (4) Puppet has failed on ganeti1029:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:38:26] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) [09:38:34] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) [09:54:28] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) [10:00:47] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) The last successful sign in eqiad was at 2023-10-30T21:19:14 and in codfw at 2023-10-30T23:04:02 [10:03:31] CFSSL-PKI, Infrastructure-Foundations: PKI system is unable to serve new certificates to debmonitor / other systems, causing puppet failures across the fleet. 
- https://phabricator.wikimedia.org/T350111 (jbond) [10:03:42] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) [10:04:06] CFSSL-PKI, Infrastructure-Foundations: PKI system is unable to serve new certificates to debmonitor / other systems, causing puppet failures across the fleet. - https://phabricator.wikimedia.org/T350111 (jbond) The immediate incident has been resolved; I'll complete the investigation in T350118 [10:06:16] (NodeTextfileStale) firing: Stale textfile for build2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:07:16] (NodeTextfileStale) firing: (10) Stale textfile for cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:11:04] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) [10:16:59] (PuppetFailure) firing: Puppet has failed on pki1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:28:46] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:29:32] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:32:57] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) It seems Apache reloads at 00:00 every night. I believe this is what caused the issue. The PKI certificates were rotated to puppet7 at 17... [10:34:38] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond) [10:34:41] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) Open→In progress p:Triage→Medium [10:38:49] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) [10:43:38] CAS-SSO, Infrastructure-Foundations, Release-Engineering-Team (Social Piranhas 🐟): Correct IDP Privacy Policy - https://phabricator.wikimedia.org/T350129 (Aklapper) p:Triage→Low [10:49:10] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Puppet (Puppet 7.0): Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) [11:24:36] CAS-SSO, Infrastructure-Foundations, Patch-For-Review, Release-Engineering-Team (Social Piranhas 🐟): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129 (Peachey88) [11:31:59] (PuppetFailure) firing: (2) Puppet has failed on pki1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:38:16] netops, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: Restrict traffic from instances to private IPs on cloudgw level - 
https://phabricator.wikimedia.org/T350132 (taavi) [11:48:46] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:32] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:46] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:33] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:59] (PuppetFailure) firing: (2) Puppet has failed on pki1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:08:46] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:54:32] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:58:46] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:59] (PuppetFailure) resolved: Puppet has failed on pki2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:44:32] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:48:46] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:53:46] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:58:46] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:59:32] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:03:46] (SystemdUnitFailed) firing: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:16] (NodeTextfileStale) firing: Stale textfile for build2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:07:16] (NodeTextfileStale) firing: (10) Stale textfile for cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:08:46] (SystemdUnitFailed) resolved: (3) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:11:14] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (Volans) @cmooney thanks for the summary, couple of questions: 1) will the migration be performed rack by rack as opposed to s... [14:19:51] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (Papaul) @Volans to get the prefix ge vs xe maybe use the rack. In codfw we have only 10g servers racked in 10g rack and th... 
[14:21:57] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (ayounsi) > will the migration be performed rack by rack as opposed to server by server? yep > For multi-unit servers we pick... [14:23:53] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (Papaul) yes we always pick the lower numbering unit for 2U host. [14:36:36] SRE-tools, Infrastructure-Foundations: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (ayounsi) [14:36:58] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (ayounsi) [14:37:06] SRE-tools, Infrastructure-Foundations: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (ayounsi) [14:57:22] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney) >>! In T327938#9234691, @Volans wrote: > @cmooney adding a note here to not forget. We'll need to check how it will work for Ganeti VMs, in particular the makev... [15:15:03] SRE-tools, Infrastructure-Foundations: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (ayounsi) There will be special use cases, but if we can tackle all the regular servers (e.g. 1 uplink, 1 IP, 1 , then we will be in a great spot. The ideal/cleanest is to go through a re... 
[15:56:01] (NodeTextfileStale) firing: (2) Stale textfile for build2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:07:31] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (Volans) >>! In T348129#9295072, @ayounsi wrote: >> this way there is no check to ensure that reality corresponds to what we do... [16:27:28] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (ayounsi) > What I mean is that this way it might be harder to catch mistakes, if a host has been plugged into a different port... [17:16:00] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond) [17:16:18] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Investigate PKI errors - https://phabricator.wikimedia.org/T350118 (jbond) In progress→Resolved a:jbond This is fixed now [18:05:20] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Create automation to move servers in Netbox from old to new switch - https://phabricator.wikimedia.org/T348129 (ayounsi) It's live on netbox-next: https://netbox-next.wikimedia.org/extras/scripts/move_server.MoveServersUplinks/ See that... 
[18:07:16] (NodeTextfileStale) firing: (10) Stale textfile for cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:56:16] (NodeTextfileStale) firing: (2) Stale textfile for build2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:07:16] (NodeTextfileStale) firing: (10) Stale textfile for cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:27:13] (DiskSpace) firing: Disk space puppetmaster1001:9100:/ 5.946% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:56:16] (NodeTextfileStale) firing: (2) Stale textfile for build2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale