[01:56:17] (NodeTextfileStale) firing: Stale textfile for puppetserver2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:03:47] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:46:02] (NodeTextfileStale) firing: (3) Stale textfile for puppetserver2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:03:47] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:06:02] (NodeTextfileStale) firing: (4) Stale textfile for puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:26:02] (NodeTextfileStale) firing: (5) Stale textfile for puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:28:37] SRE-tools, Data-Persistence, Spicerack, Traffic, serviceops: Switch conftool to use the version 3 etcd datastore - https://phabricator.wikimedia.org/T350565 (Joe)
[09:24:41] (SystemdUnitFailed) resolved: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:44:41] (NodeTextfileStale) firing: (6) Stale textfile for puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:49:41] (NodeTextfileStale) firing: (7) Stale textfile for puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:58:11] (NodeTextfileStale) firing: (8) Stale textfile for puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:08:11] (NodeTextfileStale) firing: (9) Stale textfile for puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:18:11] (NodeTextfileStale) firing: (10) Stale textfile for puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:29:41] (NodeTextfileStale) firing: (11) Stale textfile for puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:33:11] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on ganeti3007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:33:11] (NodeTextfileStale) firing: (13) Stale textfile for puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:33:11] (SystemdUnitFailed) resolved: export_smart_data_dump.service Failed on ganeti3007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:38:11] (NodeTextfileStale) firing: (13) Stale textfile for puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:39:41] (NodeTextfileStale) resolved: (13) Stale textfile for puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:28:00] netops, Infrastructure-Foundations, SRE: Support Anycast GW on EVPN switches without unique IP - https://phabricator.wikimedia.org/T350579 (cmooney) p:Triage→Medium
[13:13:56] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (ayounsi) For cross-sites router to router we use the TTL value to eventually take down the session if the BGP session takes a too long path, it's cl...
[13:47:56] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[13:48:57] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[13:55:43] netops, Infrastructure-Foundations, SRE, ops-eqiad: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (ayounsi) a:Jclark-ctr
[14:00:16] netops, Infrastructure-Foundations, SRE, ops-eqiad: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (ayounsi) Switch has been removed from config and powered off. All yours to do the remaining steps. I think https://netbox.wikimedia.org/dcim/cables/5708/ are 40G optics,...
[14:00:32] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[14:04:00] netops, Infrastructure-Foundations, SRE: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (ayounsi)
[14:04:05] netops, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) Resolved→Open When working on something else I noticed that those were still in Netbox: htt...
[14:04:41] (SystemdUnitFailed) firing: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:05:53] netops, Infrastructure-Foundations, SRE, ops-eqiad: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (ayounsi)
[14:09:41] (SystemdUnitFailed) resolved: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:16:00] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[14:19:41] (SystemdUnitFailed) firing: (2) prometheus_puppet_agent_stats.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:23:11] (SystemdUnitFailed) resolved: (2) prometheus_puppet_agent_stats.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:26:01] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (cmooney) Open→Resolved >>! In T350488#9308379, @ayounsi wrote: > For cross-sites router to router we use the TTL value to eventually take do...
[14:27:16] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[14:27:54] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (cmooney) Resolved→Open Eh not sure how I accidentally set this to resolved!
[14:29:56] Packaging, Infrastructure-Foundations, cloud-services-team (FY2023/2024-Q1): wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (fnegri) Thanks @jcrespo, I have just installed the new packages in our dev cluster (`cloudservices200[45]-dev.codfw.wmnet`) and I will inst...
[14:30:06] Packaging, Infrastructure-Foundations, cloud-services-team (FY2023/2024-Q1): wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (fnegri) In progress→Resolved
[14:33:11] (SystemdUnitFailed) firing: (2) prometheus_puppet_agent_stats.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:48:09] SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, SRE, and 2 others: Update reimage cookbooks to work with puppet7 - https://phabricator.wikimedia.org/T348319 (jbond)
[14:49:03] SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, SRE, and 2 others: Update reimage cookbooks to work with puppet7 - https://phabricator.wikimedia.org/T348319 (jbond) @Volans FYI ill update the d3ecomuission cookbook as part of this task, thanks for the pointer
[14:53:11] (SystemdUnitFailed) resolved: prometheus_puppet_agent_stats.service Failed on ganeti1028:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:53:17] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[14:58:47] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[15:33:19] CFSSL-PKI, Infrastructure-Foundations: Investigate SCEP proxy options - https://phabricator.wikimedia.org/T340193 (jbond) Open→Declined
[15:41:30] CFSSL-PKI, Infrastructure-Foundations, SRE, User-jbond: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (jbond)
[16:04:19] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review, User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (jbond)
[16:04:29] Puppet, Infrastructure-Foundations, Puppet CI, SRE, and 2 others: update pcc with puppet 7 support - https://phabricator.wikimedia.org/T236373 (jbond) Open→Resolved a:jbond This is done
[16:32:31] SRE-tools, Infrastructure-Foundations, Puppet CI, SRE, Continuous-Integration-Config: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494 (Volans)
[16:34:36] Puppet, Infrastructure-Foundations, SRE, Epic, and 2 others: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645 (jbond)
[17:03:38] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 (Jclark-ctr) a:VRiley-WMF
[17:28:11] (SystemdUnitFailed) firing: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:33:11] (SystemdUnitFailed) resolved: check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:14:54] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 (VRiley-WMF) Ran the cable and plugged it into requested ports.
[19:23:37] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 (cmooney) Thanks @VRiley-WMF. Right now we can't see the status as the port needs to be enabled for 100G. But that involves resetting PIC 0/1 completely which wil...
[20:02:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:07:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown