[00:07:25] FIRING: [5x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:11:34] FIRING: DiskSpace: Disk space serpens:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:12:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:17:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:07:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:12:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:42:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:07:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:11:34] FIRING: DiskSpace: Disk space serpens:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[04:12:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:37:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:42:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:47:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:12:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:27:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:32:29] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:07:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:12:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:36:34] RESOLVED: DiskSpace: Disk space serpens:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:37:25] FIRING: [4x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:42:25] FIRING: [4x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:35:53] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10367177 (MoritzMuehlenhoff) The underlying failing check is defined in the headers, but not otherwise...
[09:39:34] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10367186 (MoritzMuehlenhoff) But with perccli the battery is reported to be fine (command is /opt/MegaR...
[09:40:51] SRE-tools, SRE, Data-Platform-SRE (2024.11.09 - 2024.11.29), Discovery-Search (Current work): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507#10367189 (Gehel) Resolved→Declined
[11:42:25] FIRING: [2x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:12:25] FIRING: [3x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:09:28] Mail, collaboration-services, Infrastructure-Foundations, SRE, and 3 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10368116 (Arnoldokoth) a:eoghan→Arnoldokoth
[15:11:54] moritzm: you've been busy setting up new ganeti nodes in eqiad is that right?
[15:12:28] the new nodes in eqiad are all racked and active
[15:12:33] yeah no issues
[15:12:41] what's left is just cleaning out a few more old servers
[15:12:44] I noticed something slightly imperfect in the process
[15:12:51] what is it?
[15:13:17] TL;DR when they are reimagined, the cookbook runs the "puppetdb interfaces import" at the end
[15:13:44] but after this we set up the bridges on the hosts
[15:13:48] love the reimagined typo :-)
[15:13:56] hahaha yes
[15:14:06] something needs to be reimagined - like how we set up network interfaces :P
[15:14:12] anyway it's no big deal
[15:14:31] Homer, when working out what VMs need BGP, is getting it wrong for the new hosts
[15:14:39] there is one more ganeti node in codfw left (has a broken CPU and Supermicro needs to fix it)
[15:14:41] cos it's looking for "private" and "public" bridge devices, which aren't there
[15:15:03] we can use it to test tweaks for the process
[15:15:42] ok cool
[15:15:54] I think a netbox fix should ensure we don't hit this
[15:16:17] but either way refining the process to do another import after it's in its "final state" is probably a good idea
[15:18:38] *homer fix
[15:18:47] sounds good to me, I'm afraid fixing the CPU of the broken node will take at least two more weeks, could you open a task for this so that we don't forget?
[15:19:03] yeah will do
[15:19:18] cheers
[15:37:17] netops, Infrastructure-Foundations, SRE: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175 (cmooney) NEW p:Triage→High
[15:40:42] netops, Infrastructure-Foundations, SRE: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175#10368381 (cmooney)
[15:56:26] netops, Infrastructure-Foundations, SRE: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175#10368397 (cmooney)
[16:12:25] FIRING: [2x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:19:14] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175#10368561 (cmooney) The above patch determines what devices need to peer with CRs based on vlan membership (and the vlan nami...
[17:31:37] netops, Infrastructure-Foundations, SRE: Change codfw dns hosts BGP peering to top-of-rack switch - https://phabricator.wikimedia.org/T376894#10368575 (cmooney)
[17:33:54] netops, Infrastructure-Foundations, SRE: Change codfw dns hosts BGP peering to top-of-rack switch - https://phabricator.wikimedia.org/T376894#10368581 (cmooney) Just a note that with the changes made under T381175 we are now creating the list of devices CRs need to peer with based on vlan membership....
[20:12:25] FIRING: [2x] SystemdUnitFailed: logrotate.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
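
Note on the Homer discussion in this log (15:14 onwards, T381175): the problem described is that peering was inferred from the presence of bridge devices named "private" and "public", which the new hosts did not have at import time, and the follow-up comments say the list is now built from vlan membership. The difference between the two checks can be illustrated with a rough sketch. Everything in it is an assumption made for illustration: the Host and Interface classes, both function names, the interface names and the vlan names are hypothetical, and this is not Homer's or Netbox's actual code or data model.

```python
from dataclasses import dataclass, field


@dataclass
class Interface:
    # Hypothetical model: an interface name plus the vlan names it belongs to.
    name: str
    vlans: set[str] = field(default_factory=set)


@dataclass
class Host:
    # Hypothetical model of a host as seen by an inventory import.
    name: str
    interfaces: list[Interface]


def needs_peering_by_bridge(host: Host, bridge_names=("private", "public")) -> bool:
    """Name-based check: true only if a bridge device with an expected name exists."""
    return any(iface.name in bridge_names for iface in host.interfaces)


def needs_peering_by_vlan(host: Host, peering_vlans: set[str]) -> bool:
    """Membership-based check: true if any interface is in one of the given vlans."""
    return any(iface.vlans & peering_vlans for iface in host.interfaces)


# Illustrative data only; the names below are made up.
old_host = Host("old-node", [Interface("private", {"vlan-a"}), Interface("public", {"vlan-b"})])
new_host = Host("new-node", [Interface("eno1", {"vlan-a", "vlan-b"})])
peering_vlans = {"vlan-a"}

for host in (old_host, new_host):
    print(host.name, needs_peering_by_bridge(host), needs_peering_by_vlan(host, peering_vlans))
```

In this sketch the name-based check misses the host whose bridges have not been imported yet, while the membership-based check does not depend on device names at all. That is the general shape of the change described in the task comments, not a statement about how the real patch is written.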