[00:13:59] (SystemdUnitFailed) firing: remove_old_puppet_reports.service Failed on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:59] (SystemdUnitFailed) firing: remove_old_puppet_reports.service Failed on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:02:03] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (ayounsi) Please open a new task for that. There is already a [[ https://github.com/wikimedia/operations-cookbooks/blob/ma...
[07:03:31] FYI, rebooting the netboxdb hosts in a few
[07:10:02] (SystemdUnitFailed) firing: (2) ifup@ens13.service Failed on netboxdb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:18:59] (SystemdUnitFailed) firing: (3) ifup@ens13.service Failed on netboxdb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:41:58] ck
[07:42:02] *ack
[07:42:30] moritzm: did the above alerts recover?
[07:46:33] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (ayounsi) Thanks, we had a quick chat on IRC about that and indeed that's the current conclusion. The extra details you provided (and fix suggestions...
[07:51:58] yeah, there's a reconciliation systemd timer which resolves these
[07:56:14] ok thx
[08:21:46] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (elukey) @cmooney @ayounsi thanks a lot! On the host side, I'd try two things (not sure if they could help or not): 1) Do a simple reboot. The hosts...
[08:27:05] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (cmooney) p:Triage→Low
[08:40:04] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (akosiaris) Hi # TL;DR cadvisor is to blame. Adding @fgiunchedi for his information and a thumbs up on disabling cadvisor on conf2* until we can bum...
[08:42:28] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (akosiaris) This isn't present in conf1* hosts, despite also running cadvisor and the same exact version, presumably because of a different kernel ver...
[08:45:29] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (akosiaris) There's a few more actionables here: 1. Re-evaluate our SLO target for conf hosts etcd service. Despite having exhausted the error budget...
[08:46:41] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (cmooney)
[08:53:11] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (fgiunchedi) >>! In T345738#9148513, @akosiaris wrote: > Hi > > # TL;DR > > cadvisor is to blame. Adding @fgiunchedi for his information and a thumb...
[08:59:30] netops, Infrastructure-Foundations, SRE, serviceops, Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (JMeybohm) >>! In T345738#9148572, @fgiunchedi wrote: >>>! In T345738#9148513, @akosiaris wrote: >> Hi >> >> # TL;DR >> >> cad...
[08:59:52] netops, Infrastructure-Foundations, SRE, serviceops, Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (akosiaris) >>! In T345738#9148572, @fgiunchedi wrote: >>>! In T345738#9148513, @akosiaris wrote: >> Hi >> >> # TL;DR >> >> ca...
[09:10:54] netops, Infrastructure-Foundations, SRE, Traffic: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (cmooney) p:Triage→Low
[11:18:46] netops, Infrastructure-Foundations, SRE, serviceops, Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (akosiaris) p:Triage→Medium Given this isn't urgent and we have multiple ways of dealing with this, I've re-enabled pup...
[11:19:00] (SystemdUnitFailed) firing: remove_old_puppet_reports.service Failed on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:00] (SystemdUnitFailed) firing: (2) remove_old_puppet_reports.service Failed on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:19:00] (SystemdUnitFailed) firing: (4) remove_old_puppet_reports.service Failed on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:33:33] (SystemdUnitFailed) firing: (4) remove_old_puppet_reports.service Failed on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:43:33] SRE-tools, Infrastructure-Foundations, SRE: reimage cookbook should exit cleanly if no puppet role is applied to a node - https://phabricator.wikimedia.org/T338990 (bking) Is there a way to send the puppet run output to the cookbook logs on cumin? I assume that if `install-console` can login, there's...
[14:59:27] SRE-tools, Infrastructure-Foundations: Add warning when provision cookbook is ran without the virtualization flag on hypervisors - https://phabricator.wikimedia.org/T344342 (Volans) Sure why not we can add an attempted alert, but of course would be based on some hostname matching, not super reliable. Fee...
[15:02:31] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (Volans) That's indeed something we might want to look at going forward. The only blocker I see right now is that most of the "prod...
[15:16:37] netbox, Infrastructure-Foundations: Markdown bug in Netbox-next - https://phabricator.wikimedia.org/T340444 (Volans) Does it happen on -next only? Is this still happening? Has anyone looked into it?
[15:22:29] netbox, Infrastructure-Foundations: Markdown bug in Netbox-next - https://phabricator.wikimedia.org/T340444 (ayounsi) Probably not, probably, probably not. :)
[15:28:33] (SystemdUnitFailed) firing: (3) remove_old_puppet_reports.service Failed on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:39:30] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (fnegri) True, `sre.hosts.reimage` is not likely to work anytime soon. The only `sre.*` cookbooks that I think we can easily run fr...
[15:52:05] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (Volans) Regarding keeping only the wmcs cookbooks in the config that's ok for me if it's ok for the WMCS team. At least for now, s...
[15:56:58] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (fnegri) Ok, I will create a patch!
[16:00:12] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (colewhite) Ran into this today trying to `pip install wikimedia-spicerack` (Python 3.11). Worked around it with `pip install "pyyaml<5"...
[16:00:43] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (Papaul) We will be using to test the new codfw spine/leaf new design contint2001 and thumbor2004. contint2001 will be renamed to sretest...
[16:19:52] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (jbond) @colewhite from your comment on the [[ https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/955717 | elastic search restart...
[16:41:48] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (cmooney) Thanks @Papaul !
[16:51:09] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (colewhite) >>! In T345337#9150415, @jbond wrote: > @colewhite from your comment on the [[ https://gerrit.wikimedia.org/r/c/operations/...
[19:28:33] (SystemdUnitFailed) firing: (2) remove_old_puppet_reports.service Failed on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:28:33] (SystemdUnitFailed) firing: (2) remove_old_puppet_reports.service Failed on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed