[00:03:13] (DiskSpace) resolved: Disk space puppetmaster1001:9100:/ 4.914% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:53:31] (SystemdUnitFailed) firing: krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:53:31] (SystemdUnitFailed) firing: krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:59:57] (SystemdUnitFailed) firing: (2) nagios-nrpe-server.service Failed on bast5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:03:31] (SystemdUnitFailed) firing: (2) nagios-nrpe-server.service Failed on bast5003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:54:12] (SystemdUnitFailed) resolved: krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:05:58] SRE-tools, Infrastructure-Foundations: cookbooks.sre.ganeti.reimage: failure reported when first puppet run succeeds after a retry - https://phabricator.wikimedia.org/T335863 (Volans)
[09:06:32] SRE-tools, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: cookbooks.sre.hosts.reimage should not fail if the first Puppet run failed and if the user was prompted - https://phabricator.wikimedia.org/T334880 (Volans)
[09:27:28] with the recent patches to migrate to the signed-by apt key handling, bookworm reimages are fully working now, I could successfully reimage sretest1002 this morning
[09:29:14] the only remaining oddity is that modules/ssh/templates/publish_fingerprints/known_hosts.epp fails on puppetmaster1001, timing-wise it matches with the sretest1002 reimage, possibly some fact expected by the template changed in bookworm's OS stack or so
[09:38:25] nice! and ack
[09:45:46] great news moritzm !
[09:56:39] if any more systemd expert could have a second check of this change for a netbox service would be great:
[09:56:50] (see also the link on line 2, is what upstream uses)
[09:56:50] https://gerrit.wikimedia.org/r/c/operations/puppet/+/915486
[09:58:34] I'll have a look in ~ 10m
[09:59:40] thx!
[10:02:13] awesome!
[10:50:50] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (cmooney) FYI I updated the wikitech page relating to this to update it (when working on T335350). https://wikitech.wikimed...
[11:00:50] JFTR, I don't think anyone currently uses it, but just in case: I'm using cuminunpriv1001 as a testbed for the new bullseye Kerberos KDCs, so various things may be off for the next hours
[11:01:57] k
[11:03:32] XioNoX: I updated my patch of the interface_automation.py script to delete interfaces with IPs if the IPs were SLAAC ones
[11:04:06] but I'm kind of thinking that's overkill to have in there, probably we should just run a simple script to delete any SLAAC addresses that are in Netbox and be done with it?
[11:04:28] I believe there may be some, from before the import script checked and skipped importing them
[11:04:38] (but it does that, so we should have no additional ones added)
[11:06:08] yeah if we don't import them anymore, a one off query (and even manual delete) might be enough
[11:06:27] if there is a risk of more creeping up, maybe a netbox report
[11:06:58] I don't think there's much risk of more being added, the skip works fine on the puppetdb import
[11:07:03] so it's just the legacy ones
[11:09:17] yeah for those even a one off playing with dbshell is fine, just make sure to include a changelog record
[11:09:53] volans: thanks yeah I'll do a dry-run kind of pass and see how it looks
[11:11:00] topranks: see for example https://phabricator.wikimedia.org/T271143#7953387
[11:12:42] that's nice. my go-to is usually pynetbox, but I guess it would be a lot quicker doing it directly like that
[11:13:33] volans: is netbox-next in a state to test on or should I hold off?
[11:13:41] give me 5
[11:13:56] no rush at all far from urgent
[11:19:01] topranks: there is a bug I need to dig into but after lunch
[11:19:32] cool no probs
[11:29:41] Here is the list of them anyway, looks safe to delete I think
[11:29:42] https://phabricator.wikimedia.org/P47517
[11:40:31] topranks: do they have any dns_name attached?
[11:40:45] otherwise +1
[11:42:43] XioNoX: Just re-checked, none of them have a dns_name
[11:43:35] cool, then fine by me
[11:44:14] cool, I'll wait and test with the delete op on netbox-next first just to be safe
[11:50:11] hopefully https://gerrit.wikimedia.org/r/c/operations/software/netbox-deploy/+/915591 will solve the last netbox-next issue
[12:42:06] topranks: you should be good with netbox-next
[12:42:26] XioNoX: thanks! nice work sorting that out :)
[20:37:42] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-web_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:37:42] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-web_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:17:13] (DiskSpace) firing: Disk space puppetmaster1001:9100:/ 5.954% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace