[02:18:33] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:33] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:01:43] jbond: can bre.tt's above backlog and https://phabricator.wikimedia.org/T342537#9240999 both be related to the changes to reimage and late command?
[09:06:48] volans: could be worth a "rollback and verify" as it's a blocker
[09:07:07] do any reimage fail? or only some?
[09:07:34] I know that john was testing the various scenarios so I'd like his input on the current status before touching it
[09:09:21] i'll kick off a reimage now
[09:09:52] volans: there is also https://gerrit.wikimedia.org/r/c/operations/puppet/+/964959
[09:10:58] +1ed
[09:11:21] thx
[09:15:48] I'm currently looking at cloudvirt1064
[09:16:01] if the installer logs show anything interesting
[09:18:26] ack cheers, i did have an issue reimaging a bookworm with puppet 7, but at least yesterday bookworm puppet 5 worked fine
[09:19:59] the reimage of cloudvirt1064 was interrupted because it failed to read from /tmp/puppet_version
[09:20:07] L13 in the current version as merged in puppet.git
[09:20:58] * volans rebooting irc bouncer, brb
[09:21:01] Papaul attempted a bullseye installation using Puppet 5
[09:24:00] I think this was just a timing issue, once https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/964007 is merged this should no longer happen
[09:24:07] moritzm: ack i'm doing a bullseye with puppet 5 now
[09:24:46] but it would equally fail, right? without https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/964007 merged, nothing writes to /tmp/puppet_version
[09:25:14] moritzm: yes, i think i can send a quick patch to late_command to just assume puppet 5 if the file is not there
[09:26:05] agreed, I think that would be a sane fallback
[09:26:40] so to summarize, the problem was that late_command was merged but the reimage change was not, because it's still under testing, and late_command didn't have a backward-compatible fallback, correct?
[09:26:54] correct
[09:29:13] moritzm: volans: https://gerrit.wikimedia.org/r/c/operations/puppet/+/965061
[09:32:52] volans: indeed
[09:33:01] jbond: looks good, +1d
[09:33:05] cheers
[09:33:22] I'll drop Papaul a quick note on the cloudvirt1064 task
[09:33:29] +1ed too
[09:33:33] thx
[09:34:14] brett: this is the fix ^^^ by the time you'll be online it should be tested and safe to retry
[10:08:42] fyi (cc brett) i have now tested a buster reimage and all went fine
[10:18:58] great
[11:14:07] netops, Infrastructure-Foundations, SRE: Change EPVN RR setup to use single BGP group and different cluster ID on every RR - https://phabricator.wikimedia.org/T348583 (cmooney) Open→Resolved Changes pushed to production, closing task.
[11:55:18] jbond: Okay to merge your Puppet patch?
[11:55:28] slyngs: yes please do
[11:56:08] Done
[11:56:14] thanks
[12:21:50] SRE-tools, DNS, Infrastructure-Foundations, SRE, serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (Volans) I really think that we need to find a solution for this. It has been pending for too long. Today I did a check of the dns repository a...
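(editor's note) The fallback agreed on in the 09:25 to 09:26 exchange above, i.e. assume Puppet 5 when /tmp/puppet_version does not exist, could be sketched roughly as follows. This is only an illustration in Python, not the actual late_command patch from change 965061; the function name `read_puppet_version`, the assumption that the file holds a single integer, and the choice to also fall back on empty or unparseable content are all hypothetical.

```python
from pathlib import Path

# Fallback discussed in the log: assume Puppet 5 if the file is not there.
DEFAULT_PUPPET_VERSION = 5


def read_puppet_version(path="/tmp/puppet_version", default=DEFAULT_PUPPET_VERSION):
    """Return the Puppet major version recorded in `path`, or `default`.

    Assumption (not from the log): the file contains just an integer
    such as "5" or "7", optionally followed by a newline.
    """
    try:
        raw = Path(path).read_text().strip()
    except FileNotFoundError:
        # Nothing wrote the file (e.g. the cookbook change is not merged yet).
        return default
    try:
        return int(raw)
    except ValueError:
        # Assumption: treat empty or garbled content like a missing file.
        return default
```

With this shape an installer that never received the file behaves as it did before the version switch existed, which is the backward-compatible behaviour the summary at 09:26:40 says was missing.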
[14:03:33] (SystemdUnitFailed) firing: nginx.service Failed on apt1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:35] (SystemdUnitFailed) firing: (3) nginx.service Failed on apt1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:19:58] volans, jbond: Thanks for all your work!
[17:13:35] Regarding sre.hosts.reimage, would it make sense to default to depooling the hosts that are being reimaged? I got bitten by thinking it did that by default and have been reimaging while still pooled
[17:22:53] not all hosts are "depoolable" in the standard pybal way, some need to be depooled in a different way, some are not pooled at all
[17:23:43] at most we could add a check if the host is part of any conftool pool and ask the user, but that's just one specific use case
[17:24:16] surely the most common, but there are various others
[17:24:35] * volans has to go afk now, will comment tomorrow on any followup (cc brett)
[18:47:40] netops, Infrastructure-Foundations, SRE: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (cmooney) I lab tested this and the "always-compare-med" command works as expected (see P52912). >>! In T348446#9238640, @ayounsi wrote: > Some of our...
[18:59:22] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:58:33] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-wikifunctions_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:33] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-wikifunctions_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:58:33] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-web_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed