[02:18:33] <jinxer-wm>	 (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:33] <jinxer-wm>	 (SystemdUnitFailed) resolved: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:01:43] <volans>	 jbond: can bre.tt's above backlog and https://phabricator.wikimedia.org/T342537#9240999 be both related to the changes to reimage and late command?
[09:06:48] <XioNoX>	 volans: could be worth a "rollback and verify" as it's a blocker
[09:07:07] <volans>	 do any reimage fail? or only some?
[09:07:34] <volans>	 I know that john was testing the various scenarios so I'd like his input on the current status before touching it
[09:09:21] <jbond>	 i'll kick off a reimage now
[09:09:52] <jbond>	 volans: there is also https://gerrit.wikimedia.org/r/c/operations/puppet/+/964959 
[09:10:58] <volans>	 +1ed
[09:11:21] <jbond>	 thx
[09:15:48] <moritzm>	 I'm currently looking at cloudvirt1064
[09:16:01] <moritzm>	 if the installer logs show anything interesting
[09:18:26] <jbond>	 ack cheers, i did have an issue reimaging a bookworm with puppet 7, but at least yesterday bookworm puppet 5 worked fine
[09:19:59] <moritzm>	 the reimage of cloudvirt1064 was interrupted because it failed to read from /tmp/puppet_version
[09:20:07] <moritzm>	 L13 in the current version as merged in puppet.git
[09:20:58] * volans rebooting irc bouncer, brb
[09:21:01] <moritzm>	 Papaul attempted a bullseye installation using Puppet 5
[09:24:00] <moritzm>	 I think this was just a timing issue, once https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/964007 is merged this should no longer happen
[09:24:07] <jbond>	 moritzm: ack, i'm doing a bullseye with puppet 5 now
[09:24:46] <moritzm>	 but it would equally fail, right? without https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/964007 merged, nothing writes to /tmp/puppet_version
[09:25:14] <jbond>	 moritzm: yes, i think i can send a quick patch to late_command to just assume puppet 5 if the file is not there
[09:26:05] <moritzm>	 agreed, I think that would be a sane fallback
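The fallback agreed on above can be sketched roughly as follows. This is a minimal illustration only, not the actual patch (that is the Gerrit change linked later in the log). The function name `puppet_version` is hypothetical, and the assumption that /tmp/puppet_version holds a bare major version number such as `5` or `7` is not confirmed by the discussion.

```python
from pathlib import Path


def puppet_version(path="/tmp/puppet_version", default=5):
    """Return the Puppet major version recorded in `path`.

    Assumption: the file contains only a major version number.
    If the file is missing, empty, or not numeric, fall back to
    `default` (Puppet 5) instead of aborting the install.
    """
    try:
        text = Path(path).read_text().strip()
    except FileNotFoundError:
        # Nothing wrote the file (e.g. the cookbook change is not merged yet).
        return default
    return int(text) if text.isdigit() else default
```

With a fallback of this shape, a reimage started before anything writes the version file would continue with Puppet 5 rather than being interrupted.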
[09:26:40] <volans>	 so to summarize, the problem was that the late_command change was merged but the reimage change was not, because it's still under testing, and late_command didn't have a backward compatible fallback, correct?
[09:26:54] <jbond>	 correct
[09:29:13] <jbond>	 moritzm: volans: https://gerrit.wikimedia.org/r/c/operations/puppet/+/965061
[09:32:52] <moritzm>	 volans: indeed
[09:33:01] <moritzm>	 jbond: looks good, +1d
[09:33:05] <jbond>	 cheers
[09:33:22] <moritzm>	 I'll drop Papaul a quick note on the cloudvirt1064 task
[09:33:29] <volans>	 +1ed too
[09:33:33] <jbond>	 thx
[09:34:14] <volans>	 brett: this is the fix ^^^ by the time you'll be online it should be tested and safe to retry
[10:08:42] <jbond>	 fyi (cc brett) i have now tested a buster reimage and all went fine
[10:18:58] <volans>	 great
[11:14:07] <wikibugs>	 netops, Infrastructure-Foundations, SRE: Change EPVN RR setup to use single BGP group and different cluster ID on every RR - https://phabricator.wikimedia.org/T348583 (cmooney) Open→Resolved Changes pushed to production, closing task.
[11:55:18] <slyngs>	 jbond: Okay to merge your Puppet patch?
[11:55:28] <jbond>	 slyngs: yes please do
[11:56:08] <slyngs>	 Done
[11:56:14] <jbond>	 thanks
[12:21:50] <wikibugs>	 SRE-tools, DNS, Infrastructure-Foundations, SRE, serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (Volans) I really think that we need to find a solution for this. It has been pending for too long.  Today I did a check of the dns repository a...
[14:03:33] <jinxer-wm>	 (SystemdUnitFailed) firing: nginx.service Failed on apt1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:35] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) nginx.service Failed on apt1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:19:58] <brett>	 volans, jbond: Thanks for all your work!
[17:13:35] <brett>	 Regarding sre.hosts.reimage, would it make sense to default to depooling the hosts that are being reimaged? I got bitten by thinking it did that by default and have been reimaging while still pooled
[17:22:53] <volans>	 not all hosts are "depoolable" in the standard pybal way, some need to be depooled in a different way, some are not pooled at all
[17:23:43] <volans>	 at most we could add a check if the host is part of any conftool pool and ask the user, but that's just one specific use case
[17:24:16] <volans>	 surely the most common, but there are various others
[17:24:35] * volans has to go afk now, will comment tomorrow on any followup (cc brett)
[18:47:40] <wikibugs>	 netops, Infrastructure-Foundations, SRE: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (cmooney) I lab tested this and the "always-compare-med" command works as expected (see P52912).  >>! In T348446#9238640, @ayounsi wrote: > Some of our...
[18:59:22] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:58:33] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-wikifunctions_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:33] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-wikifunctions_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:58:33] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-web_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed