[11:57:04] <jinxer-wm>	 FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:13:24] <elukey>	 Emperor: o/
[12:13:43] <elukey>	 found the right firmware for thanos-be, hopefully today/tomorrow we have the nodes ready to go
[12:13:52] <elukey>	 we need to verify if the reimage issue still persists
[12:13:56] <elukey>	 but the rest is fixed
[12:18:25] <Emperor>	 elukey: thanks for the update. If the reimage issue _does_ still persist, does that mean the node can be made to work but it'll just be a bit more annoying to do so, or is that still a showstopper?
[12:18:56] <Emperor>	 [I presume you still want the new ms-be* nodes for testing / bottoming out the remaining issues? ]
[12:27:04] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:43:12] <Emperor>	 ^-- YA deletion race lost
[14:31:49] <elukey>	 Emperor: back, sorry. So in theory for thanos-be it might not be a showstopper given the urgency; it is annoying and it would be nice to fix it before we put the first UEFI node in prod, but it's not mandatory
[14:32:17] <elukey>	 the pessimist in me is a bit reluctant to proceed without hesitation since there may be some hidden UEFI traps that we don't know about
[14:34:14] <elukey>	 the main worry that I have is unwanted PXE debian installs (say after a reboot for unknown unknowns that we haven't discovered yet) 
[14:34:37] <elukey>	 but it would also need a DHCP config injected, so we already have some protection in place
[14:35:06] <elukey>	 maybe for storage nodes we could add another level of paranoia/precaution and set a partman recipe (post-install) that doesn't allow a full reimage
[14:35:13] <elukey>	 IIRC we had something similar for db nodes
[14:59:19] <Emperor>	 elukey: Hm, I guess see if the reimage problem still persists, and work out where to go from there?
[15:00:07] <elukey>	 sure sure
[15:00:30] <elukey>	 but in any case, even if it doesn't persist, I'd put some extra fences on those hosts just to be sure
[15:00:41] <elukey>	 I don't expect horrible things, but better safe than sorry :D
[15:29:09] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:30:20] <Emperor>	 That's the T372207 silence expiring again, I'll extend.
[15:30:20] <stashbot>	 T372207: Disk (sdc) failed on ms-be1058 - https://phabricator.wikimedia.org/T372207
[16:57:04] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1246:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:58:28] <jynus>	 the alert panel is in a scary status, lots of ongoing alerts, db1246, dbproxy1026, an-redacteddb1001, etc
[16:59:48] <jynus>	 an-worker1088
[20:57:04] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1246:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed