[11:57:04] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:13:24] Emperor: o/
[12:13:43] found the right firmware for thanos-be, hopefully today/tomorrow we have the nodes ready to go
[12:13:52] we need to verify if the reimage issue still persists
[12:13:56] but the rest is fixed
[12:18:25] elukey: thanks for the update. If the reimage issue _does_ still persist, does that mean the node can be made to work and it'll just be a bit more annoying to do so, or is that still a showstopper?
[12:18:56] [I presume you still want the new ms-be* nodes for testing / bottoming out the remaining issues?]
[12:27:04] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:43:12] ^-- YA deletion race lost
[14:31:49] Emperor: back sorry, so in theory for thanos-be it need not be a showstopper given the urgency; it is annoying and it would be nice to fix it before we put the first UEFI node in prod, but not mandatory
[14:32:17] the pessimist in me is a bit reluctant to proceed without hesitation since there may be some hidden UEFI traps that we don't know about
[14:34:14] the main worry that I have is unwanted PXE Debian installs (say after a reboot for unknown unknowns that we haven't discovered yet)
[14:34:37] but it would also need a DHCP config injected, so we already have some protection in place
[14:35:06] maybe for storage nodes we could add another level of paranoia/precaution and set a partman recipe (post-install) that doesn't allow a full reimage
[14:35:13] IIRC we had something similar for db nodes
[14:59:19] elukey: Hm, I guess see if the reimage problem still persists, and work out where to go from there?
[15:00:07] sure sure
[15:00:30] but in any case, even if it doesn't persist, I'd put some extra fences on those hosts just to be sure
[15:00:41] I don't expect horrible things, but better safe than sorry :D
[15:29:09] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:30:20] That's the T372207 silence expiring again, I'll extend.
[15:30:20] T372207: Disk (sdc) failed on ms-be1058 - https://phabricator.wikimedia.org/T372207
[16:57:04] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1246:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:58:28] the alert panel is in a scary state, lots of ongoing alerts: db1246, dbproxy1026, an-redacteddb1001, etc.
[16:59:48] an-worker1088
[20:57:04] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1246:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed