[08:47:57] <federico3>	 should a degraded drive rebuild be high priority or "unbreak now"? https://phabricator.wikimedia.org/T420873
[08:56:21] <Emperor>	 federico3: I usually reckon disk failures are high, not UBN
[08:56:56] <Emperor>	 [rationale: our systems should survive disk failure, and a disk might fail on a Friday, so going a couple of days before it gets swapped should be OK, thus not UBN]
[10:15:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:21:33] <Emperor>	 looks like we may have lost thanos-be2006
[10:22:46] <Emperor>	 unresponsive on ssh serial console
[10:23:41] <Emperor>	 Description: A fatal error was detected on a component at bus 61 device 0 function 0.
[10:24:01] <Emperor>	 also Description: CPU 1 machine check error detected.
[10:24:21] <Emperor>	 power-cycling
[10:28:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:29:07] <Emperor>	 those getsel errors date from the 21st (which is also when puppet last ran OK), but it's booted back up OK now. Let's see how it goes...
[17:20:08] <federico3>	 db1170 is slowly pooling in after rebuilding the raid drive https://phabricator.wikimedia.org/T420873