[08:47:57] should a degraded drive rebuild be high priority or "unbreak now"? https://phabricator.wikimedia.org/T420873
[08:56:21] federico3: I usually reckon disk failures are high, not UBN
[08:56:56] [rationale: our systems should survive disk failure, and a disk might fail on a Friday, so going a couple of days before it gets swapped should be OK, thus not UBN]
[10:15:32] FIRING: SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:21:33] looks like we may have lost thanos-be2006
[10:22:46] unresponsive on ssh serial console
[10:23:41] Description: A fatal error was detected on a component at bus 61 device 0 function 0.
[10:24:01] also Description: CPU 1 machine check error detected.
[10:24:21] power-cycling
[10:28:26] RESOLVED: SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:29:07] those getsel errors date from the 21st (which is also when puppet last ran OK), but it's booted back up OK now. Let's see how it goes...
[17:20:08] db1170 is slowly pooling in after rebuilding the raid drive https://phabricator.wikimedia.org/T420873