[09:33:25] FIRING: SystemdUnitFailed: export_smart_data_dump.service on dbprov2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:14:00] ^expected?
[10:14:23] federico3: can you check? ^
[10:15:17] looking
[10:15:22] thank you!
[10:18:58] indeed there was a timeout in a raid status reporting tool https://phabricator.wikimedia.org/P75610
[10:20:03] if that persists, it may be a sign the host is unhappy
[10:22:04] it only timed out once, other runs are returning immediately
[10:23:25] RESOLVED: SystemdUnitFailed: export_smart_data_dump.service on dbprov2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:23:30] cool :)
[10:23:39] the systemd unit does not have a retry mechanism. The script is pretty simple and is opening /proc/bus/pci/devices and /proc/mdstat - the fact that it timed out doing that it's pretty sus
[10:23:58] unfortunately it does not log its activity - is this a known bug?
[10:26:02] there are a bunch of phab issues (if you search for smart-data-dump)
[10:26:21] e.g. T267135
[10:26:22] T267135: smart-data-dump should fail loudly when it can't gather metrics - https://phabricator.wikimedia.org/T267135
[10:42:56] Could I get a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1139810 please? The new apus backends will have a boss card, so I don't know how the SSDs will present to the OS, so set them to use manual setup until we can figure out what actually needs doing with them...
[10:46:20] thanks :)
[10:46:30] :)
[14:14:25] FIRING: SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:25] RESOLVED: SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed