[07:57:25] FIRING: SystemdUnitFailed: statograph_post.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:25] RESOLVED: SystemdUnitFailed: statograph_post.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:25] FIRING: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:25] RESOLVED: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:00] question: if there is a degraded raid array, and it's present in the icinga web ui, should there already be a phab task open? it looks like that happens as an event handler so I'd assume 'yes'
[13:51:08] specifically, I'm looking at https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=aqs1012&service=MD+RAID, for which I can find no ticket
[14:06:07] this should have created a task, yes. so there's likely some kind of bug
[14:06:18] e.g. as https://phabricator.wikimedia.org/T381742 in the past
[14:07:19] Yes, urandom, I'm trying to find out what happened
[14:22:47] tappof: I think I found the applicable log(s), but they weren't helpful
[14:24:01] Yes, urandom. I've seen that the handler was called by Icinga, but it seems that it didn't actually run
[14:25:00] or at least the last executed run was on 2025-05-07
[14:30:40] that suspiciously aligns with the merge date of the work to add support for the new storcli: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143052 (that change and the related patches were merged on May 7)
[14:49:23] urandom: I called the script manually and it worked: https://phabricator.wikimedia.org/T395685
[14:49:55] I opened a task for observability-alert to further investigate the issue.
[14:53:14] https://phabricator.wikimedia.org/T395688
[14:55:37] tappof: great; thank you
[15:47:53] [non-urgent] hello o11y friends - I have a patch adding a new alert, using an idiom that doesn't _seem_ to have precedent in operations/alerts: a recording rule to avoid a subquery in the alert expr.
[15:47:53] is there someone on the team who would be interested in reviewing this, particularly from the standpoint of whether it's acceptable to introduce such a precedent? :)
[17:40:48] Hi swfrench-wmf! I don't see an issue with that, but I'd check with g.odog to be sure.
[17:41:41] cwhite: thanks! cool, I'll follow up with g.odog when he returns next week :)