[15:23:25] FIRING: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:28:25] RESOLVED: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:37:27] FIRING: IcingaOverload: Checks are taking long to execute on alert1002:9245 - https://wikitech.wikimedia.org/wiki/Icinga#IcingaOverload - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org/?q=alertname%3DIcingaOverload
[18:02:27] FIRING: [2x] IcingaOverload: Checks are taking long to execute on alert1002:9245 - https://wikitech.wikimedia.org/wiki/Icinga#IcingaOverload - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org/?q=alertname%3DIcingaOverload
[18:12:02] howdy, moving this over from #wikimedia-operations. we (fr-tech) have hit a case of all of our passive checks getting into an awol state. this is similar to what we have seen before in https://phabricator.wikimedia.org/T196336. c.danis restarted the nsca service for us and it started to recover but is back to primarily awol responses.
[18:13:25] i'm not sure how to proceed here, but it appears that the nsca restarts may only help for a short time. Our checks are still running cleanly on the hosts and not reporting any timeout errors.
[18:20:08] and judging from T393630, external commands are not being processed either. Besides a more in-depth investigation into why this is happening, I would personally consider an icinga restart to see if that helps. My 2 cents ;)
[18:20:08] T393630: Cookbook downtiming does not work, continues anyway - https://phabricator.wikimedia.org/T393630
[18:20:34] brett: ^^^ FYI
[18:21:39] Thanks!
[18:27:42] there was a restart at 2025-05-07T16:58:41Z and it persists. not saying another won't help, just giving timing.
[18:29:02] dwisehaupt: only for nsca, not icinga itself
[18:29:05] AFAICT
[18:30:26] oh. yes. sorry, i conflated the two.
[18:47:27] FIRING: [2x] IcingaOverload: Checks are taking long to execute on alert1002:9245 - https://wikitech.wikimedia.org/wiki/Icinga#IcingaOverload - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org/?q=alertname%3DIcingaOverload
[19:07:27] RESOLVED: IcingaOverload: Checks are taking long to execute on alert2002:9245 - https://wikitech.wikimedia.org/wiki/Icinga#IcingaOverload - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org/?q=alertname%3DIcingaOverload
[19:09:31] fwiw I bounced icinga in eqiad shortly after the 2x alert above, and just restarted codfw now. sorry for the delay, I had to step away for school pickup and forgot to !log it
[19:19:04] herron: thanks for that. it looks like it's cleared up for us now.
[19:20:19] dwisehaupt: ok good, glad to hear it's cleared up
[19:20:39] .wg 3