[00:45:25] RESOLVED: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:40] good morning!! I need help figuring out why this alert (or its test) is not passing properly: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1127041 any idea?
[08:53:34] aloha XioNoX, I'll take a look
[08:56:03] <3
[09:17:34] XioNoX: I left some comments directly on Gerrit for simplicity
[09:20:17] tappof: awesome, thanks!
[09:27:28] tappof: interesting, running CI locally still doesn't see the issue as fixed, but in gerrit it's fine
[09:32:32] @godog Thank you for the input last week. The conn track monitoring is not successfully removed from Icinga, with minimal noise.
[09:34:22] slyngs: s/not/now/ ? :)
[09:34:30] XioNoX: are you running the CI locally using docker?
[09:34:35] tappof: yeah
[09:34:45] XioNoX: Yeeeah :-)
[09:34:48] `docker run --entrypoint tox alerts-tests`
[09:35:30] Did you rebuild the container after applying the latest changes?
[09:36:10] XioNoX:
[09:37:37] ahh, no
[09:37:41] :)
[09:37:51] I didn't know that was needed
[09:38:09] I thought it was just needed for the first run and then it would pick up whatever was local
[09:38:40] The container does not bind-mount the current directory, so you need to rebuild it every time
[09:42:08] noted
[10:06:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:58:57] slyngs: woot woot!! nicely done
[11:00:40] good point re: the fact that the container needs rebuilding, the docs are not clear on that point
[12:56:35] is there a way to know if those alerts have triggered over the last few days? https://gerrit.wikimedia.org/r/c/operations/alerts/+/1126966
[12:57:45] XioNoX: yes, check out the alerts overview dashboard https://logstash.wikimedia.org/goto/f3e6181b03de7d5ca37a80e83990ae65
[13:07:10] ah right! thx
[13:25:49] godog: looks like it's working decently well: https://alerts.wikimedia.org/?q=%40cluster%3Dwikimedia.org&q=team%3Dsre&q=alertname%3DCoreRouterInterfaceDown https://phabricator.wikimedia.org/T389071 :)
[13:26:52] XioNoX: \o/ \o/ very cool
[13:27:33] godog: I think a netops tag will be needed at some point, to have an overview :)
[13:30:43] XioNoX: indeed, can't chat now but happy to later
[13:30:58] no rush
[13:52:58] FIRING: PrometheusLowRetention: Prometheus k8s-aux is storing less than 20 days of data on prometheus2007:9911. - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-Prometheus=prometheus2007:9911 - https://alerts.wikimedia.org/?q=alertname%3DPrometheusLowRetention
[14:06:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:46:25] RESOLVED: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
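
Note on the 09:38:40 point (the container does not bind-mount the current directory, so local edits are only seen after a rebuild): a minimal sketch of a rebuild-then-test helper. Only the `docker run --entrypoint tox alerts-tests` command comes from the log. The function name `rebuild_and_test`, the `DOCKER` and `IMAGE` variables, and the assumption that the image is built from a Dockerfile in the current directory are all hypothetical and may not match the actual repository layout.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the alerts repository.
# Assumption: the "alerts-tests" image is built from a Dockerfile in the
# current directory. DOCKER can be overridden (e.g. DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"
IMAGE="${IMAGE:-alerts-tests}"

rebuild_and_test() {
    # Rebuild first: the image holds a copy of the tree made at build time,
    # so edits made afterwards are invisible until the image is rebuilt.
    "$DOCKER" build -t "$IMAGE" . &&
    # Then run tox as the entrypoint, exactly as quoted in the log.
    "$DOCKER" run --entrypoint tox "$IMAGE"
}
```

Calling `rebuild_and_test` from the checkout would rebuild and run the tests in one step. Setting `DOCKER=echo` beforehand only prints the two commands without running them.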