[00:47:25] RESOLVED: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:30:43] FIRING: [10x] ThanosSidecarBucketOperationsFailed: Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed
[13:30:16] anybody from o11y around?
[13:30:30] we're discussing on -operations an issue with alerts.w.o and the probe pages
[14:19:57] to close the loop in this channel quickly -- we found stale alerts in alertmanager with alertname=ProbeDown, they didn't line up with something actually firing in prometheus and a restart of alertmanager on alert1001 cleared them
[14:20:39] fun
[16:48:25] FIRING: SystemdUnitFailed: ledmon.service on centrallog1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:49:18] ^ This is expected, jclark-ctr and I are working on it.
[17:05:55] RESOLVED: SystemdUnitFailed: ledmon.service on centrallog1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:12:45] what was this about? https://icinga-extmon.wikimedia.org/
[17:12:54] and expected that I get Forbidden there?
[17:13:22] sounds like external monitoring / points to alert1001
[17:23:46] mutante: https://wikitech.wikimedia.org/wiki/Wikitech-static#Meta-monitoring
[17:27:23] cdanis: thank you! oh, I made wrong assumptions like "if it's external then how can it also be on alert1001"
[17:27:40] that is what is being monitored