[19:08:35] FIRING: ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[19:09:54] hi folks. something is up with grafana but more specifically prometheus1005
[19:10:20] sukhe: Here, I'll take a look.
[19:10:31] thanks!
[19:10:35] Thanks to both of you!
[19:13:35] FIRING: [2x] ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[19:18:34] denisse: also look at the dmesg output on prom1005; prom@k8s.service got OOM'ed
[19:18:35] RESOLVED: [2x] ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[19:18:38] might be worthwhile to see why
[19:18:41] but thanks for fixing it!
[19:19:39] sukhe: Thanks, I'm looking at it. It correlates with an issue in the logs of prometheus@ops.service
[19:20:54] The symptoms are very similar to the ones described in here: https://phabricator.wikimedia.org/T354399
[19:22:19] And we have an ~3 min gap in the graphs: https://grafana.wikimedia.org/goto/XBjONj2NR?orgId=1
[19:22:21] ah yeah, seems familiar
[19:25:03] I think I'll re-open the task and add this occurrence in it.