[02:44:41] FIRING: [4x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[06:44:41] FIRING: [4x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[10:44:41] FIRING: [4x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[11:33:16] hey team, XioNoX and I (on call) observed some probes failing
[11:33:35] we think it is related to the prometheus hosts themselves, not the monitored services
[11:33:53] https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?orgId=1&from=now-24h&to=now&viewPanel=2
[11:34:04] Maybe some kind of overload
[11:35:53] pinging tappof as I think you're the closest timezone-wise
[11:36:10] yeah, I was checking, as Filippo is out
[11:40:42] in any case, is it normal that so many services report 50% availability?
[11:41:06] could there be some misconfiguration?
[11:41:44] https://grafana.wikimedia.org/goto/M1yv2SnHR?orgId=1
[11:42:17] (e.g. the probes are checking old endpoints?)
[11:45:22] thanos-swift looks normal, though
[11:48:49] things seem to be going better, but maybe they can check it later
[11:51:04] I was looking at various host and prometheus monitoring dashboards, but can't find anything out of the ordinary
[11:51:28] nothing relevant about prometheus in SAL either
[14:14:06] XioNoX: I'll take a look
[14:44:41] FIRING: [4x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[15:09:41] RESOLVED: [4x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance prometheus1005:9900) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures