[08:14:48] (PuppetFailure) firing: Puppet has failed on logging-hd1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:29:48] (PuppetFailure) resolved: Puppet has failed on logging-hd1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:52:47] Hey o11y peeps, I need some help with unit testing a new alert, https://gerrit.wikimedia.org/r/c/operations/alerts/+/1009493
[11:53:19] I'm not sure why my input series are not at least triggering the alert, never mind the right value in the description/summary
[11:58:02] claime: your alert uses rate(), but the test data is constant
[11:58:08] aaaah
[11:58:14] right
[11:58:24] so the rate is actually 0
[11:59:34] yep, just me being an idiot as usual, thanks taavi!
[13:44:48] (PuppetFailure) firing: Puppet has failed on logging-hd1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:24:48] (PuppetFailure) resolved: Puppet has failed on logging-hd1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:02:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:11:09] I just bounced prometheus@k8s on eqiad, proceeding with codfw
[16:11:33] I think I need to do the same for other k8s prometheus, looks like it's https://phabricator.wikimedia.org/T343529 again
[16:12:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:17:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:22:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:26:45] claime: need help?
[16:26:51] yeah
[16:27:19] I bounced prometheus@k8s in eqiad and codfw
[16:27:31] but I don't think it fixed the problem
[16:27:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:27:58] I've seen an OOM go through on prometheus2005, so that's probably part of the issue
[16:28:24] maybe thanos needs a bounce as well
[16:28:42] but I don't want to break things more, so I need an o11y adult or equivalent :p
[16:28:42] Mar 07 16:24:39 prometheus2006 thanos-sidecar@k8s[3211227]: level=info ts=2024-03-07T16:24:39.582397996Z caller=intrumentation.go:56 msg="changing probe status" status=ready
[16:29:11] don't have that same message on prom2005 tho
[16:30:52] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[16:30:54] ah
[16:30:57] claime: prom2005 is still loading the WAL
[16:31:45] so it's not actually serving yet
[16:32:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:32:35] it took about 20 minutes on prom2006?
[16:33:09] er, no, like 14
[16:33:59] Mar 07 16:32:44 prometheus2005 prometheus@k8s[2744166]: ts=2024-03-07T16:32:44.361Z caller=head.go:798 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=3m6.698411303s wal_replay_duration=8m33.278299472s wbl_replay_duration=3.941µs total_replay_duration=11m46.8813057s
[16:34:01] Mar 07 16:33:38 prometheus2005 prometheus@k8s[2744166]: ts=2024-03-07T16:33:38.119Z caller=main.go:1045 level=info fs_type=EXT4_SUPER_MAGIC
[16:34:03] Mar 07 16:33:38 prometheus2005 prometheus@k8s[2744166]: ts=2024-03-07T16:33:38.119Z caller=main.go:1048 level=info msg="TSDB started"
[16:34:05] Mar 07 16:33:38 prometheus2005 prometheus@k8s[2744166]: ts=2024-03-07T16:33:38.119Z caller=main.go:1230 level=info msg="Loading configuration file" filename=/srv/prometheus/k8s/prometheus.yml
[16:34:07] Mar 07 16:33:38 prometheus2005 prometheus@k8s[2744166]: ts=2024-03-07T16:33:38.178Z caller=main.go:1267 level=info msg="Completed loading of configuration file" filename=/srv/prometheus/k8s/prometheus.yml totalDuration=58.866015ms db_storage=1.264µs remote_storage=1.9µs web_handler=1.835µs query_engine=1.409µs scrape=243.786µs scrape_sd=1.440316ms notify=22.566µs notify_sd=16.642µs
[16:34:09] rules=54.205036ms tracing=11.038µs
[16:34:11] Mar 07 16:33:38 prometheus2005 prometheus@k8s[2744166]: ts=2024-03-07T16:33:38.178Z caller=main.go:1009 level=info msg="Server is ready to receive web requests."
[16:35:41] (PrometheusRuleEvaluationFailures) firing: (11) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[16:35:52] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[16:37:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:40:41] (PrometheusRuleEvaluationFailures) resolved: (11) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[16:40:52] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[16:41:08] hmm
[16:41:13] claime: better?
[16:41:35] yeah looks like metrics are back
[16:42:00] very strange
[16:42:04] ~30 minutes hole
[16:42:35] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: (3) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:47:50] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (2) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:48:51] *that's* about the sidecar for prometheus ops @ eqiad ?
[16:51:44] it's from the ops prometheus, but that doesn't mean it's that prometheus thanos can't access
[16:52:14] Mar 7 16:40:52 prometheus1005 thanos-sidecar@k8s[1998076]: level=warn ts=2024-03-07T16:40:52.501992447Z caller=intrumentation.go:67 msg="changing probe status" status=not-ready reason="perform GET request against http://localhost:9906/k8s/api/v1/status/config: Get \"http://localhost:9906/k8s/api/v1/status/config\": dial tcp [::1]:9906: connect: connection refused"
[16:53:29] looks like prometheus@k8s eqiad got oomkilled
[16:54:05] but I'm seeing it running so it may be loading WAL again
[16:55:00] Mar 7 16:52:33 prometheus1005 prometheus@k8s[1582505]: ts=2024-03-07T16:52:33.245Z caller=head.go:798 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=3m13.579923074s wal_replay_duration=8m16.749369822s wbl_replay_duration=195ns total_replay_duration=11m33.054446466s
[16:55:28] yeah there's still a couple minutes after that message until the "ready to receive web requests" one
[16:55:53] it responds to curl rn, so thanos should see its heartbeat soon-ish
[16:56:49] Mar 07 16:53:52 prometheus1005 thanos-sidecar@k8s[1998076]: level=info ts=2024-03-07T16:53:52.519825957Z caller=intrumentation.go:56 msg="changing probe status" status=ready
[16:57:05] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:57:56] it got oomkilled again >:|
[16:58:04] so now it's loading WAL again
[17:03:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[17:09:36] I'm taking a look. I'm trying to figure out the issue.
[17:10:37] claime: is it only prometheus1005 getting OOM-killed?
[17:10:57] cwhite: prometheus2005 had an oomkill as well iirc
[17:11:07] which hosts are affected?
[17:11:08] but it's not getting killed in a loop I think
[17:13:20] prometheus2005 and prometheus1005 were impacted, we were completely blind on all metrics from k8s for about half an hour, and partially for longer
[17:13:35] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[17:13:48] I didn't do a full debug, I went and bounced prometheus@k8s on all nodes in eqiad and codfw
[17:13:56] (first eqiad, then codfw)
[17:16:33] Looks like a repeat of T354399
[17:16:33] T354399: Prometheus @ k8s OOM loop - https://phabricator.wikimedia.org/T354399
[17:18:06] * cwhite tries the remedy
[17:19:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[17:29:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[17:39:35] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[17:44:50] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[17:54:50] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (3) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[18:09:50] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: (2) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[18:15:47] I've finished rolling through the WAL cleanup and restarts. Will continue to watch through the day for more OOMs.
[18:16:02] cwhite: Thank you!
[18:17:40] cwhite: I think that a cookbook to do the WAL cleanup and restarts could be useful in cases like this. What do you think?
[18:23:19] I wouldn't consider this a normal maintenance operation.
[18:25:06] I could see it used in emergencies like this though.
[18:43:56] I documented the issue and the steps to fix it in our Thanos runbooks: https://wikitech.wikimedia.org/wiki/Thanos#Thanos_sidecar_no_connection_to_started_Prometheus
[18:47:28] I noticed a section titled "Thanos Sidecar cannot connect to Prometheus" in our runbooks documentation that was empty. This seems closely related to the current alert. However, I've added this alert under "Thanos Sidecar no connection to started Prometheus" to distinguish between the two, as they might be separate alerts with similar symptoms.
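Side note on the "WAL replay completed" lines pasted above: the `*_replay_duration` fields are Go-style durations (`3m13.579923074s`, `195ns`), which are awkward to compare by eye across hosts. A minimal sketch, assuming only the field names seen in those log lines; `parse_go_duration` and `replay_durations` are hypothetical helper names, not part of any Prometheus tooling.

```python
import re

# Go duration units mapped to seconds (only the units that appear in the quoted logs,
# plus hours for completeness).
_UNITS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "µs": 1e-6, "us": 1e-6, "ns": 1e-9}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|µs|us|ms|s|m|h)")
_FIELD = re.compile(r"(\w+_replay_duration)=(\S+)")


def parse_go_duration(text):
    """Convert a Go-style duration such as '3m6.698411303s' to seconds."""
    if not text:
        raise ValueError("empty duration")
    pos = 0
    total = 0.0
    while pos < len(text):
        match = _PART.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse duration: {text!r}")
        total += float(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    return total


def replay_durations(log_line):
    """Return every *_replay_duration field of a log line, in seconds."""
    return {name: parse_go_duration(value) for name, value in _FIELD.findall(log_line)}


# Fields copied from the prometheus1005 line at 16:55:00.
line = (
    'msg="WAL replay completed" checkpoint_replay_duration=3m13.579923074s '
    "wal_replay_duration=8m16.749369822s wbl_replay_duration=195ns "
    "total_replay_duration=11m33.054446466s"
)
for name, seconds in replay_durations(line).items():
    print(f"{name}: {seconds:.3f}s")
```

This only covers the replay itself; as noted at 16:55:28, there was still a delay between that message and "Server is ready to receive web requests."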
[20:05:39] let's say you already have a dashboard for a given host that shows data about Apache but it just didn't have the exporter.. and then you install the prometheus-apache-exporter deb package.. is there any other step you would expect until data shows up? just wait a little bit?
[20:14:28] Hello mutante, if the metrics are already available via HTTP, the next step would be to configure Prometheus to scrape these metrics by adding a new job.
[20:14:28] I've included a link to our documentation on this subject for your reference. If you have any further inquiries, please don't hesitate to reach out: https://wikitech.wikimedia.org/wiki/Prometheus#Adding_new_metrics
[20:53:34] denisse: thank you
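Side note on the 11:58 exchange about the alert unit test: a constant input series gives a rate of 0, so an alert built on rate() cannot fire from it. A rough sketch of why, assuming promtool's `a+bxn` expanding notation for input series. `simple_rate` is a hypothetical helper that deliberately simplifies: it is not Prometheus' actual rate(), which also handles counter resets and extrapolation, and the series values here are made up for illustration.

```python
def expand_series(notation):
    """Expand the 'a+bxn' shorthand into n+1 values: a, a+b, ..., a+n*b."""
    start, rest = notation.split("+", 1)
    step, count = rest.split("x", 1)
    return [float(start) + float(step) * i for i in range(int(count) + 1)]


def simple_rate(values, interval_seconds):
    """Per-second increase between the first and last sample.

    Simplified on purpose: no counter-reset handling, no extrapolation.
    """
    if len(values) < 2:
        return None
    elapsed = interval_seconds * (len(values) - 1)
    return (values[-1] - values[0]) / elapsed


# Constant test data: the counter never moves, so the rate is zero.
constant = expand_series("5+0x10")
print(simple_rate(constant, 60))

# Increasing test data: the counter grows by 60 per one-minute interval.
increasing = expand_series("0+60x10")
print(simple_rate(increasing, 60))
```

With an input series that actually increases, the rate becomes non-zero and the alert expression has something to compare against its threshold.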