[10:51:27] alertmanager-eqiad.wikimedia.org does not serve a proper cert, is that expected?
[10:51:28] The certificate is only valid for the following names: alert1001.wikimedia.org, alert2001.wikimedia.org, alerts.wikimedia.org, icinga-extmon.wikimedia.org, icinga.wikimedia.org, klaxon.wikimedia.org
[10:52:36] jayme: yes, as the expected usage is internal and over http, though we could fix that
[10:53:07] ah, I clicked a link in karma following the creation of a silence - that led me there
[10:54:57] heh, yeah that's confusing, Debian doesn't ship the alertmanager UI and we don't either; in theory that would point to the AM UI
[11:49:16] unfortunately rsyslog is still messing up, godog :/ https://phabricator.wikimedia.org/T357616#9735153
[11:49:59] I now see that we also index the events from k8s with the wrong timestamp (timestamp == ingestion time, not log time)
[12:22:35] jayme: sigh, ok
[12:23:29] I'll raise this in today's k8s-sig meeting... maybe somebody has capacity to look into alternatives
[12:23:49] but we'd probably need at least some support from o11y I suppose
[12:24:39] as it is now, it's not really sustainable, as we never know if there was no log or we just did not collect it...
[12:26:21] jayme: yeah totally, I'm for applying a bandaid in the form of periodically restarting rsyslog on k8s nodes, not ideal of course but it'll work
[12:27:59] yeah... maybe even every n<24 hours, as we still lose logs when a pod is removed/re-scheduled and rsyslog hasn't processed the logfile yet
[12:29:51] totally, I'll get to it jayme
[12:38:24] ah, that would be nice! I've just prepared a change to remove the fd leak bandaid. That did not get triggered since the latest rsyslog update
[12:41:40] Quick question: if we have added a simple file target to the analytics prometheus instance, what is the best way to refresh the config? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023085/2/modules/profile/manifests/prometheus/analytics.pp#313
[12:44:58] Is it `systemctl restart prometheus@analytics` or is the `/-/reload` endpoint used, at all?
[12:46:15] btullis: config is reloaded automatically, the problem is that the patch AFAICS isn't complete, there's nothing to associate the new file with a job
[12:46:40] Ah, sorry.
[12:46:56] oof okay good to know!
[12:49:26] godog: is it the "labels" which should be added?
[12:52:04] awight: no, in addition to the file you added you also need to define a job for prometheus to pick up said file, see for example lines 115 to 121
[12:52:36] and make sure said job is part of the list in scrape_configs_extra at line 335
[12:52:44] thanks!
[12:53:00] sure np
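For reference, a minimal sketch of the two pieces godog describes: a scrape job that points Prometheus at the new targets file via file-based service discovery. The job name, file path, host, and port below are invented for illustration; the real values live in the linked puppet patches.

```yaml
# Hypothetical sketch only: job name, path and target are illustrative,
# not the actual values from the puppet change under review.
scrape_configs:
  - job_name: example_exporter
    file_sd_configs:
      - files:
          - '/srv/prometheus/analytics/targets/example_exporter_*.yaml'

# The targets file itself then lists hosts plus optional labels, e.g.
# /srv/prometheus/analytics/targets/example_exporter_test.yaml:
# - targets:
#     - 'target-host.example.org:9100'
#   labels:
#     cluster: analytics
```

Once the job exists, Prometheus re-reads file_sd targets files on change without any reload; only changes to the scrape config itself need the (automated) reload mentioned above.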
[13:07:58] godog: My rudimentary attempt to fix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023424
[13:08:18] (kk I see you already reviewed)
[13:13:07] yeah, easy enough
[13:13:47] jayme: sth like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023422 and we can apply it where needed
[13:14:40] awight: I'll merge it
[13:14:43] ty!
[15:50:14] godog: thanks. Feel free to include it in the kubernetes::node profile right away, I'd say
[15:57:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[15:57:41] (PrometheusRuleEvaluationFailures) firing: (27) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[15:57:52] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[15:58:17] ^ Looking.
[16:02:41] (PrometheusRuleEvaluationFailures) resolved: (37) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[16:03:17] BTW, prometheus1005 has netbox status failed: https://netbox.wikimedia.org/dcim/devices/3649/
[16:07:33] The errors graph is starting to look healthy https://grafana.wikimedia.org/goto/N-nxaNfIg?orgId=1 so I think the usual remedy may not be necessary: https://wikitech.wikimedia.org/wiki/Thanos#Thanos_sidecar_no_connection_to_started_Prometheus
[16:07:41] I'll be monitoring it.
[16:15:22] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[17:05:47] denisse: thanks, yes, should be recovering since T360687
[17:05:47] T360687: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687
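As a rough illustration of the periodic rsyslog-restart bandaid discussed above (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023422, suggested for the kubernetes::node profile): a plain systemd timer sketch. The unit names and the 6h cadence are invented here; the actual change is puppetized and may look different.

```ini
# Hypothetical sketch: unit names and cadence are invented, not taken
# from the actual puppet change.

# /etc/systemd/system/rsyslog-restart.timer
[Unit]
Description=Periodically restart rsyslog to work around log loss (T357616)

[Timer]
OnBootSec=6h
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target

# /etc/systemd/system/rsyslog-restart.service
[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl restart rsyslog.service
```

As jayme notes, any cadence under 24 hours only bounds the damage: logs are still lost when a pod is removed or re-scheduled before rsyslog has processed its logfile.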