[10:51:27] alertmanager-eqiad.wikimedia.org does not serve a proper cert, is that expected?
[10:51:28] The certificate is only valid for the following names: alert1001.wikimedia.org, alert2001.wikimedia.org, alerts.wikimedia.org, icinga-extmon.wikimedia.org, icinga.wikimedia.org, klaxon.wikimedia.org
[10:52:36] jayme: yes, as the expected usage is internal and over http, though we could fix that
[10:53:07] ah, I clicked a link in karma following the creation of a silence - that led me there
[10:54:57] heh, yeah that's confusing, Debian doesn't ship the alertmanager UI and we don't either; in theory that would point to the AM UI
[11:49:16] unfortunately rsyslog is still messing up, godog :/ https://phabricator.wikimedia.org/T357616#9735153
[11:49:59] I now see that we also index the events from k8s with the wrong timestamp (timestamp == ingestion time, not log time)
[12:22:35] jayme: sigh, ok
[12:23:29] I'll raise this in today's k8s-sig meeting... maybe somebody has capacity to look into alternatives
[12:23:49] but we'd probably need at least some support from o11y I suppose
[12:24:39] as it is now, it's not really sustainable, as we never know if there was no log or we just did not collect it...
[12:26:21] jayme: yeah totally, I'm for applying a bandaid in the form of periodically restarting rsyslog on k8s nodes, not ideal of course but it'll work
[12:27:59] yeah... maybe even every n<24 hours, as we still lose logs when a pod is removed/re-scheduled and rsyslog hasn't processed the logfile yet
[12:29:51] totally, I'll get to it jayme
[12:38:24] ah, that would be nice! I've just prepared a change to remove the fd leak bandaid. That did not get triggered since the latest rsyslog update
[12:41:40] Quick question: if we have added a simple file target to the analytics prometheus instance, what is the best way to refresh the config? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023085/2/modules/profile/manifests/prometheus/analytics.pp#313
[12:44:58] Is it `systemctl restart prometheus@analytics` or is the `/-/reload` endpoint used, at all?
[12:46:15] btullis: config is reloaded automatically, the problem is that the patch AFAICS isn't complete, there's nothing to associate the new file with a job
[12:46:40] Ah, sorry.
[12:46:56] oof okay good to know!
[12:49:26] godog: is it the "labels" which should be added?
[12:52:04] awight: no, in addition to the file you added you also need to define a job for prometheus to pick up said file, see for example lines 115 to 121
[12:52:36] and make sure said job is part of the list in scrape_configs_extra at line 335
[12:52:44] thanks!
[12:53:00] sure np
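For reference, a minimal sketch of the two pieces godog describes: a scrape job that points Prometheus at the new targets file via file-based service discovery. The job name, file path, host, and port below are invented for illustration; the real values live in the linked puppet patches.

```yaml
# Hypothetical sketch only: job name, path and target are illustrative,
# not the actual values from the puppet change under review.
scrape_configs:
  - job_name: example_exporter
    file_sd_configs:
      - files:
          - '/srv/prometheus/analytics/targets/example_exporter_*.yaml'

# The targets file itself then lists hosts plus optional labels, e.g.
# /srv/prometheus/analytics/targets/example_exporter_test.yaml:
# - targets:
#     - 'target-host.example.org:9100'
#   labels:
#     cluster: analytics
```

Once the job exists, Prometheus re-reads file_sd targets files on change without any reload; only changes to the scrape config itself need the (automated) reload mentioned above.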
[13:07:58] godog: My rudimentary attempt to fix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023424
[13:08:18] (kk I see you already reviewed)
[13:13:07] yeah, easy enough
[13:13:47] jayme: sth like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023422 and we can apply it where needed
[13:14:40] awight: I'll merge it
[13:14:43] ty!
[15:50:14] godog: thanks. Feel free to include it in the kubernetes::node profile right away, I'd say
[15:57:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[15:57:41] (PrometheusRuleEvaluationFailures) firing: (27) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[15:57:52] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[15:58:17] ^ Looking.
[16:02:41] (PrometheusRuleEvaluationFailures) resolved: (37) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[16:03:17] BTW, prometheus1005 has netbox status failed: https://netbox.wikimedia.org/dcim/devices/3649/
[16:07:33] The errors graph is starting to look healthy https://grafana.wikimedia.org/goto/N-nxaNfIg?orgId=1 so I think the usual remedy may not be necessary: https://wikitech.wikimedia.org/wiki/Thanos#Thanos_sidecar_no_connection_to_started_Prometheus
[16:07:41] I'll be monitoring it.
[16:15:22] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[17:05:47] denisse: thanks, yes, should be recovering since T360687
[17:05:47] T360687: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687
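As a rough illustration of the periodic rsyslog-restart bandaid discussed above (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023422, suggested for the kubernetes::node profile): a plain systemd timer sketch. The unit names and the 6h cadence are invented here; the actual change is puppetized and may look different.

```ini
# Hypothetical sketch: unit names and cadence are invented, not taken
# from the actual puppet change.

# /etc/systemd/system/rsyslog-restart.timer
[Unit]
Description=Periodically restart rsyslog to work around log loss (T357616)

[Timer]
OnBootSec=6h
OnUnitActiveSec=6h

[Install]
WantedBy=timers.target

# /etc/systemd/system/rsyslog-restart.service
[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl restart rsyslog.service
```

As jayme notes, any cadence under 24 hours only bounds the damage: logs are still lost when a pod is removed or re-scheduled before rsyslog has processed its logfile.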