[11:03:42] Do we have any existing process monitoring via Prometheus? I don't believe that we use process exporter, and I'm not sure how I feel about bringing that in. It just feels a little wrong to hack process information into node exporter or other relevant exporters
[11:24:14] slyngs: we do in the form of checking for systemd unit failed
[12:24:39] Does that still work if it's an init.d script (started by systemd) that spawns multiple processes?
[13:13:19] slyngs: yes that'll work too
[13:13:35] in general I think we can rip all check_proc from icinga
[13:13:35] Perfect, then I just need to delete stuff :-)
[13:13:43] exactly, good times
[13:13:44] Awesome
[13:16:10] So I have a storage related question regarding Prometheus. In https://gerrit.wikimedia.org/r/c/operations/puppet/+/989458, I am adding more labels to a recording rule. Specifically, more label variations to be recorded, beyond the current set of four, to seven. The underlying metrics have a cardinality for that label of 19. I would expect most of our services to never use the top 1/3
[13:16:13] or so, as in: they'd always be 0. Looking at the change, and wondering if what we currently record is enough: if the top 1/3 is always 0, and most if not all queries only ask about a specific bucket, shouldn't we just ditch the whole constraint and record all? What do people think?
[13:16:20] there will be prometheus alerts, I'm testing new prometheus on prometheus1005
[13:17:49] klausman: yeah ditching the le match seems reasonable to me
[13:18:15] though I don't know why the le match was there in the first place
[13:22:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[13:23:56] godog: the initial recording rule was mostly so when used in SLOs, Thanos/Prom can actually handle it, because the cardinality of the underlying metric is insane when looked at in a 90d window. When I initially made the RR, I went pretty aggressively on the "pare down labels" side. I am now wondering if maybe I was too aggressive on `le` specifically.
[13:26:47] klausman: ack, then yeah if 'le' is more or less matching everything already might as well ditch it
[13:27:13] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[13:27:32] Will update the patch
[13:31:51] and done.
[14:56:58] godog re: our discussion yesterday. If I set "send_resolved:true" on a phab task receiver will it post a follow-up when the alert clears?
[14:59:28] inflatador: I haven't tried, though the code assumes it isn't receiving resolved notifications
[14:59:38] "the code" being https://github.com/knyar/phalerts
[15:00:10] ah, thanks for the link! I'll take a look
[16:37:13] upgrading prometheus on prometheus2006, there will be a thanos alert
[16:44:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:49:13] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[18:25:41] (PrometheusRuleEvaluationFailures) firing: (3) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[18:26:52] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[18:30:42] (PrometheusRuleEvaluationFailures) resolved: (24) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[18:31:52] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[18:53:41] (PrometheusRuleEvaluationFailures) firing: (31) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[18:53:52] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[18:58:42] (PrometheusRuleEvaluationFailures) resolved: (34) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[19:03:52] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures