[11:20:58] herron: o/
[11:21:20] I have sent two code reviews for pyrra, not 100% sure if they are good, lemme know when you are online so we can discuss how to proceed :)
[11:21:40] so far the feeling is that pyrra is way more user friendly than grizzly
[12:15:10] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[12:20:10] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[12:25:10] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[12:25:27] (PrometheusRuleEvaluationFailures) firing: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[12:30:10] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[12:30:27] (PrometheusRuleEvaluationFailures) resolved: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=eqiad%20prometheus%2Fops - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[12:33:10] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[12:38:10] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[14:20:10] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[14:25:10] (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[14:51:41] elukey: sweet! added some comments on the patch
[14:53:13] herron: thanks! Replied, I have some doubts about how the "metric" is used in the config, namely if "increase" may interfere with what pyrra does behind the scenes
[14:56:58] mmhh the thanos errors above are due to prometheus/k8s in codfw not being happy (SIGKILL afaics)
[14:58:53] godog: do you want me to hold off with https://gerrit.wikimedia.org/r/c/operations/puppet/+/974148/1 ?
[15:00:46] elukey: thank you for asking, no I think that's fine
[15:35:18] herron: https://gerrit.wikimedia.org/r/c/operations/puppet/+/974149 should be ready to go!
[15:41:31] elukey: 🙌
[15:41:40] thank you!
[15:46:12] herron: niceee!! Shall I just merge and wait for puppet? Or is there anything to do?
[15:46:50] yup just a puppet run, and thanos rule may need a reload if it doesn't pick up the output rules right away
[15:50:53] super, trying
[16:02:49] very weird, I see something like
[16:02:49] msg="objective with grouping unsupported with generic rules" objective=liftwing-requests-revscoring
[16:03:02] in pyrra-filesystem (on titan1001)
[16:03:41] ah no ok it is a warning also in other use cases
[16:04:00] but there is a clear "failed to reload prometheus" sigh
[16:04:49] yeah that's expected for the moment, since we are running thanos there is no prometheus service to reload
[16:05:10] afaik it doesn't support a custom command there, I should open something with upstream about it
[16:08:11] herron: ah snap I realized that "grouping" doesn't do what I thought, I wanted to create a separate SLO for each revscoring-* namespace
[16:08:35] but afaics now I have a single one with the sum of all the reqs
[16:10:05] basically https://gerrit.wikimedia.org/r/c/operations/puppet/+/974214
[16:10:08] IIUC
[16:10:26] there it goes, it just takes a min while the recording rules catch up
[16:10:40] lots of SLOs now 😅
[16:41:50] herron: yes yes what I meant is if my understanding of "grouping" is right or not, namely if I'll see one SLO for each "destination_service_namespace" or not
[16:42:23] elukey: does it look right if you reload now?
[16:42:47] herron: ah wow now I see!
[16:42:58] it is in the main page
[16:43:09] ohh ok separated by site
[16:43:12] got it now
[16:43:49] ahh got ya, yeah it essentially creates them as line items
[16:48:14] I'll check numbers tomorrow, but the dashboards look way nicer
[16:49:04] sounds good!