[15:55:11] hello folks!
[15:55:29] I know that you are all at the Summit but I wanted to report something weird that I am seeing with Thanos
[15:56:18] The SLO dashboards for Lift Wing are showing some weird numbers, like SLO 1500%, so I tried to check the related metrics
[15:56:23] the first example is https://w.wiki/9S2m
[15:56:49] the first graph is the SLI metric (availability), the second is the numerator and the third the denominator
[15:57:22] In theory the third should report the same metrics as the second, plus others (the 5xx errors etc..)
[15:57:42] if I pick
[15:57:43] istio_sli_availability_requests_total:increase5m{destination_canonical_service="revertrisk-language-agnostic-predictor-default", destination_service_namespace="revertrisk", prometheus="thanos-rule", response_code="200", site="codfw"}
[15:57:53] in the second is ~3000, in the third 0
[15:58:14] not sure if I am missing something obvious or not
[16:32:43] elukey: I'm seeing gaps in the sli panels on the grafana dashboards, hmm I wonder if we're having issues with these underlying recording rules
[16:32:55] herron: o/
[16:33:11] hey :)
[16:33:11] yeah that too, but it was something that we already had :(
[16:33:39] I noticed https://gerrit.wikimedia.org/r/c/operations/puppet/+/992415
[16:33:46] but not sure if it plays a role or not
[16:38:06] it seems harmless but the timing lines up roughly, I'm having a look through thanos logs
[16:45:22] hmm no I'm off by a month wrt that patch. lines up somewhat with remediation steps in T356788 although still looking
[16:45:22] T356788: thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788
[17:01:44] elukey: hm! that's weird, you get good data if you try it with `response_code=~"[234].*"` or even `response_code=~"..."`, but nothing for `.*` or if you leave it out
[17:03:38] rzl: o/ thanks for checking! So with the response_code=~... I get only eqiad with a value, not codfw
[17:03:49] even stranger :D
[17:04:30] oh you're right! I was just eyeballing the graph
[17:05:28] to me it sort of smells like a query issue and not a recording rule problem, but I don't have anything concrete to point at
[17:05:50] yeah same feeling, it is like hitting different endpoints
[17:11:16] going afk for today folks, thanks a lot for checking! Lemme know if anything pops up, but please enjoy Warsaw first :)
[17:12:23] have a good night elukey, I'll keep looking at this
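
Editor's note: the manual check described in the log (compare the numerator and denominator series label by label, then see what that does to the ratio) can be sketched offline as below. This is a minimal illustration and not something anyone ran in the channel. The helper names (`labels`, `missing_from_denominator`, `availability`) are hypothetical, and all sample values are invented except the codfw pair (~3000 in the numerator, 0 in the denominator), which comes from the log. No Thanos or Prometheus API is called.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# A series is identified by its label set; the value is its increase over the window.
Labels = FrozenSet[Tuple[str, str]]
Series = Dict[Labels, float]


def labels(**kv: str) -> Labels:
    """Build a hashable label set from keyword arguments (hypothetical helper)."""
    return frozenset(kv.items())


def missing_from_denominator(numerator: Series, denominator: Series) -> Series:
    """Return numerator series that have traffic but no positive denominator value."""
    return {
        key: value
        for key, value in numerator.items()
        if value > 0 and denominator.get(key, 0.0) <= 0
    }


def availability(numerator: Series, denominator: Series) -> Optional[float]:
    """Ratio of summed numerator to summed denominator, or None if there is no traffic."""
    total = sum(denominator.values())
    if total == 0:
        return None
    return sum(numerator.values()) / total


# Sample data: only the codfw values (~3000 vs 0) are taken from the log;
# the eqiad values are made up for illustration.
codfw_200 = labels(site="codfw", response_code="200")
eqiad_200 = labels(site="eqiad", response_code="200")
eqiad_503 = labels(site="eqiad", response_code="503")

numerator: Series = {codfw_200: 3000.0, eqiad_200: 150.0}
denominator: Series = {codfw_200: 0.0, eqiad_200: 150.0, eqiad_503: 60.0}

suspects = missing_from_denominator(numerator, denominator)
ratio = availability(numerator, denominator)

for key in suspects:
    print("in numerator but empty in denominator:", dict(key))
if ratio is not None and ratio > 1.0:
    print(f"SLI above 100%: {ratio:.0%}")
```

With these made-up numbers the ratio comes out far above 100%, which is the shape of the symptom reported on the dashboards: a denominator that silently drops series the numerator still counts inflates the SLI.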