[15:04:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[15:09:13] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[15:17:51] herron: o/ sorry total brain fail
[15:17:59] I realized only now the horrible CR that I sent
[15:18:04] hope the current one is better :D
[15:20:27] elukey: haha no worries, one more comment and then I think its good to go!
[15:23:24] herron: I have a doubt though - when I check in Thanos istio_request_duration_milliseconds_bucket I see that multiple time series are reported, one for each le label.. This is why for total I used +Inf, since it should in theory be equal to the total amount of queries.. Is the concern of multiple time series not a problem? (Trying to understand sorry)
[15:28:57] elukey: yes you have a good point, ok lets go ahead and try with +Inf and see
[15:29:31] super, ok to solve the comments?
[15:29:50] just did! 🚀
[15:30:06] thanks!
[15:30:45] if this turns out to be ok as well I'll create some basic puppet code to avoid copy/paste every time, since most of the SLOs will be similar
[15:31:16] will they be similar enough to use grouping, or different queries?
[15:33:00] we could use grouping in theory but I'd need to remove restrictions on the number of time series returned, so thanos will not like it :(
[15:33:15] basically now I restrict to only one service
[15:33:26] but we have ~10 similar to it now
[15:40:11] gotya ok
[16:23:14] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:24:29] little follow up: https://gerrit.wikimedia.org/r/c/operations/puppet/+/979126
[16:24:37] Pyrra suggests to use the histogram count
[16:24:58] thanks :)
[16:25:25] the thanos alerts is me btw, I rolled out a change that restarted prometheus
[16:25:29] 👍
[16:28:13] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[16:36:20] all updated!
[16:36:24] so far the new dashboard looks good
[16:36:38] I'll check it again tomorrow
[16:36:42] with more data etc..