[08:46:02] mornin'
[08:46:50] godog: did you see my question re: `site` and `prometheus` on the SLO/Rec-Rule change? (it's on the last open code review comment)
[09:22:44] klausman: good morning, oops! my bad, I'll reply now
[09:23:19] thanks!
[09:23:30] (and no worries, it wasn't really an obvious spot)
[09:23:59] sure np, but yes you need 'site'
[09:26:51] I'll also add prom since it's useful at least for LW (staging vs. prod in codfw for example), and I suspect it will be useful for others as well
[09:27:14] ah yeah that's true, fair point
[09:31:50] Thanks for the review and general discussion! Once I merge the change, will the config be deployed automagically, or is there anything I need to do beyond puppet-merge?
[09:37:24] the latter, nothing required other than puppet-merge
[09:45:39] Grazie mille :)
[09:46:35] and done
[09:48:00] \o/ \o/
[14:18:45] How do I debug grr failing with: Non-200 response from Grafana: 500 Internal Server Error
[14:18:59] There's nothing obviously wrong with my change. I think.
[14:21:03] klausman: could you paste the full command/error output?
[14:21:10] sec
[14:24:24] $ grr preview slo_dashboards.jsonnet
[14:24:26] Non-200 response from Grafana: 500 Internal Server Error
[14:24:29] That's all she wrot
[14:24:33] +e
[14:24:47] looking
[14:25:47] hmm, seems ok when I try, are you running on grafana1002?
[14:25:55] yep
[14:26:18] but I am working with a checkout of https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/952226
[14:26:25] I am pretty sure my change is otherwise ok
[14:27:37] ah, yeah I'm reproducing that now
[14:27:58] There's some whitespace grunge going on, but quickfixing that didn't help
[14:36:36] klausman: looks like invalid UID, I think due to spaces in the top level element for that slo dashboard: logger=context userId=0 orgId=1 uname= t=2023-08-24T14:35:24.848825393Z level=error msg="Invalid app URL" error="invalid dashboard UID" remote_addr=127.0.0.1 traceID=
[14:37:22] ah. lemme check
[14:37:51] yep, s/ /-/g fixed it
[14:37:57] Thanks!
[14:38:07] the two slos could be placed under a single "Liftwing" element (currently there are two, starting with "Liftwing-Revscoring Latency"), that'd give two panels on a single dash instead of two dashboards with one panel each
[14:38:15] np!
[14:39:07] There is also the question of how to organise the future non-revscoring LiftWing SLOs
[14:39:29] Is there a particular reason why the dropdowns at the top (site, cluster) are inactive?
[14:42:48] ah, maybe a dashboard for liftwing-revscoring in that case? depending on whether you wanted all liftwing slos on the same dash, or broken out into multiple dashboards.
[14:43:36] yeah it's all a bit messy since there are so many different services in Revscoring that e.g. the remaining budget widget is basically unreadable
[14:44:23] https://phabricator.wikimedia.org/F37626690
[14:44:36] there are label queries for site and cluster, site is a global default so far in slo_template.libsonnet, and cluster can be overridden in slo_definitions.libsonnet (default value in slo_defaults.libsonnet)
[14:45:52] the graphs on the right side also seem to have exactly one datapoint in them, which is odd, since the SLI queries in the config produce normal-looking graphs on thanos
[14:47:11] well the good news is that there are 100 100%s :)
[14:48:10] I think I improved things a bit by having a sum by namespace rather than canonical service and namespace
[14:48:45] Now the budget widget is readable, but there's still only one datapoint
[14:49:47] https://phabricator.wikimedia.org/F37626723
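
Editor's note on the UID fix at 14:37:51: the 500 came from a dashboard UID containing a space, and `s/ /-/g` on the top level element name resolved it. A minimal Python sketch of a pre-flight check that could catch this before running `grr preview` follows. The helper names `uid_from_title` and `is_valid_uid` are hypothetical, and the allowed character set and 40-character limit are assumptions about Grafana's UID rules, not something stated in the log, so verify them against the Grafana version in use.

```python
import re

# Assumption: Grafana accepts only letters, digits, '-' and '_' in a
# dashboard UID, with a maximum length of 40 characters.
VALID_UID = re.compile(r"[A-Za-z0-9_-]{1,40}")


def uid_from_title(title: str) -> str:
    """Hypothetical helper: mirror the s/ /-/g fix from the log."""
    return title.replace(" ", "-")


def is_valid_uid(uid: str) -> bool:
    """Return True if the whole string matches the assumed UID pattern."""
    return VALID_UID.fullmatch(uid) is not None


if __name__ == "__main__":
    title = "Liftwing-Revscoring Latency"  # element name quoted in the log
    uid = uid_from_title(title)
    print(f"{title!r} valid: {is_valid_uid(title)}")
    print(f"{uid!r} valid: {is_valid_uid(uid)}")
```

Running the check over every top level element name in the SLO definitions would flag names with spaces before Grafana rejects them with a bare 500.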