[07:34:13] mutante: that's correct yeah [07:34:30] XioNoX: ok! I'll take a look later [11:51:52] hi, i have slightly enhanced the Apache Access log dashboard at https://logstash.wikimedia.org/app/dashboards#/view/825c5c80-8aef-11eb-8ab2-63c7f3b019fc [11:52:14] notably to display the top remote client IP (which is found from the X-Forwarded-For header when mod_remoteip is enabled) [15:47:41] godog: thanks! I wanted to know if I can just repeat that line to get output in 2 or more channels, and turns out I can, works :) [16:51:56] fyi all i have merged https://gerrit.wikimedia.org/r/c/operations/alerts/+/903687 to try and reduce some of the noise [17:02:54] thanks jbond [17:14:09] \o so I'm trying to add a new field to grizzly `slo_definitions.json` that defines a human-undestandable string for the respective services' SLI, and then have that field get plumbed through to `slo_template.libsonnet` [17:14:12] I'm having a bit of trouble understanding how it currently works though. In particular I'm trying to understand where `title` originates from here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/grafana-grizzly/+/master/slo_template.libsonnet#37 [17:15:26] The title seems to match this field https://gerrit.wikimedia.org/r/plugins/gitiles/operations/grafana-grizzly/+/master/slo_definitions.libsonnet#114 but I don't quite understand how it knows to refer to that by `title` [17:19:58] mm I guess the actual answer wrt `title` doesn't matter. I think I can just define a new `sli_panel_title` in `slo_definitions.json` and then have `slo_template.libsonnet` refer to `slo.sli_panel_title` [17:21:39] ryankemper: there is a chain of functions calling each other there and title is a param. the slo definitions are looped over and passed into rowGenerator, which passes them along to panelGenerator [17:25:37] but yeah that's the right idea, essentially line 114 would become the name of the dashboard and line 116 would become the name in the row/panels. you can see they differ in some other SLOs [17:40:53] herron: I threw up a quick and dirty patch here: https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/903731. Something's wrong with my approach though because `grr preview` bombs out. Error output follows: [17:40:56] https://www.irccloud.com/pastebin/h24TfLc3/ [17:41:20] let's see [17:44:04] * ryankemper wonders if maybe it just doesn't like trailing commas in slo_definitions.json [17:44:22] .jsonnet* [17:44:35] ryankemper: would changing line 123 to something other than WDQS do the trick? [17:45:24] "WDQS Request Failure Rate" should become "WDQS Request Failure Rate SLI" in the panel [17:47:26] in other words I'm wondering why we need to set the sli panel title specifically, since the title should match the slo panel too [17:49:02] herron: one issue is that that title field get used for other stuff. For example the dashboard description would then say `See the complete WDQS Request Failure Rate documentation here: https://wikitech.wikimedia.org/wiki/SLO/WDQS` which is a bit awkward [17:49:11] agreed on the slo vs sli point [17:50:08] Let me back up a bit. If we glance at a random grizzly dashboard - I'm going to use a non-WDQS one for this example: https://grafana-rw.wikimedia.org/d/slo-Logstash/logstash-slo-s?orgId=1 - currently the "SLO" dashboard titles are pretty understandable since every service is specifically using an error budget metric [17:51:03] By contrast, an SLI panel title like `Logstash Latency SLI` is a bit vague. I know that the graph is going to be something about a latency SLI, but the title isn't really telling me that, for this service, the SLI is specifically a failure rate [17:51:55] So my thinking was it'd be nice if the service owners had a field that they could define to set the panel title to something a bit more descriptive, like "Logstash Request Failure Rate" instead of a more vague "Logstash Requests SLI" [17:53:37] If we could be guarantee via convention that it'd always be some type of failure rate, we could just change the template in a similar way to how the "SLO" panel is called "$serviceName Error Budget Remaining". But idk how realistic that guarantee is given all the possible types of SLIs that a service owner might go with, thus the idea to just have an additional field that could be defined [17:53:54] ok I think I'm with you, fwiw the SLO and SLI panels share the same prefix since they are meant to show the same information in to ways. i.e. here's the budget status, and here are the indicators/events that yeilded that budget [17:54:18] but in both cases it'd be for "logstash latency" for instance [17:55:00] WDQS I think is just missing a descriptive term, currently it's only the service name while elsewhere we'll say requests, latency, combined, etc. [17:56:50] The point on WDQS having a more descriptive term absolutely makes sense to me. As for the second half of the issue, we're still left with the (perhaps minor) issue of names like `Logstash Latency SLI` / `Logstash Requests SLI`, lacking the description of what the actual SLI is [17:57:25] in other words there's a lot of possible different SLIs. It could be the request success %, request failure %, % of requests that are successful and returned within xx ms, and so forth [18:00:09] got it, yeah it's sort of intentionally vague in order to support any type of SLO we wanted to define, but yeah maybe we can add a panel details field too. you are thinking something more than we're already doing in the description field? [18:03:19] we could also rename from "Error Budget" to "SLO Budget" across the board I think. We've been referring to that as an "error budget" but that's confusing since error could mean several things [18:03:47] it's more of an exception budget than error specifically [18:06:09] I think description is 80% of the way there, but feels like we're missing a field called something like `SLI_type` that would be a string like `Request failure rate`, which could be plumbed through to the panel title [18:06:31] For the few services that have descriptions, their descriptions are actually phrased in terms of the success state, but the `sli_query` is actually given as a failure rate [18:06:51] so if one were to just read the desc in the generated dashboard I think it would be easy to come away with the wrong impression [18:07:35] Part of this might be that generally `SLI` - at least afaik - is more commonly the positive and not the negative metric, but here all of our services use the "negative" metric [18:09:38] Like I think a common pattern is the SLO to be something like `five 9's of availability` and then the SLI is `availability %`. But we're by convention (we = wmf) doing the SLI as `1 - availability %` based off what I see in `slo_definitions.libsonnet` [18:12:42] So yeah basically in the current state of things, if I glance at say the trafficserver slo dashboard and read the description as `Trafficserver uses one combined latency-availability SLI: A response is satisfactory IF it spends less than 150 ms processing time in Trafficserver, AND it isn't a Trafficserver internal error`, and then I see a graph titled `Trafficserver Combined SLI`, [18:12:49] yeah in terms of a graph anyway the SLI is usually easier to read when spikes relate to an incident or problem [18:12:50] if I didn't know better and hovered over to see values of like `0.0015%` my first instinct would be "oh my god there's an outage the SLI is almost at 0%" before realizing that the SLI is really the failure % and not the success % [18:13:43] herron: yeah I think that definitely makes sense. I think I'd just want the title of the graph to convey that, that it's for ex a "% of requests failed" and not succeeded [18:15:28] thus the motivation for some type of field to stick that context into and have it get plumbed through to the eventual graph that shows [18:18:23] to be clear specifically a `slo.sli_panel_title` is a bit ugly, it'd probably be better to have like a `slo.sli_type` with a string like `Requests failed %` and then the eventual sli graph title would be `$title SLI - $slo.sli_type` giving a result like `WDQS SLI - Requests failed %` [18:22:56] ok, but I'm wondering if we could change "WDQS" to "WQDS Requests failed", since the dashboard is already called WDQS we can add that detail to the per-slo title [18:23:34] we can add more fields too, but this SLO has dashboard name "WDQS" and slo name "WDQS" which we could expand upon first [18:26:42] as an example https://www.irccloud.com/pastebin/6YvGoBu4/ [18:34:43] for wdqs specifically that would improve things, yeah. that title is a bit overloaded though, other parts of the template assume the title is gonna be a more generic name like "WDQS Uptime"; that's easiest to see in the top box that would now say `See the complete WDQS Requests Failed documentation here: https://wikitech.wikimedia.org/wiki/SLO/WDQS` whereas it would be a bit better for it to say `See the complete WDQS Uptime..` [18:36:14] maybe this goes back to the positive/negative distinction though. I think for search team we think of it as a `WDQS Uptime SLO` with the real SLI being uptime, not failure %. But to match the conventions that other services used we kept the `1 - uptime` pattern for the sli query. So maybe we just want the title to be `WDQS Uptime` and have the SLI query encapsulate the uptime not the failure [18:40:23] tl;dr: with the current assumptions built into the template, if we had a title of `WDQS Uptime` and the SLI query directly representing `Uptime` everything would be as we want it. But I still think that other teams' dashboards are still a bit unclear in the current state of things, and part of my motivation was to see if there's a sane way to refactor things that handles that as well [18:44:48] herron: anyway, really appreciate your time! I'll need to think on this stuff and play around with a couple different approaches and see if I can come up with something coherent :P [18:47:17] ryankemper: for sure! sounds good, and yeah open to additional titles too, happy to have another look at that patch when ready. I think in general the more general purpose and applicable to all/most panels the better [18:48:28] herron: oh, one last thing. the initial error I posted way above is actually an error I'm seeing with latest master and not an issue with the patch itself as I originally assumed. So I think something's broken with current master [19:14:21] ryankemper: ok I'll have a look at that [19:15:54] ryankemper: hmm it works on my machine™ [19:17:26] herron: weird, I'll need to take another look later. maybe there's something about `grr` I'm doing wrong. Here's what I see on my end though: [19:17:32] https://www.irccloud.com/pastebin/SpkSuirC/ [19:18:01] (`e5a525bc...` = `Grizzly: update grafanaDashboards objects to hidden field`) [19:18:34] ohhh, try like `grr preview slo_dashboards.jsonnet` [19:19:55] I see that too when calling without the filename, what an excellent argument missing error that is 🙃 [19:32:28] * ryankemper slinks away in shame [19:32:35] of course!