[08:55:13] godog: is there a way to tell grafana to fetch data only from one of the prometheus instances? (or aggregate both, or similar)
[08:56:10] it makes sense that two measurements give different absolute values, but the issue is that we can't seem to show both on the same graph; instead it randomly shows one or the other, and that is quite confusing
[09:03:16] dcaro: yeah the easiest is to switch to thanos, then you can use the 'site' and 'prometheus' labels to select which one you want, results will be joined under the hood
[09:03:29] ...{site="eqiad", prometheus="ops"} in your case
[09:34:30] 👍
[09:34:32] thanks!
[09:41:57] for sure!
[10:29:13] Hello everyone. So I am still working on that SLO dashboard (or rather the rewrite rules for it).
[10:29:45] As it turns out, said rewrite rules have a lot of still-interesting labels, e.g. this is what would be useful:
[10:30:03] sum by (chart, destination_canonical_service, destination_canonical_revision, destination_service_namespace, le, request_protocol, response_code, response_flags) (rate(istio_request_duration_milliseconds_bucket{prometheus!~".*-ml.*"}[5m]))
[10:30:26] (ignore the prometheus!~ bit)
[10:30:52] The usual convention would make for a _very_ long recorded metric name
[10:31:27] I could abbreviate it to `destsvc_rev_ns_le_proto_rc_rf:istio_request_duration_milliseconds_bucket:rate5m`
[10:31:55] But I don't know if that is still useful or what other options we have to avoid super long recorded metric names.
[10:33:06] I have also tried to make this aggregation somewhat useful for non-ML k8s/istio services, but I don't know if that is desired or if I should add a label limiter so that only the ML clusters would record anything (this might be tricky to get right).
[10:33:09] Thoughts?
[11:28:56] godog: hey
[11:29:21] probably late in the day for my question - I had some queries about the esams -> knams move we are planning for next week
[11:29:57] we will decom all existing hosts/VMs in esams, and then build replacement hosts/VMs at the new site
[11:30:50] that will involve a new prometheus host at least, but I guess we need to work through the steps of how we get from A to B
[12:54:50] topranks: yo! generally speaking you can provision a new prometheus host in parallel with the existing one, and add the new hostname in puppet alongside the existing esams host
[12:54:56] klausman: I'll take a look shortly
[12:56:14] topranks: one caveat is that the new host won't have the historical metrics, and it'll accumulate its own metrics as time goes by
[12:57:11] godog: ok cool that sounds workable
[12:57:28] you are away next week, is that right?
[12:58:16] topranks: no I'll be here! I'm floating Aug 15th
[12:58:31] ok great :)
[12:58:43] well with any luck I won't need to bother you much
[12:58:45] thanks!
[12:59:22] topranks: sure np, please feel free to send the reviews my way
[13:07:26] klausman: interesting problem, I don't think we've run into that before. For the problem at hand, are all the labels useful/used for SLO purposes? If not, I'm imagining two recording rules, one for SLOs and the other generally useful. That doesn't directly address the "a lot of labels" problem, but some labels, like "le", can I'd imagine be omitted from the recording rule name
[13:08:10] well, le is the latency bucket, and the interesting one might be different between individual services
[13:09:10] As for the rest of the labels, response code and flags could maybe be dropped from the latency metric, since we only care about 2xxs there. But we need that for the error rate, of course.
[13:09:58] the canonical service and namespace are needed to distinguish services from each other. Chart I am not sure of. Revision is good to see in a graph when a service config was updated.
[13:10:09] yeah, dropping the labels from the recording rule name itself while keeping them in the expression, that is
[13:10:12] protocol might be dropped, since it's (so far) always http
[13:10:22] Oh! yes :)
[13:10:35] I misread your point about le
[13:11:28] yeah I feel like this is one case where we can bend the rules (see what I did there?)
[13:11:54] Overall, we'd have six rules (three basic ones, each at both 5m and 90d). The rules would be the latency buckets, the total request count (for latency bucket percentages), and the number of requests separated by response code.
[13:14:16] yeah that seems sensible to me, though folks with more SLO experience than me may be able to provide more insights
[13:14:51] I was mostly trying to figure this stuff out on a coarse level before making a complex change that then becomes a quagmire :)
[13:17:26] hehehe that's wise
[13:38:32] dcaro: poxy.har !
[13:40:51] heheheh, I have some lazy fingers that end up creating weird typos xd
[13:41:40] lol
[13:47:43] godog: the other question is whether to record the base counter and compute the rate in the query later, or compute the rate in the recording rule and query that directly. The former has the advantage of reducing the number of rules for the SLO (we need the 5m rate for the SLI graph and the 90d rate for the SLO measurement)
[13:54:44] klausman: good question, I'm not 100% sure offhand, though personally I tend to prefer to have the rate already
[13:54:51] i.e. the rule is ready to be used
[14:02:51] Ah, actually rate-then-sum is preferable, because otherwise counter resets get lost. Since we sum in the recording rule, we _must_ rate there as well
[14:05:18] https://phabricator.wikimedia.org/T327620#9083810 I've put my thoughts so far here. Feel free to dump thoughts there or here
[14:41:50] neat, I'll read up on the task
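As a rough illustration of the approach discussed above (rate-then-sum inside the recording rule so counter resets are handled, labels kept in the expression but abbreviated or omitted in the rule name), here is a minimal sketch of two of the 5m rules. The group name and the record names are hypothetical abbreviations, not the final naming; the `_count` series is assumed to carry the same labels as the `_bucket` series per the usual Prometheus histogram convention; and the 90d variants would follow the same pattern with a `[90d]` range and `:rate90d` suffix.

```yaml
# Sketch only: rule group and record names are illustrative, not the agreed-upon ones.
groups:
  - name: istio_slo_recording_rules   # hypothetical group name
    rules:
      # Latency histogram buckets: rate() inside, sum() outside, so counter
      # resets are accounted for before aggregation. "le" stays in the
      # expression but is omitted from the rule name.
      - record: destsvc_rev_ns:istio_request_duration_milliseconds_bucket:rate5m
        expr: |
          sum by (destination_canonical_service, destination_canonical_revision,
                  destination_service_namespace, le) (
            rate(istio_request_duration_milliseconds_bucket[5m])
          )
      # Request rate split by response code, for the error-rate SLI
      # (assumes the histogram's _count series carries response_code).
      - record: destsvc_rev_ns_rc:istio_request_duration_milliseconds_count:rate5m
        expr: |
          sum by (destination_canonical_service, destination_canonical_revision,
                  destination_service_namespace, response_code) (
            rate(istio_request_duration_milliseconds_count[5m])
          )
```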