[10:59:14] godog: https://github.com/prymitive/karma/pull/5086 just got merged, superfast :), any idea on how/when we can get it on our side? (no rush, just curious)
[11:00:07] dcaro: sweet! thank you for working on that, can confirm upstream is pretty great
[11:00:36] after a release is made it is reasonably easy/fast to update our debian package and deploy
[11:29:47] 👍
[11:30:30] godog: Wow.... https://github.com/prymitive/karma/releases/tag/v0.113 already released
[12:17:18] haha! Lukasz is awesome
[12:18:23] dcaro: so there are two ways forward I can see: if you have time/bandwidth and want to take a stab at packaging the new version that's good, otherwise I'll take a look at it next week
[12:19:48] It's not urgent, so I might not find the time, do you have the code/scripts you used before?
[12:21:45] dcaro: yes, code is at operations/debs/karma and instructions in debian/README.source
[12:22:54] Ack, I might give it a go :)
[12:23:19] a ... go lang
[12:23:20] https://commons.wikimedia.org/wiki/File:Sting.ogg
[12:23:27] * godog grabs coat
[21:55:41] hey o11y, I'm looking at the wdqs SLO metric performance and could use some help understanding the options available https://phabricator.wikimedia.org/T328306. Thus far the metrics have been isolated into their own group and the evaluation interval was increased to 4m (https://gerrit.wikimedia.org/r/c/operations/puppet/+/884906/2/modules/profile/files/thanos/recording_rules.yaml)
[21:56:16] Per the grizzly dashboard (https://grafana.wikimedia.org/d/slo-WDQS/wdqs-slo-s?orgId=1) it looks like it's still having trouble keeping up (based on the discontinuities present in the SLI panel)
[21:57:23] So with all that being said, what can be done further? If it would help, we have room to further increase the evaluation interval, since we care more about the overall trend than getting super-up-to-date metric output. Trying to drill down to some concrete questions:
[21:58:35] (1) If we further increase the eval interval, could that hit a tipping point where it worsens performance because, while the "batches" are farther apart, there's more work to do to sum up across the increased interval? Or is that not a concern? If not, then an interval of, say, 15 minutes might be a good thing to try next
[22:04:46] There was going to be a question (2) regarding the ticket mentioning that we can `reformulate these rules in terms of lower-cardinality and pre-aggregated rules`, but I think I may have rubber-ducked myself, so I'll leave it at just question #1 for now while I look into `job_backend:trafficserver_backend_requests:avail5m` :P
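
For context on questions (1) and (2) above, here is a minimal sketch, in standard Prometheus/Thanos recording-rule syntax, of the two ideas being discussed: a dedicated rule group with a longer evaluation interval, and a lower-cardinality, pre-aggregated intermediate rule that the SLO rule then consumes. All group, rule, and metric names below are hypothetical illustrations and are not the actual contents of modules/profile/files/thanos/recording_rules.yaml or the WDQS SLO rules.

    # Hypothetical recording-rule group; names do not reflect the real WDQS rules.
    groups:
      - name: wdqs_slo        # dedicated group isolating the SLO rules
        interval: 4m          # evaluation interval raised from the default
        rules:
          # Pre-aggregated, lower-cardinality intermediate rule: sum away the
          # per-instance labels once, so the SLO rule below queries far fewer series.
          - record: cluster_code:wdqs_requests:rate5m
            expr: sum by (cluster, code) (rate(wdqs_requests_total[5m]))
          # Availability SLI computed from the pre-aggregated series above.
          - record: cluster:wdqs_requests:avail5m
            expr: >
              sum by (cluster) (cluster_code:wdqs_requests:rate5m{code!~"5.."})
              /
              sum by (cluster) (cluster_code:wdqs_requests:rate5m)

If that is roughly what the ticket means by pre-aggregation, it is the intermediate rule that cuts the number of series each SLO evaluation has to scan, which typically helps more than stretching the interval further; a longer interval only spaces the evaluations out, it does not make each one cheaper.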