[16:34:34] Hey hello, I have quite a few questions regarding the graphite -> prometheus migration and the effects it has on Grafana. Maybe someone here can point to some doc resources or help out otherwise.
[16:35:47] I'm working for the Wikibase Product Platform team, and following this patch: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1039963 we are trying to migrate our Grafana dashboards from Graphite-based to Prometheus-based
[16:37:25] I made an attempt here and can't quite make the numbers match: https://grafana-rw.wikimedia.org/d/utxtdbTVz/silvan-s-sandbox?orgId=1
[16:40:33] So we are wondering what the possible reasons are for Prometheus reporting much lower numbers than Graphite, and how to mitigate that
[16:50:49] Hi Silvan_WMDE! There are many things that could cause the numbers to not quite line up. Looking at the dashboard in this case, it may be data windowing and how the graph is calculated.
[16:54:46] Can you see the settings for the prometheus panel with its two queries
[16:54:46] `sum(increase(mediawiki_rest_api_latency_seconds_count{path=~"wikibase_v0_.*"}[$__rate_interval]))` and
[16:54:46] `sum(increase(mediawiki_rest_api_errors_total{path=~"wikibase_v0_.*"}[$__rate_interval]))` ?
[16:55:28] I thought they were easy enough as an example, but to be honest I wasn't even sure whether the sum(increase( ... )) approach is accurate
[16:56:06] In addition, Graphite had a single instance and it was widely known. It's nearly impossible to detect whether Graphite is getting data from an instance that is not backed by the new ingest instances. This is a risk we're aware of and will be working out as we encounter such cases in our complex environment.
[16:57:25] I see the queries. I can get closer to how Graphite calculates them by operating on rate() rather than increase().
[16:57:38] e.g. `sum(rate(mediawiki_rest_api_latency_seconds_count{path=~"wikibase_v0_.*"}[2m])) * 60`
[16:58:10] also, the window is changed in that query ^
[16:58:14] interestingly, I ran some experiments with a well-defined number of requests in a well-defined time frame, which I expected to show up in the graphs, and even on our Graphite panel not all of them showed up. But the Prometheus/Thanos queries report even lower numbers
[17:02:20] That's not great. :( It could still be windowing and how the graphs are calculated, though. Most of the stack can be run locally if you wanted to try it out in an isolated environment.
[17:04:39] I think the only component that I haven't tried to run is Thanos. Thanos should be largely transparent though.
[17:05:01] ok, but your switch to rate() (multiplied by 60) does have an effect already, thanks. I will have to experiment a little more, I guess.
[17:05:37] Good luck! :)
[17:05:38] Still: are you aware of any easy-to-follow docs on how to migrate Graphite queries in Grafana to Prometheus?
[17:06:29] It's quite a bumpy road for people not dealing with such matters every day
[17:08:20] Have to leave now, but I may come back tomorrow with more questions about the latency metrics :-)
[17:08:44] Not that I'm aware of. There's really no one-size-fits-all approach to migrating Graphite queries to Prometheus. There are a handful of common patterns like `sum(rate())*60`, but each graph and dashboard is bespoke.
[17:10:13] ah well, so it's actually more generic Prometheus knowledge/experience that is required when applying such changes?
[17:10:18] Some calculations are handled by Graphite and have to be explicitly calculated in Prometheus. One example of this is quantiles.
[17:11:29] I'd say so - I use this all the time: https://prometheus.io/docs/prometheus/latest/querying/functions/
[17:16:09] ah, yes. I think I've been there :-)
[17:16:40] also this https://graphite.readthedocs.io/en/latest/functions.html
[17:16:40] thanks, probably talk later 👋️
[17:16:59] see you later :)
[17:17:06] 🙏️
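The rate()/increase() relationship discussed above can be sketched in Python. This is a simplified model only: it ignores counter resets and Prometheus's window-extrapolation logic, and the sample values are made up for illustration. It shows why `sum(rate(...)[2m])) * 60` approximates a Graphite-style per-minute count:

```python
# Simplified sketch of how Prometheus's rate() and increase() relate,
# ignoring counter resets and Prometheus's extrapolation behavior.
# Each sample is a (timestamp_seconds, counter_value) pair, as if
# scraped from a cumulative counter like a request total.

def simple_increase(samples):
    """Counter growth over the window: last value minus first value."""
    return samples[-1][1] - samples[0][1]

def simple_rate(samples):
    """Per-second rate: the increase divided by the time the samples span."""
    span = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / span

# A counter scraped every 15s over a 2m window, growing by 30 requests
# per scrape (i.e. 2 requests/second, or 120 requests/minute).
samples = [(0, 0), (15, 30), (30, 60), (45, 90),
           (60, 120), (75, 150), (90, 180), (105, 210), (120, 240)]

print(simple_increase(samples))   # 240  -> total requests in the 2m window
print(simple_rate(samples) * 60)  # 120.0 -> requests/minute, like rate(...[2m]) * 60
```

Note that real Prometheus `rate()`/`increase()` extrapolate to the edges of the window when samples don't align with its boundaries, which is one source of the small discrepancies versus Graphite mentioned in the conversation.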