[08:48:30] I have a weird behavior on grafana.w.o, using the eqiad/ops prometheus datasource, this graph: https://grafana-rw.wikimedia.org/d/UUmLqqX4k/openstack-api-performance?orgId=1&refresh=30s&var-cloudcontrol=cloudcontrol1006&var-cloudcontrol=cloudcontrol1007&var-backend=keystone_public_backend&from=now-7d&to=now
[08:49:00] gives me two different versions depending on refresh (just clicking refresh on the top right or waiting for it to refresh itself)
[08:49:11] any idea why? the values are very different
[08:49:30] https://usercontent.irccloud-cdn.com/file/pvgjbVYm/image.png
[08:49:42] vs
[08:49:44] https://usercontent.irccloud-cdn.com/file/HMQtCQxQ/image.png
[08:56:16] forcing a 30m minimal interval in the query options of the graph did not help (in case it might be an artifact of displaying a very detailed, very spikey value in a smaller area)
[09:08:23] ack, taking a look dcaro
[09:19:09] dcaro: I'm wondering if the haproxy instance you are looking at needs a fix like https://gerrit.wikimedia.org/r/c/operations/puppet/+/943506 ?
[09:19:53] how would that give two sets of data on prometheus?
[09:20:03] (though yes, it might need that fix too xd)
[09:21:41] in eqiad prometheus is backed by two hosts, in this case my best guess so far is that each has a different "view" depending on which haproxy stats it got connected to the first time
[09:21:53] oh, that makes sense
[09:22:04] hm
[09:22:35] that would also mean that there are two instances of haproxy running on that cloudcontrol1007 host, no? one for each set of stats (at least)
[09:23:57] heh I'm not exactly sure how haproxy and its stats mechanism work tbh
[09:24:35] https://www.irccloud.com/pastebin/XyClC0AM/
[09:24:45] there's two processes at least :/, not sure if there should be
[09:25:19] yeah IMHO the easiest test given what we know is disabling KA and seeing if you still get different results
[09:25:33] KA?
[09:25:49] keep-alive like in the patch above
[09:25:53] aahhh, okok
[09:26:18] hmm, wouldn't that make it randomly pick one of them?
[09:26:30] (so still getting duplicated data, now mixed on the same prometheus host)
[09:28:22] my (limited) understanding is that with keep-alive enabled the haproxy stats process basically never quits/gets refreshed
[09:28:42] in other words, without keep-alive there will be new stats workers every time
[09:28:59] which should give consistent results (?) wild speculation at this point tbh
[09:40:18] https://gerrit.wikimedia.org/r/c/operations/puppet/+/947325
[09:40:29] quick review? (to make sure I did not mess up xd)
[09:41:10] dcaro: which HAProxy version are you running?
[09:41:21] CR looks good at least for 2.6
[09:41:27] HA-Proxy version 2.2.9-2+deb11u5 2023/04/10 - https://haproxy.org/
[09:41:50] https://cbonte.github.io/haproxy-dconv/2.2/configuration.html#4-http-after-response
[09:41:52] looks good too
[09:42:02] it didn't solve the issue for us BTW
[09:42:09] xd
[09:42:26] were you able to solve it?
[09:42:31] nop
[09:42:35] hahahah
[09:42:37] okok, :)
[09:42:46] querying helps a little bit
[09:42:54] aka using irate() rather than rate()
[09:43:04] but for fresh data we still have some issues
[09:43:09] not exactly the same high-level issue either, there's no rate() involved in this case
[09:43:12] in our case.. heavy drops
[09:43:15] yep
[09:43:29] but good to know anyhow
[09:43:46] what made me try getting rid of KA is that prometheus opens a TCP socket and stays there till HAProxy gets reloaded
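A quick way to poke at the keep-alive theory from the affected host would be something like the sketch below: scrape the exporter twice over fresh TCP connections and twice over a single kept-alive session, and see which pair disagrees. The port (9901) matches the curl used further down in this log; everything else is illustrative, and since the metric is a moving average, small drift between scrapes is expected either way — what the keep-alive theory predicts is flip-flopping between two distinct sets of values on fresh connections.

    #!/usr/bin/env python3
    """Rough sketch: does the exporter's answer depend on the TCP connection?

    Assumes the haproxy exporter is on 127.0.0.1:9901 (the port used with
    curl later in this log) -- adjust to the host being debugged."""
    import requests

    URL = "http://127.0.0.1:9901/metrics"
    METRIC = "haproxy_backend_http_response_time_average_seconds"

    def pick(text):
        # Keep only the samples for the metric under suspicion.
        return sorted(l for l in text.splitlines() if l.startswith(METRIC))

    # Two scrapes over fresh connections (what disabling keep-alive forces).
    fresh = [pick(requests.get(URL, headers={"Connection": "close"},
                               timeout=5).text)
             for _ in range(2)]

    # Two scrapes reusing one keep-alive connection (prometheus' behaviour).
    with requests.Session() as s:
        kept = [pick(s.get(URL, timeout=5).text) for _ in range(2)]

    print("fresh connections agree: ", fresh[0] == fresh[1])
    print("keep-alive scrapes agree:", kept[0] == kept[1])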
[09:44:22] so just 1 connection.. and considering that the prometheus exporter gets the socket migrated, it's hard to tell when it switches from scraping metrics from the old daemon to the new one
[09:47:49] we seem to be getting two different sets of data, one on each different prometheus server
[09:47:56] oh, I seem to be able to reproduce locally
[09:48:13] two different curls gave different data (from the same host, curl http://127.0.0.1:9901/metrics | grep haproxy_backend_http_response_time_average_seconds)
[09:48:31] I'll open a task to follow up the debugging
[09:49:48] mmhh ok nevermind then, clearly not the k-a issue (?)
[09:53:37] hmm, not sure though, the data there jumps quite often :/
[09:55:22] is there a way for me to select which prometheus host to get the data from? So I can compare both?
[09:57:19] mmhh not by default, you can port-forward via ssh the prometheus web ui from each host
[09:57:26] brb
[09:58:05] hmm, if I zoom in the graph (hit the 'view') then I see a spikey graph that does not seem to change on refresh
[09:58:07] https://usercontent.irccloud-cdn.com/file/7SANVR2v/image.png
[09:58:22] so maybe it's about the resolution?
[10:03:53] if I force the graph to have 2880 points or less, I see the issue; if I force it to have 2881 or more, it becomes the spikey one without the issue (that I could notice)
[10:05:10] maybe I should use some aggregation function instead of the raw values?
[10:14:22] mmhh interesting re: tweaking the data points and the issue
[10:22:44] oh, with 4000 datapoints the graph is spikey, but still getting different values (the max changed at least)
[10:23:50] hmm, even with 10k points the max jumped sometimes from ~11s to ~8s
[10:26:14] using max_over_time or avg_over_time gives different values too, so using an aggregation there did not help
[10:27:04] those are evaluated on the backend/prometheus right? so that'd mean that I'm getting different values from prometheus already?
[10:29:15] yeah that's right dcaro
[10:31:54] what would be the prometheus hosts serving the "eqiad prometheus/ops" datasource?
[10:32:12] that'd be prometheus1005 and prometheus1006
[10:32:43] port 9900 then localhost:9900/ops
[10:32:54] oh, okok, I was playing with port 9906
[10:34:00] ah yeah that's the k8s instance
[10:35:57] stats are hard xd
[10:37:50] heheh seriously
[10:38:33] how do I know if the data that each has (that is different than the other) is the "same", as they are measuring at different times getting different values for the same statistic, so some variation is expected :/
[10:40:46] yeah the expectation is that metrics fetched don't change that much between prometheus scrapes, and the values are basically the same for all intents and purposes
[10:41:09] why haproxy returns wildly different values for each scrape, I have no idea
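Given the hosts and port mentioned above (prometheus1005/1006, port 9900, the /ops path prefix), one way to compare the two servers' views directly would be a sketch like this, which runs the same instant query against both over ssh port-forwards and prints the series that disagree. The local forward ports and the label matcher are assumptions for illustration:

    #!/usr/bin/env python3
    """Rough sketch: diff the same instant query across both eqiad
    prometheus hosts.

    Assumes two local forwards are already up, e.g.:
        ssh -L 19005:localhost:9900 prometheus1005.eqiad.wmnet
        ssh -L 19006:localhost:9900 prometheus1006.eqiad.wmnet
    Port 9900 and the /ops prefix come from the conversation above; the
    label matcher is only an example."""
    import requests

    QUERY = ('haproxy_backend_http_response_time_average_seconds'
             '{instance=~"cloudcontrol1007.*"}')

    def instant_query(port):
        r = requests.get(f"http://localhost:{port}/ops/api/v1/query",
                         params={"query": QUERY}, timeout=10)
        r.raise_for_status()
        # Map each label set to its current value.
        return {frozenset(m["metric"].items()): m["value"][1]
                for m in r.json()["data"]["result"]}

    a, b = instant_query(19005), instant_query(19006)
    for labels in sorted(set(a) | set(b), key=sorted):
        if a.get(labels) != b.get(labels):
            print(dict(labels), a.get(labels), b.get(labels))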
[10:59:14] hmm, we have a prometheus exporter for haproxy, but haproxy supports exporting prometheus stats already
[10:59:39] (not that it would change the values we get)
[11:02:06] mmhh which one is used in this case ?
[11:04:21] I think the exporter, as the names of the metrics are slightly different, I don't find the one we use 'haproxy_backend_http_response_time_average_seconds' in the metrics returned directly by haproxy
[11:04:38] we should probably switch to fetching them directly
[11:07:44] interesting, now I'm wondering what we do for the rest of the haproxies in the fleet, but yeah when the software natively supports prometheus that's what we should be using
[11:20:05] it seems that some have direct haproxy stats (like cache and such), but the default haproxy jobs use the exporter (thumbor, dns, toolforge and openstack xd)
[11:21:58] ack! thank you for taking a look
[11:29:24] the built-in exporter is a relatively new thing, I think we should eventually transition to it from the external one
[12:46:02] sounds good to me
[12:59:56] Created T343885 to keep track
[13:04:26] dcaro: cheers <3
[13:05:18] hm :/, there's the issue that the names of the metrics will change though, so I can't just swap them without anyone noticing
[13:05:34] should I enable both at the same time for some time?
[13:06:06] yeah that should be fine, you are right you can't swap them as-is