[06:14:01] it seems the migration of the last wdqs server has just finished :)
[07:09:05] cool! :)
[07:20:39] dcausse: is it expected that the streaming updater hosts have a bigger `wikidata.jnl` (~1.1TB) than the old updater? (~975GB)
[07:20:50] guessing it's expected (maybe related to skolemization differences?) but just wondering
[07:21:55] ryankemper: not expected... but indeed skolemization might perhaps explain this
[07:22:43] I should have taken a closer look just after the import
[07:23:29] when looking at it a couple of weeks ago the free allocators seemed to decrease at a slower pace than with the old updater
[07:32:32] ryankemper: thanks for pushing it to the end!
[07:34:01] dcausse: I'm supposed to give a lightning talk on search APIs at the next All-tech, I was thinking maybe I replace it with a Streaming Updater LT? I can probably do the search APIs LT next time, wdyt?
[07:34:15] (also no way I'm doing that alone :) )
[07:35:17] zpapierski: fine by me :)
[07:36:18] ok, let's discuss the LT composition tomorrow
[08:45:54] dcausse: apparently the SLO dashboard isn't super complicated - https://engineering.bitnami.com/articles/implementing-slos-using-prometheus.html
[08:46:18] I haven't read much into that
[08:46:47] actually, I think I will, I haven't done much playing around in Grafana
[08:46:50] (I think)
[08:46:58] but for now, meal break
[08:55:59] zpapierski: interesting, if we can reuse existing tools that's awesome
[09:47:43] o/ Does the query service updater have a way to be pointed at a server / service with a host, and pass a Host header for the site it actually wants?
[09:48:07] ie, make requests to a service called "mediawiki" on port "1234", for the site "mywiki.domain.com"?
[09:50:18] addshore: the updater or the sparql endpoint?
[09:50:35] not sure I understand what you want to do if you mean the "updater"
[09:50:38] the updater, for rdf retrieval
[09:50:57] and if not, any idea where in the magical code we should try and add such a thing?
[09:51:25] so you would like to force the updater to target a particular MW api server?
[09:51:31] yup
[09:51:34] ok got it
[09:51:49] is it for the wmf install or wbstack?
[09:52:17] actually it might be the same code...
[09:53:07] wbstack
[09:53:40] right now we are on org.wikidata.query.rdf 0.3.84
[09:54:14] addshore: I think the code is mostly there, it's just a new option to pass, trying to find a link to example code
[09:54:32] That encouraging :D
[09:54:34] *that is
[09:59:15] addshore: good news, it's already done :)
[09:59:35] party party party!
[10:02:20] addshore: add this to the updater cmdline: -Dorg.wikidata.query.rdf.tool.wikibase.WikibaseRepository.proxyMap=hostname1=https://targethost1:port1,hostname2=https://targethost2:port2,...
[10:03:17] thanks!
[10:03:40] for us it looks like: www.wikidata.org=http://localhost:6500
[10:03:51] http://localhost:6500 being the envoy listener
[10:04:12] nice!
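A minimal sketch of how the proxyMap property discussed above fits on the updater command line. The property name, the hostname=URL pattern, and the wbstack mapping are the ones quoted in the log; everything else about the invocation depends on the deployment and is only illustrative.

```
# General shape given above: comma-separated hostname=targetURL pairs.
# Requests for each hostname are sent to the mapped target, while the updater
# still asks for the original site (which is what addshore was after).
-Dorg.wikidata.query.rdf.tool.wikibase.WikibaseRepository.proxyMap=hostname1=https://targethost1:port1,hostname2=https://targethost2:port2

# wbstack case from the log: RDF fetches for www.wikidata.org are routed to
# the local envoy listener on port 6500.
-Dorg.wikidata.query.rdf.tool.wikibase.WikibaseRepository.proxyMap=www.wikidata.org=http://localhost:6500
```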
[10:04:20] many many thanks
[10:04:37] yw, happy to help by doing nothing :P
[10:04:44] ;D
[10:10:16] lunch
[11:46:11] if I'm using Grafana correctly, we spend 88.9% of the time below 5min lag
[11:46:23] over the last year
[11:47:26] (at least in codfw)
[11:58:46] I might be using an incorrect query, though, here's what I use https://www.irccloud.com/pastebin/WU0ovijx/
[12:05:30] I get 0.72 on thanos (codfw+eqiad) for the public cluster
[12:05:52] 0.96 for the internal cluster
[12:07:52] hard to tell if the query is correct tho
[12:10:21] hard to tell if the query is correct tho but the diff between wdqs and wdqs-internal makes sense
[12:14:31] I should account for pooled/depooled status, joal suggests using the number of requests for a host / total requests as a weight
[12:16:09] good idea
[12:24:40] the metric could be the number of times a query hit a server with a lag > 10m
[12:29:53] yeah, I think that shows the real-life impact better
[12:29:57] lunch
[13:40:38] dcausse: so I understand thanos is the way to go if I want metrics from both DCs?
[13:41:03] zpapierski: thanos will have historical (less precise) data
[13:41:23] is there another way of having both DCs at the same time?
[13:41:24] while grafana only has at most 6 months of data
[13:41:31] ah, I see
[13:42:06] so I think thanos should be preferred for this use case: set a baseline for 2020/2021
[13:43:08] ok, I'll use that
[14:12:57] mpham: don't quote the numbers yet, I need to triple-check the calculation + this doesn't take into account pool status - https://grafana-rw.wikimedia.org/d/yCBd7Tdnk/wdqs-lag-slo?viewPanel=2&orgId=1&var-cluster_name=wdqs&var-lag_threshold=600&var-slo_period=1d
[14:14:48] I'm working on having a more subjective metric that takes into account how heavily requested the instance is - basically working with the assumption that the more an instance is queried, the more its lag matters
[14:15:18] dcausse: is there anything in prometheus already that would allow me to gauge the pooling status at a given time?
[14:16:09] zpapierski: no clue but I doubt that this info is available there
[14:16:21] I thought as much :(
[14:16:32] but maybe requests will be enough
[14:47:16] ok, so I don't know how to do that request-based instance weight yet, but for now I hacked this:
[14:47:17] rate(blazegraph_queries_done_total{cluster="$cluster_name"}[5m]) < bool 1 == 1)
[14:47:55] ah, no, that one's broken
[14:50:08] I suppose that should be enough
[14:50:08] rate(blazegraph_queries_done_total{cluster="$cluster_name"}[5m]) < bool 1
[14:50:34] this shows 1s for the hosts that had more than 1 req/s, 0 otherwise (I hope)
[14:50:47] I'm not super sure how to connect it with the other query, though
[14:52:25] I guess this one will be more useful (it erases values for hosts that aren't used): rate(blazegraph_queries_done_total{cluster="$cluster_name"}[5m]) > bool 1 == 1
[14:52:58] I think you can multiply using vector matching: https://prometheus.io/docs/prometheus/latest/querying/operators/#vector-matching
[14:53:18] I believe you need to multiply anyway because the updater is making queries
[14:53:29] so even depooled servers will see some queries
[14:54:08] I assumed as much, but I guess those will be the hosts with < 1 rps? I need to cut them off somewhere
[14:54:27] I think that multiplication is what I wanted, thanks!
[14:57:48] I saw that the last server has been transferred for the new streaming updater! Awesome job everyone! Super exciting! I'll prep some announcements for the community as well as internally
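A sketch of the vector-matching multiplication suggested at 14:52, combining the lag check with the activity check per host. `wdqs_lag_seconds` is a stand-in for whatever lag expression the pasted query uses (the paste itself isn't reproduced in the log); the 600s threshold and the 1 req/s cutoff are the values discussed above, and the caveat about updater-generated queries inflating the request rate still applies.

```
# 1 for samples where a host was both over the 600s lag threshold and serving
# more than 1 query/s; 0 otherwise.  "bool" turns each comparison into a 0/1
# series, and "on (instance)" matches the two sides host by host.
  (wdqs_lag_seconds{cluster="$cluster_name"} > bool 600)
* on (instance)
  (rate(blazegraph_queries_done_total{cluster="$cluster_name"}[5m]) > bool 1)
```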
[14:58:25] zpapierski: thanks for the SLO dashboard
[14:58:56] it's only a beginning, but it seems we should be good with what we have rn
[15:03:07] mpham: a piece of explanation for the parameters - the calculation is made for the duration in the lag period variable and for the newer edge of the selected time range. I'm not sure how to create one that just uses the time range, I'll try to figure it out later
[15:03:57] ah, the variable name is actually "Period to calculate"
[15:07:26] right now we've stated our KR as having <10min lag time. it seems like maybe we should modify it to be more of this SLO style, i.e. being below the lag threshold x% of the time?
[15:08:07] true
[15:09:20] though I'd do that once we settle on how we calculate that - the metric I chose as a start isn't a very good one, since it doesn't yet account for pooling status
[15:11:45] gotcha. makes sense
[15:21:45] the second, adjusted metric probably isn't correct yet, but I'm getting too tired and too stupid to understand PromQL - https://grafana.wikimedia.org/goto/kpZ1dTd7k
[15:22:38] I calculate the SLO per instance that's active and take the lowest score as the current one (at least that's my intention)
[15:23:07] I'm not sure you can easily identify active instances like that
[15:31:13] for instance wdqs1013 was depooled yesterday at 7:30 and repooled at 12:00
[15:31:15] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%5B%221634539084017%22,%221634572683685%22,%22eqiad%20prometheus%2Fops%22,%7B%22exemplar%22:true,%22expr%22:%22rate(blazegraph_queries_done_total%7Bcluster%3D%5C%22wdqs%5C%22,%20instance%3D%5C%22wdqs1013:9193%5C%22%7D%5B5m%5D)%3Ebool%201%22,%22requestId%22:%22Q-a981e2d4-ba3d-45f2-a1ca-4d1422cbd2b5-0A%22%7D%5D
[15:31:18] it still shows as active
[15:31:20] \o
[15:31:23] o/
[15:32:54] o/
[15:34:27] dcausse: interesting, do you know why it has such a high rps - the updater?
[15:34:46] the old updater is making a lot of queries
[15:34:53] the new one is making queries too tho
[15:34:59] but a lot less
[15:35:33] hmm, so rps might be useless
[15:36:29] unless we could eliminate the updater's UA...
[15:40:02] it can still be weighted if it's multiplied in somehow
[15:40:25] certainly won't be accurate
[15:40:45] can't we subtract the rps generated by the updater's UA?
[15:45:09] the old updater varies between 2 and 5 rps, the new one is around 0.7 rps
[15:47:26] getting something more precise would involve spark and the wdqs query log
[16:42:03] * ebernhardson wonders if oath auth could have a worse name
[16:42:13] (completely unrelated to oauth, of course :)
[17:01:32] not sure how I ended up on this page but it started from trying to understand what the oath toolkit is about, https://en.wikipedia.org/wiki/List_of_company_name_etymologies (did not know it existed)
[17:04:03] dinner
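Following up on the req/s discussion above: without per-user-agent metrics (which, as noted, would need spark and the wdqs query log), one crude PromQL-only workaround is to raise the activity cutoff above the old updater's own background rate. This is only a sketch built from the figures quoted above (old updater 2-5 rps, new one ~0.7 rps) and would misclassify pooled hosts that happen to serve little external traffic.

```
# Treat a host as "active" only when it serves clearly more than the old
# updater's own 2-5 req/s of background queries.
rate(blazegraph_queries_done_total{cluster="$cluster_name"}[5m]) > bool 5
```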