[07:33:46] dcausse: there's a question about different results on WDQS -main vs full - https://www.wikidata.org/wiki/Wikidata_talk:SPARQL_query_service/WDQS_backend_update/September_2024_scaling_update#Will_-main_be_faster could you have a look?
[07:34:11] gehel: sure, looking
[07:38:32] perhaps I'm not looking at the right query but both endpoints return the same number of lines (4758)
[07:41:36] Yep, I saw that as well. I replied on the talk page.
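(For reference, a minimal sketch of the kind of comparison above: run the same SPARQL query against the full and -main endpoints and compare row counts. The -main endpoint URL and the sample query are assumptions for illustration, not taken from the talk-page thread.)

```python
# Sketch: compare result counts between the full WDQS endpoint and the
# split -main endpoint. The -main URL and the query are assumptions here.
import requests

ENDPOINTS = {
    "full": "https://query.wikidata.org/sparql",
    "main": "https://query-main.wikidata.org/sparql",  # assumed -main URL
}

QUERY = """
SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 10000
"""  # placeholder query; substitute the one from the talk page

def row_count(endpoint: str) -> int:
    resp = requests.get(
        endpoint,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wdqs-result-diff-sketch/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    # SPARQL JSON results: one binding per result row
    return len(resp.json()["results"]["bindings"])

if __name__ == "__main__":
    counts = {name: row_count(url) for name, url in ENDPOINTS.items()}
    print(counts)
    if len(set(counts.values())) == 1:
        print("both endpoints return the same number of rows")
```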
[09:51:05] lunch
[12:45:07] dcausse: I came across another question regarding CirrusSearch and weighted tags. `CirrusSearch` delegates to `Updater`; from that moment on, a `string $tagField` is passed around that specifies which property of the document is to be updated. Was there ever another property, besides `weighted_tags`, that held weighted tags?
[12:47:13] pfischer: historically there were some dedicated fields for tag use-cases but I doubt this method was ever used to populate something else than weighted_tags
[12:48:02] dcausse: Alright, thank you! Do we still want to support that use case?
[12:48:10] no
[12:48:57] Great! So I could get rid of that parameter.
[12:49:33] (only if we do not consider `Updater` public API)
[12:55:55] dr0ptp4kt: could you review https://www.mediawiki.org/wiki/Wikimedia_Search_Platform/Decision_Records/Search_backend_replacement_technology in light of the recent licensing change? Should we create another decision record, or amend this one?
[13:11:08] o/
[13:29:57] pfischer dcausse I worked on a new version of the Saneitizer panel @ https://grafana.wikimedia.org/goto/0DbLoVeIg?orgId=1 . If you have time, could you compare it to the old one ( https://grafana.wikimedia.org/goto/y1VYTVeSR?orgId=1 ) and let me know if the new one looks OK?
[13:30:57] pfischer: nice, yes the Updater class should be considered internal
[13:31:05] o/
[13:31:10] per ebernhardson, the "uncategorized" fixes are fixes that have the "problem" label but an empty value
[13:31:19] inflatador: thanks! will take a look
[13:32:46] weird...
[13:43:09] also, feel free to hit me up if/when https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070955/5 is ready for review
[13:46:44] \o
[13:58:29] o/
[14:02:42] ah, unCategorized are datapoints from when there was not this problem label; this graph uses a 1-week time window
[14:04:46] should be gone by next week hopefully
[14:10:12] dcausse: i'm not so sure, even changing the agg to 1h i was still seeing them yesterday
[14:10:40] hm the train might have run late yesterday?
[14:12:09] they seem gone now with 1h (went to flat around 20:00 UTC)
[14:12:34] oh interesting, they do seem to have gone away now. That's good
[14:13:27] it also shows that our primary problem is old versions in the index, not sure if that means fetch failures or what's going on exactly
[14:14:09] it's also curious how it seems bunched up, even though the saneitizer now runs at a consistent and measured speed
[14:14:26] (looking at the last 24 hrs of the 1hr agg)
[14:16:42] oldVersion is solely via rev_id?
[14:17:55] yea, it's only mentioned once in Checker.php
[14:20:37] could be some workflow we don't handle correctly but hard to tell without concrete examples...
[14:20:47] numbers are quite different between the clusters tho...
[14:21:06] yea the clusters are quite different, although nothing makes them work on the same set of pages at the same time anymore
[14:21:26] yes, I suppose the saneitizer loops got so desynchronized between those clusters that it's hard to compare
[14:22:24] i suppose we should add something that logs the exact pages so we can do further analysis. I was kinda hoping this would all be redirect-in-wrong-index errors
[14:23:12] do we log the page_ids for fetch failures in SUP?
[14:23:32] wondering if we will be able to differentiate between fetch failures and "didn't try"
[14:23:54] yes: https://schema.wikimedia.org/repositories//primary/jsonschema/development/cirrussearch/update_pipeline/fetch_error/current.yaml
[14:24:17] oh neat, somehow i didn't realize we had a side output for that
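(A minimal sketch of how that side output could feed the analysis discussed above: tally fetch_error events per wiki and page from a newline-delimited JSON dump of the stream. The field names `wiki_id` and `page_id` are assumptions suggested by the schema name; check current.yaml for the real ones.)

```python
# Sketch: count fetch failures per (wiki, page) from a newline-delimited
# JSON dump of cirrussearch.update_pipeline.fetch_error events.
# Field names below are assumptions, not verified against the schema.
import json
import sys
from collections import Counter

def tally(lines):
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        counts[(event.get("wiki_id"), event.get("page_id"))] += 1
    return counts

if __name__ == "__main__":
    # Usage: kafkacat/eventstream dump piped in on stdin, one event per line
    for (wiki, page_id), n in tally(sys.stdin).most_common(20):
        print(f"{wiki}\t{page_id}\t{n}")
```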
[14:32:20] inflatador: I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070955 should be ready for review
[14:41:13] gehel: i think we ought to use the same page and amend it once we have the critical details: (1) license input (requested, they have a typical SLA of 5 day response time, but i imagine it'll take another week after first response, maybe somewhat longer), (2) verification of embedding capabilities, (3) team consensus, and finally (4) verification with management chain plus SRE-at-large
[14:41:21] (whatever the proposed direction is). we should devote 5 minutes of discussion to (3) - i think i know where we're at here but it would be good to re-confirm understanding. for (2) i think we need a "research spike" if (3) is a settled consensus. (1) could conceivably complicate things. (4) i think depends on all of (1), (2), and (3).
[14:41:31] as far as where the decision right sits, i believe it is still the case that this is a local decision for the team's bounded contexts...although obviously this does not exist in a total vacuum here nor in the broader software ecosystem.
[14:43:52] friends, i'm pretty preoccupied with Data Platform strategy industry landscape review and drafting stuff...i'm trying to keep up with the backscrolls, but do let me know if there are hot potatoes (including time-pressure code review)
[14:44:46] dcausse I didn't see patchset 6 before merging but everything looks OK
[14:45:00] I had already reviewed PS 4 and 5 before then
[14:46:06] ps6 seems to be a rebase
[14:50:02] oh...I guess that happened automatically when I merged
[14:50:18] ah yes might be it
[15:38:24] heading out, have a nice week-end
[15:48:24] dr0ptp4kt: fwiw i'm mixed on which direction to go ... i have concerns both ways :P
[15:50:33] peter (wikidata community member) was asking if there's a way to get the Wikidata Query Service to report the total time taken to evaluate a query
[15:51:13] i don't think we do any queueing, isn't it just the wall clock?
[15:51:15] The UI reports something like `48 results in 404 ms` which I think is exactly what he's asking for...only thing I'm unsure of is if that's just the time for the backend to respond or includes intermediate stuff (caching layer etc)
[15:51:57] i suppose arguably he could be looking for cpu time vs wall time, which might be higher since iiuc blazegraph can use multiple cores to serve a query, but unsure
[15:52:11] at least, it sure uses a lot of threads :P
[16:07:46] unrelatedly...some editor patterns (that are totally reasonable!) make debugging things tedious :P This editor creates a page at `User:/sandbox`, edits the page a few times, then moves it into the main namespace. But that makes tracking backwards through what happened difficult
[16:30:28] Peter asked on my talk page as well. I'm not sure what he is looking for. The UI displays the running time, or he could just time the request.
[16:30:55] We don't have a measurement that includes parallelization afaik
[16:31:05] On that note, weekend time!
[16:52:59] back
[17:23:10] The time reported by the WDQS frontend is just however long it takes to retrieve the result. If it's cached it's really fast, of course.
[17:23:36] If there's a delay for network reasons, that counts in the timer as well
[17:33:45] hmm, joining the most recent cirrus snapshot (taken 2024-09-01) against the mw sql dump (2024-08, no clue what day), I find potentially 77k incorrectly indexed redirects. But of course with the mis-alignment between dumps it could be 0
[17:35:17] maybe the frontend rendering code should count redlink redirects and increment them into prometheus, to give a better realtime guess at how often we have bad redirects indexed
[18:07:44] very not clear where that kind of code might go :S
[19:03:35] realized we excluded labswiki, but not labtestwiki, from SUP producers. Unlikely to get events over the weekend, but went ahead and deployed a fix anyways
[22:31:34] cloudelastic just had to have problems near end of day on friday :S
[22:44:00] no clue what was wrong...but i restarted consumer-cloudelastic and it seems to be consuming again
[22:58:11] hmm, for some reason the elasticsearch sink for cloudelastic had gone from 1-2s per update to 30-40s, but again not sure why
[23:08:56] and somehow the saneitizer for cloudelastic stopped :S
[23:08:59] (since the restart)
[23:09:36] oh well, it's fine if saneitizer is broken for a weekend...g'night!
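(To illustrate the idea floated at 17:35 above, counting redlink redirects into Prometheus, here is a minimal sketch using the Python prometheus_client library. The metric name, labels, and the is_redlink_redirect check are all made up for illustration; actual CirrusSearch frontend code would be PHP and would go through whatever metrics interface MediaWiki provides.)

```python
# Sketch only: count "bad redirect" sightings so they show up in Prometheus.
# Metric name, labels, and is_redlink_redirect() are illustrative assumptions,
# not actual CirrusSearch/MediaWiki code.
from prometheus_client import Counter, start_http_server

REDLINK_REDIRECTS = Counter(
    "cirrussearch_redlink_redirects_total",
    "Search results that turned out to be redirects to missing pages",
    ["wiki"],
)

def is_redlink_redirect(result: dict) -> bool:
    # Hypothetical check: the indexed doc claims to be a redirect,
    # but the target page no longer exists.
    return result.get("is_redirect", False) and not result.get("target_exists", True)

def record_results(wiki: str, results: list[dict]) -> None:
    for result in results:
        if is_redlink_redirect(result):
            REDLINK_REDIRECTS.labels(wiki=wiki).inc()

if __name__ == "__main__":
    start_http_server(9099)  # expose /metrics for Prometheus scraping
    record_results("enwiki", [{"is_redirect": True, "target_exists": False}])
```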