[08:09:37] Zbyszko did a talk at Haystack on Tour: https://www.youtube.com/watch?v=SYwZnuQB6hY
[08:46:46] nice! reminds me of the simple morelike query we use to find similar pages
[08:58:36] pfischer: I'll be 3 minutes late for our 1:1. Bad timing on the coffee
[09:01:14] No worries
[09:54:08] Trey314159: this might be a question for you: https://www.mediawiki.org/wiki/Topic:Xgxxizo3t0h7nxsa quickly looking, ᾴ gets normalized as άι by both the icu_folding and icu_normalizer token filters (see the _analyze sketch below)
[09:58:12] Lunch
[12:30:56] o/ Hi! I'm currently working on ingesting redirects in the cirrus-search-update pipeline (https://phabricator.wikimedia.org/T325315). In particular, I'm looking into deleted revisions of redirects. I already fixed QueryBuildDocument.php to gracefully handle missing revisions and combined the query with prop=deletedrevisions to get deleted revisions including their content, so I can parse/determine where they were redirecting to (see the query sketch below). However, that only works if the user making the GET request has permission to see deleted revisions. Now I'm wondering: can we guarantee those permissions when making requests from within the cirrus-streaming-updater (Flink), where we do not know the target wiki upfront, or do we have to handle deleted revs inside the CirrusSearch extension?
[12:33:09] pfischer: hm.. I doubt that we'll create an account for the flink app
[12:34:02] such perms require an account, and the perm to see deleted revision content will require approval from individual admins
[12:35:09] Alright, thought so... I'll look into a CirrusSearch-internal solution then.
[12:38:47] pfischer: yes, doing this on top of the page-change stream seems hard...
[12:44:10] pfischer: also, deletedrevisions is, I think, made to access "suppressed" revisions, which is "rare"; past revisions' content is usually public
[12:45:37] now I'm wondering how cirrus is handling the case of a redirect changing its target
[12:47:06] not seeing anything obvious in the code base yet
[12:48:21] https://www.irccloud.com/pastebin/Mizu8wJl/inconsistencies2023-04-26
[12:49:07] I intend to restart the download and would like to know which URLs I need to use for the download.
[12:50:11] seppl2023: dumps.wikimedia.your.org has stopped mirroring wikibase entity dumps
[12:52:24] seppl2023: looking quickly over mirrors, you might find them on https://datasets.scatter.red/ or simply use dumps.wikimedia.org directly
[12:53:16] o/
[12:55:39] I see, so this is a cut & paste problem from adshore's 2019 description. I never used these your.org links before and didn't even realize I was doing it, assuming that the links by adshore were somewhat "official". So it will be a nice side effect to clarify the "official" links. Will restart the download later today...
[12:57:06] seppl2023: still trying to find my notes, but I think https://wikimedia.bringyour.com/ has the fastest mirror
[13:01:22] well... damn. Looks like they don't mirror it anymore
[13:01:53] https://wikimedia.bringyour.com/other/ is a 404
[13:23:22] inflatador: I'll be there on time!
[13:27:38] gehel: ACK
[13:42:26] Just marked the previous attempt as failed and started a new one at https://wiki.bitplan.com/index.php/Wikidata_Import_2023-04-26, using a download script for reproducibility and to start 4 downloads in parallel to avoid timing problems (see the download sketch below). The download rate for the RWTH i5 server is > 10 MByte/sec, which I consider good enough.
[13:44:24] I am going for the ttl files. Doesn't the munger use the json files, or should I simply download the json, ttl and nt files for reference?
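A quick way to reproduce the normalization discussed at 09:54:08 is Elasticsearch's _analyze API, which accepts an ad-hoc tokenizer/filter chain. This is a minimal sketch, assuming a local cluster with the analysis-icu plugin installed; the host and printed result are not from the thread.

```python
#!/usr/bin/env python3
"""Sketch: check how icu_folding normalizes ᾴ via the _analyze API.

Assumes a local Elasticsearch on :9200 with the analysis-icu plugin;
swap "icu_folding" for "icu_normalizer" to compare the two filters.
"""
import json
import urllib.request

body = json.dumps({
    "tokenizer": "standard",
    "filter": ["icu_folding"],
    "text": "ᾴ",
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:9200/_analyze",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    tokens = json.load(resp)["tokens"]

# The thread above reports άι as the result for both filters.
print([t["token"] for t in tokens])
```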
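For the 12:30:56 question, this is roughly what a combined live/deleted revision query against the MediaWiki Action API looks like. A hedged sketch only: it assumes the requests library and an example wiki and title, and prop=deletedrevisions only returns content if the caller has the deleted-text rights, which is exactly the permission problem raised above.

```python
#!/usr/bin/env python3
"""Sketch: fetch live and deleted revision content for a (redirect) page.

The wiki URL and title are examples; deleted content is only returned
when the requesting user has the permissions discussed above.
"""
import requests

API = "https://en.wikipedia.org/w/api.php"  # example wiki

params = {
    "action": "query",
    "titles": "Some_redirect_page",  # hypothetical page title
    "prop": "revisions|deletedrevisions",
    "rvprop": "content", "rvslots": "main",
    "drvprop": "content", "drvslots": "main",
    "format": "json", "formatversion": 2,
}

data = requests.get(API, params=params).json()
for page in data.get("query", {}).get("pages", []):
    for rev in page.get("revisions", []) + page.get("deletedrevisions", []):
        content = rev.get("slots", {}).get("main", {}).get("content", "")
        # A redirect's wikitext starts with "#REDIRECT [[Target]]".
        if content.lstrip().upper().startswith("#REDIRECT"):
            print("redirect:", content.splitlines()[0])
```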
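And a minimal sketch of the kind of parallel download script mentioned at 13:42:26, using only the Python standard library. The mirror URL and file names are illustrative; the actual script is on the wiki page linked above.

```python
#!/usr/bin/env python3
"""Sketch: download several dump files with 4 parallel workers.

BASE and FILES are illustrative; pick the ttl dumps you actually need
from the mirror you settled on.
"""
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

BASE = "https://dumps.wikimedia.org/wikidatawiki/entities/"
FILES = ["latest-all.ttl.bz2", "latest-lexemes.ttl.bz2"]

def fetch(name: str) -> str:
    # Stream to disk in 1 MiB chunks so large dumps don't sit in memory.
    with urllib.request.urlopen(BASE + name) as resp, open(name, "wb") as out:
        shutil.copyfileobj(resp, out, length=1 << 20)
    return name

# 4 workers, matching the "4 downloads in parallel" mentioned above.
with ThreadPoolExecutor(max_workers=4) as pool:
    for fut in as_completed([pool.submit(fetch, f) for f in FILES]):
        print("done:", fut.result())
```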
[13:46:42] seppl2023: only the ttl files; also note that you're downloading both the bz2 and gz versions of the same file
[13:50:00] I think we do use the bzip version when reloading
[13:57:04] dcausse: you have anything today for pairing? Was thinking of integrating transfer.py into data-transfer.py if not
[13:58:12] inflatador: not going to ship anything wikikube during the switch, so I don't have anything
[13:58:34] dcausse: OK, let's skip for today then
[13:58:42] ok
[14:32:04] dcausse, mpham: WMF/WMDE checkin: https://meet.google.com/bfe-uzwh-ytj
[15:26:45] ottomata: We're currently facing an issue with delete events on the page_change topic: if the deleted revision/page was a redirect, it's hardly possible to figure out the redirect target from outside MW. Do you think that augmenting the delete event with that information would be okay?
[15:36:02] pfischer: I think that's a good idea, why not!
[15:36:09] esp. if we have that info in the hook.
[16:07:05] workout, back in ~40
[16:46:56] back
[17:29:36] dinner
[18:02:50] lunch, back in ~40
[18:32:20] realized I wrote the patch to change the mjolnir daemons to use the conda envs, but never deployed it. Going to deploy now
[18:39:18] close, but not quite :( the conda env isn't natively relocatable; basically, the shebang lines have absolute build paths in them, but it would work if it did env/bin/python env/bin/
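A sketch of the workaround hinted at in that last message: since the scripts in a relocated conda env carry stale absolute shebang paths, launching them through the env's own interpreter avoids the shebang entirely. The env location and daemon name here are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: launch a conda-env entry point without trusting its shebang.

Conda envs are not natively relocatable: scripts under env/bin/ keep the
absolute build-time interpreter path in their shebang line. Running them
as `env/bin/python env/bin/<script>` sidesteps that.
"""
import subprocess
from pathlib import Path

ENV = Path("/srv/mjolnir/env")            # hypothetical deployed env path
SCRIPT = ENV / "bin" / "mjolnir-daemon"   # hypothetical entry-point script

# The env's own python interprets the script, so the stale shebang
# inside the script is never consulted.
subprocess.run([str(ENV / "bin" / "python"), str(SCRIPT)], check=True)
```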