[09:10:59] pfischer: there isn't much about the Search Update Pipeline in our standup notes (https://etherpad.wikimedia.org/p/search-standup). Do you have something to add?
[09:15:41] gehel: nothing search update pipeline specific. I can go on about coming up with a shared schema for (almost) all kinds of link types MW supports.
[09:19:44] gehel: added two items, sorry for the delay.
[09:21:40] Thanks
[09:47:53] Trey314159: the acceptance criteria on T332355 include ensuring that the Cirrus Docker image is updated as well. Is that the case? Or maybe dcausse knows how to check?
[09:47:53] T332355: Deploy Turkish Analyzer Plugin - https://phabricator.wikimedia.org/T332355
[09:48:37] Oh, that should probably just be part of updating the .deb package of our plugins, which will be dragged into the docker image?
[09:53:28] gehel: I think we pin the exact version of the deb, lemme see
[09:56:11] lunch
[11:16:44] inflatador: re T332355 (please ignore if this was intentional) it seems that we uploaded the wmf search plugins deb package to the "thirdparty" pool rather than "component" in the apt registry
[11:16:45] T332355: Deploy Turkish Analyzer Plugin - https://phabricator.wikimedia.org/T332355
[11:18:12] lunch
[11:55:35] dcausse: thank you for your comment https://phabricator.wikimedia.org/T331399#8898339
[12:46:50] weekly status published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-06-02
[13:09:18] dcausse I'm seeing the package under "component"? https://apt-browser.toolforge.org/bullseye-wikimedia/component/elastic710/
[13:09:42] actually, they're under both
[13:10:13] I'm not sure where the correct place is...any suggestions?
[13:10:23] pretty sure we have a ticket somewhere with the info
[13:14:29] https://phabricator.wikimedia.org/T318820#8370139
[13:24:01] inflatador: indeed looks like it's a bit all over the place... I have no clue where the best place is, the last one we used (plugins_7.10.2-4~bullseye_all.deb) was under component and the new one just built (plugins_7.10.2-5~bullseye_all.deb) is under thirdparty
[13:24:30] Moritz might have ideas on where we should put it?
[13:25:37] dcausse Ah, thanks for pointing out the version difference. I can at least update /component to use the latest version
[13:29:29] dcausse, ottomata: responded and now I’m curious: As of now, we do support looking up pages via CirrusSearch API in the source domain of the event, but not across domains/wiki instances. Right? So if we get a page_change event from fr.wiki for a redirect to en.wiki, that would require replacing the InputEvent’s domain with the redirect target’s domain for the revision fetching to work as expected.
[13:30:19] Are cross wiki redirects a thing (we want to support)?
[13:31:16] pfischer: right I was about to comment about this, I think we would be fine if we just ignore (not set the link_target in the page_state schema) links that point to something external
[13:31:48] I mean if this helps not think about how to model interwiki links
[13:32:21] we would just know that it's a redirect via the top level "is_redirect"
[13:33:56] sorry I meant the "absence of the redirect" as added in https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/914867/7/jsonschema/mediawiki/page/change/current.yaml#49
[13:36:30] Hm, my implementation in EventBus currently would include interwiki redirects. I would expect that we decide inside our update pipeline whether we want to follow/resolve a redirect or not.
[13:38:30] Alternatively, I would have to introduce `local_redirect` instead of `redirect` to make it transparent why it might not be set (in case of anything but a local link)
[13:42:05] pfischer: if we can model an interwiki link I'm all for it
[13:43:10] So if we want to handle interwiki redirects, the input event would have to carry more information about the target. Besides page_id and page_namespace, we’d also need the target domain, which would have to be resolved inside EventBus
[13:47:07] yes... interwiki links can either be "local" or "external": one would be a target to a project+[possibly a language] (wikipedia, de), the other is more a URL (https://gerrit.wikimedia.org)
[13:47:24] I'm not sure that a single additional to model all this might be enough
[13:47:33] *I'm not sure that a single additional *field* to model all this might be enough
[13:51:22] if we can model interwiki redirects and have the data, i'm for it too
[13:51:35] hm, should the page model include info about the wiki it is on? Hmmmm
[13:52:01] wouldn't want to use the interwiki prefix
[13:52:21] but maybe we could look up the project/domain?
[13:52:44] looking up interwiki info is pretty cheap
[13:53:59] okay, so maybe we add some normalized project info to the page entity model
[13:54:19] and then add the redirect_target_page as a field on change/page model $refing page?
[13:58:22] hm I'm looking into the \Interwiki class, this might not be very trivial, the actual info about the project/language is actually maintained in SiteMatrix IIRC (which we probably don't want to use)...
[13:59:43] unless we model interwiki links as: domain, url (which then is very far from the page_link model)
[14:01:06] OK, the latest wmf-elasticsearch-search-plugins package is now in the correct component
[14:02:59] inflatador: thanks!
[14:04:16] inflatador: the tar.gz and the description are not there, not sure if this is a problem tho
[14:08:35] gehel, I added the docker image update to the acceptance criteria but I don't know where it happens. Sounds like dcausse does know more, as always, and he's in the right time zone to answer your questions, too!
[14:11:31] pfischer: just saw your comment and sorry I was not very clear, page_domain might not be required for us as we'll never fetch something that's not on the same wiki as the source event
[14:12:19] page_id+is_redirect is everything we'd need
[14:12:44] if page_id is not set, we ignore, if is_redirect is true we ignore
[14:14:25] page_id should only be set if the link is to the same wiki so it indirectly tells us if it's an interwiki link or not
[14:16:27] well... page_id being null might also tell us that it's a broken link so please ignore my last statement
[14:20:44] Ah, makes sense. I wouldn’t be able to look up interwiki target page_id anyways. So no need for a project/domain, as you said.
[14:22:01] dcausse not sure myself, I'll ask moritz-m
[14:23:18] no, the only drawback with what is suggested by ottomata is that the redirect_target_page would be unset even if it's a redirect when it points to another wiki (interwiki), perhaps we should name it redirect_target_local_page as you suggested?
[14:26:37] and we need better naming overall, everything's so confusing: local page, "local" interwiki, external interwiki...
[14:28:45] Instead of omitting it, we could still pass `redirect` and declare it as `allOf` `/fragment/mediawiki/state/entity/page` and an additional property of the interwiki/project/domain of the page.
[14:29:37] \o
[14:29:39] if we know what info we want why not?
[14:29:41] o/
[14:32:58] sigh it seems we also have interwiki prefixes that point to the same wiki... so "local" interwiki is an ambiguous naming
[14:39:59] Hehe, that could be something I can clean/normalize inside EventBus, before the event is published
[14:44:34] I'm sure MW does normalize at some point, on https://www.mediawiki.org/wiki/User:DCausse_(WMF)/Test I added a [[mw:Main]] link which is technically speaking an "interwiki" link but it does not get stored in the iwlinks table
[15:18:14] Good to know. I’ll check how things look from EventBus’ perspective on Monday.
[15:30:27] have to work out a little early today. Will miss gaming unmtg ... and back in ~45
[15:31:22] going offline early, have a nice week-end
[16:25:54] back, but my computer is acting up...going to wait for backup to finish and then restart
[16:32:17] apparently ansible supports argument specs for roles now...pretty nice https://steampunk.si/blog/ansible-role-argument-specification/#
[16:34:03] https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_reuse_roles.html#role-argument-spec
[16:36:10] oops, ignore the last 2 posts
[17:20:11] lunch, back in ~1h
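
Note on the redirect handling discussed between 14:11 and 14:16: the rule stated there ("if page_id is not set, we ignore, if is_redirect is true we ignore") could be sketched roughly as below. This is only an illustration in Python, not the pipeline's code. The `LinkTarget` class and the `should_fetch_target` function are hypothetical names, and reading `is_redirect` as a property of the target page is an assumption. As noted at 14:16, a missing page_id cannot distinguish an interwiki target from a broken link, so the sketch treats both the same way.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkTarget:
    """Hypothetical stand-in for a redirect target carried by a page_change event."""

    # None when the target is on another wiki (interwiki) or the link is broken.
    page_id: Optional[int]
    # Assumption: True when the target page is itself a redirect.
    is_redirect: bool = False


def should_fetch_target(target: Optional[LinkTarget]) -> bool:
    """Return True only if the target looks fetchable on the source event's wiki."""
    if target is None:
        # The event carries no redirect target at all.
        return False
    if target.page_id is None:
        # Interwiki or broken link: nothing to look up locally.
        return False
    if target.is_redirect:
        # Per the chat, targets flagged is_redirect are ignored.
        return False
    return True
```

With this shape no page_domain field is needed, which matches the conclusion at 14:20 that a project/domain is not required when only same-wiki targets are ever fetched.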