[09:38:55] inflatador: when you're scheduling a conversation about flink / k8s, can you include Olja and myself?
[09:43:02] Weekly status update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-06-23
[09:56:57] lunch
[10:23:01] lunch
[12:04:53] o/ I’ve been working on handling deletes of redirects when I noticed that we do not get redirect_page_link data for deleted pages. Given pages A and B, with B redirecting to A: if I delete B, I would expect the redirect field of the ES document of A to be empty afterwards. However, that cannot be implemented without knowing the redirect target. The MW hook onPageDeleteComplete gets passed a clone of
[12:04:53] a WikiPage instance that is created before deleting the page. Since $page->getRedirectTarget() does not get called, the internal cache $mRedirectTarget remains NULL and there’s no way to restore that target once the page has been deleted from the page and redirect tables.
[12:15:18] pfischer: so for these cases I would check whether MW simply triggers a LinksUpdate on A
[13:01:52] o/
[13:35:23] dcausse: Thanks! It does indeed trigger a links-changed event for the redirect source page (Page B), with an interesting payload: "page_is_redirect":false, "removed_links":[{"link":"/wiki/Page_A","external":false}]}
[13:35:35] inflatador, ryankemper: I've checked https://gerrit.wikimedia.org/r/c/operations/puppet/+/930191 and I think it is ready to be merged (maybe not on a Friday)
[13:36:11] I'll merge it on Monday if you review
[13:36:36] gehel: just looking at that, sorry I missed it before
[13:41:05] inflatador: It's been ready for review for < 1h, you're all good!
[13:41:38] Dropping off kids, back in ~20
[13:48:21] pfischer: actually CirrusSearch does use onArticleDelete, which is triggered before the actual delete and is thus able to know the redirect target
[13:49:42] so if EventBus could also somehow cache the redirect target via this hook and then re-use it when onPageDeleteComplete is called, perhaps you could set redirect_page_link on delete page change events?
[13:50:01] dcausse: Ah, good to know. Based on links-change we’d have to issue remove-redirect operations towards ES for every internal link, since we do not know if this link was a redirect or a regular link. I’ll ask ottomata to see if this is an event to be propagated by EventBus
[13:55:01] I'd be leaning toward trying to set redirect_page_link on page-change events for delete operations
[14:01:07] Me too, let’s see what #data-engineering has to say.
[14:03:35] hm... just realized that the IncomingLinkCount job was also "indirectly" responsible for cleaning up stale redirects when e.g. changing the target of Redirect_A from Page_B to Page_C (without the IncomingLinkCount job Page_B still has Redirect_A in its redirect array...)
[14:06:26] we disabled IncomingLinkCount on Jan 12, can we consider that leaking redirects in this case is OK?
[14:10:04] sorry, been back
[14:11:37] going offline early, have a nice weekend
[14:11:43] Good question. What effect do the redirect array and the incoming_links counter have? Are they used to score results? Where could I find code related to that?
[14:12:07] dcausse: you too, see you!
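(editor's note) The idea floated at 13:49:42, remembering the redirect target in the pre-delete hook and re-using it once the delete has completed, might look roughly like the sketch below. It is a plain Python model of the pattern, not MediaWiki or EventBus code: the class name, method names and the simplified event dict are all hypothetical, and only the hook names and the redirect_page_link field come from the discussion.

```python
from typing import Dict, Optional


class RedirectTargetCache:
    """Hypothetical helper: carries a redirect target from pre-delete to post-delete."""

    def __init__(self) -> None:
        # page_id -> redirect target title, filled in before the delete happens
        self._targets: Dict[int, str] = {}

    def on_article_delete(self, page_id: int, redirect_target: Optional[str]) -> None:
        # Pre-delete step (stands in for onArticleDelete): the redirect row
        # still exists here, so the target can still be resolved and stored.
        if redirect_target is not None:
            self._targets[page_id] = redirect_target

    def on_page_delete_complete(self, page_id: int) -> dict:
        # Post-delete step (stands in for onPageDeleteComplete): the target can
        # no longer be looked up, so fall back to the cached value. pop() also
        # keeps the cache from growing.
        event = {"page_id": page_id, "kind": "delete"}  # simplified, assumed shape
        target = self._targets.pop(page_id, None)
        if target is not None:
            event["redirect_page_link"] = target
        return event


cache = RedirectTargetCache()
cache.on_article_delete(2, "Page_A")        # Page B (id 2) redirects to Page A
print(cache.on_page_delete_complete(2))     # event carries the cached target
```

With the target on the delete event, a consumer would only need to remove that one redirect from Page A's document instead of treating every removed internal link as a possible redirect.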
[15:01:25] pfischer: the redirect array is one of the strongest ranking signals, then when highlighting, if there is a redirect match the UI will show something like "Albert Einstein (redirect from Chasing a light beam)"
[15:02:17] similarly with incoming_links, that's a strong ranking signal used during the 2nd ranking phase
[15:31:50] pfischer: i suppose i hadn't read the whole part, regarding code i suppose there are two different things to look at. There is the classical ranking which is profile based, look at profiles/FullTextQueryBuilderProfiles.config.php along with the related FullTextQueryBuilder
[15:32:44] but that's only on smaller wikis, the largest wikis have the ML model which is generally found in the gitlab mjolnir repository. Because it's ml there isn't a specific place to point to and say that's where we make redirects important, rather that's found in analysis of the resulting models, which i could probably find the historical reports for if you are interested
[15:34:12] * ebernhardson now wonders if much has improved in terms of explainability of decision trees in the years since we ran that analysis
[15:44:01] going to my son's play, back in ~1h
[16:52:55] back
[17:29:03] dcausse not that you probably care, but I was wrong about Nomad namespaces, they are now part of the open source version https://developer.hashicorp.com/nomad/docs/commands/namespace
[17:55:01] meh...was trying to split our text content into paragraphs to feed into an embedding model, but the round trip through html to plain text stripped all the newlines out of the stored text so there isn't a simple way to do it :(
[21:06:52] ebernhardson: Thank you for your explanation. So we better make sure we do not leak too many redirects.
[21:07:26] ebernhardson: What content did you try to split? Wiki page content?
[23:21:51] pfischer: yes, playing with vespa on one of our labs instances, the goal was to get paragraphs to turn into vectors. Being that i'm just toying with things i went for something we already have broken out, essentially using "{title} {heading}" and "{title} {category}" as the "paragraphs" to vectorize
[23:23:25] it's been ingesting for a long time though :( i suspect i should have kicked up its heap size from the defaults
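(editor's note) The "{title} {heading}" / "{title} {category}" workaround from 23:21:51 could be sketched as below. This only builds the strings that would be handed to an embedding model; the function name and the sample inputs are made up for illustration, and nothing here reflects the actual Vespa setup or document schema.

```python
from typing import Iterable, List


def pseudo_paragraphs(title: str, headings: Iterable[str], categories: Iterable[str]) -> List[str]:
    """Build stand-in "paragraphs" from fields that are already stored separately."""
    # One text per heading, prefixed with the page title for context.
    texts = [f"{title} {heading}" for heading in headings]
    # One text per category, built the same way.
    texts += [f"{title} {category}" for category in categories]
    return texts


# Hypothetical sample page
for text in pseudo_paragraphs("Albert Einstein", ["Early life"], ["Physicists"]):
    print(text)
```

This sidesteps the lost newlines because headings and categories are already separate fields, at the cost of embedding much shorter texts than real paragraphs.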