[16:06:23] Krinkle: a few days ago you were testing something on testwiki with delete/restore against a page that was created in the middle. This happened to expose a problem we have in our new updater, but leaves the question of what the right behaviour is there. If you have a chance could you weigh in on T351411?
[16:06:23] T351411: Undelete of page with same title leads to unexpected results - https://phabricator.wikimedia.org/T351411
[16:15:36] ebernhardson: I'm not sure I see the problem. I think undelete could be labelled or explained better, but this all looks fine as far as I'm concerned. It definitely did have an effect. Undeleting the old page's revisions did exactly that.
[16:16:36] The latest revision remains the latest. It's history that is being restored. The UI has a list of checkboxes even.
[16:16:58] Krinkle: I tend to agree; critically, the undeleted revision was added to the page history of the new page, so it did something. Is that the correct behaviour? I suppose the most surprising thing to me here is that the rev_id of a page is not constant
[16:17:21] are there other cases where that's true?
[16:17:25] I'll comment as well but just checking what I can best focus on in my comment
[16:17:29] err, the page_id of a rev I mean
[16:18:10] Ah I see what you mean. Yes, this is essentially what it means to merge two pages.
[16:18:17] Special:MergeHistory
[16:18:55] Delete, move (or recreate), undelete
[16:19:06] That sequence is the OG way to merge history
[16:19:14] ok, interesting. I guess I've never seen that
[16:19:47] Imagine someone wants to rename a Wikipedia article
[16:19:58] They copy into a new page and make the old one a redirect
[16:20:20] That's not how you're supposed to do it of course, we have a move page feature
[16:20:26] sure
[16:20:39] But some people do, esp. when not logged in and without the move right
[16:20:50] This allows the community to rectify such situations
[16:20:58] oh interesting, yeah that makes sense
[16:21:58] page IDs are stable within the life of a page afaik, at least in the last 10 years of MW
[16:22:18] But yeah, something that was deleted could be imported or restored into a new page
[16:22:33] Import also backdates new rev IDs
[16:23:06] So you'll have page latest = rev id 10, with history going like 1, 2, 3, 100, 102, 105, 10
[16:23:21] XML import, that is
[16:23:28] ok, I suppose we'll have to consider the right way to handle those. We wrote some parts assuming that if we get a rev_id and page_id from an event, then when we fetch the rev_id from MediaWiki and the page_id differs, something went wrong. It sounds like we'll need some process to verify what happened there
[16:24:37] Dumps or search?
[16:24:42] this is the search update process
[16:25:48] Hmm, why based on rev id instead of page title, as the search result URL would be?
[16:25:58] Ie refresh to latest
[16:25:58] if this is the only way, we might be able to simply ignore it; there will be another related event, and this restored event only updates the history and not the currently visible content
[16:26:15] Krinkle: the primary key in the search database is page id
[16:26:22] all updates are by page id
[16:27:20] Ok, that feels odd. Does that help with something else?
[16:27:35] it's the way lsearchd was designed, and was imported :P
[16:27:53] Wow, pre elastic?
[16:28:06] at least, afaik. That design of the search architecture was done before I joined the team
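A minimal Python sketch of the updater assumption described at 16:23:28: an event carries both page_id and rev_id, the revision is re-fetched from MediaWiki, and a differing page_id is treated as an anomaly. The names and structure are hypothetical, not the actual search updater code; a delete/recreate/undelete history merge is exactly the case where this check fires even though nothing went wrong.

```python
class FetchFailure(Exception):
    """Revision could not be fetched (deleted, suppressed, or missing)."""

class PageIdMismatch(Exception):
    """Revision exists but now belongs to a different page than the event said."""

def build_update(event, fetch_revision):
    """event: {"page_id": int, "rev_id": int}
    fetch_revision: callable returning {"page_id": int, "text": str} or None."""
    rev = fetch_revision(event["rev_id"])
    if rev is None:
        raise FetchFailure(f"rev {event['rev_id']} is not visible")
    if rev["page_id"] != event["page_id"]:
        # A delete -> recreate -> undelete sequence merges the old revisions
        # into the surviving page, so a revision's page_id can legitimately
        # change. This is the situation discussed in T351411.
        raise PageIdMismatch(
            f"rev {event['rev_id']}: event said page {event['page_id']}, "
            f"MediaWiki says page {rev['page_id']}")
    return {"page_id": event["page_id"], "text": rev["text"]}
```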
[16:28:14] Right
[16:29:10] If you run by rev ID, does that mean you're trying to get the past parser output when the stream is lagged a bit?
[16:29:33] Ie when you're multiple edits behind on the same page
[16:30:19] if multiple edits happen within ~5 minutes they get combined into a single event, with the most recent winning (needed to reduce the edit rate on wikidata), but yes, it's possible for it to request an older rev_id, which gets updated, and then as it catches up it would do the newer rev_id
[16:31:15] which I don't find strictly necessary, but my teammates found it helpful
[16:32:05] there might be a reason in the design doc (https://docs.google.com/document/d/17tY05WoaT_BloTzaIncR939k3hvhcVQ-E-8DBjo284E/edit#heading=h.2uz7a37z9yhl) but I'll need to review a bit more
[16:33:00] ok, I think that can easily cause bugs, because if it's asking the API for search data from a parser output of a non-current revision (i.e. by rev ID instead of page ID or title), then you're quite often asking MW to parse an old revision instead of fetching from the parser cache (given we generally only cache 1 revision), which is not only slower than it needs to be, but also produces non-canonical results. You may very well be perpetually indexing information that was never presented to a real person, since MW, to a first approximation, does not support a way to retroactively parse what a page used to look like. Probably 99% fine in practice of course, but I'm sure you've come across cases where an old history page no longer renders correctly when someone changed how a template worked.
[16:34:06] MW generally designed refresh-links updates around titles, and each update refreshes whatever the "current" data is. Updates deduplicate earlier in the queue, or no-op at runtime, if the touch time they were queued for is older than page_links_updated by the time they execute.
[16:34:26] trying to catch up seems like it'd be wasteful and would put pressure on something that isn't officially supported.
[16:34:55] but, I also lack the 12-month context, so --- you know, do what you have to do.
[16:35:53] this is all quite helpful, thanks! I'll bring this info back and we will ponder it. I hadn't considered the effects on the parser cache of the old-revision fetches, but that does seem like a strong enough issue to consider a change
[16:37:16] ebernhardson: the issue you ran into with the updater, do you need something specific for that, or was that something you worked out internally already? i.e. can I help with something in the API response, or with how to interpret it to get the right outcome for you?
[16:37:52] Krinkle: for the moment what we've done is redirect those events into what we call the "fetch failure" topic, which is essentially a dead letter queue
[16:38:10] but was considering if we need to do anything else, will have to ponder
[16:38:43] Krinkle: I think the main question was whether this was the right behaviour; Peter and I were coming to different conclusions by looking at the outputs, so I was curious for a third opinion
[16:39:17] ack :)
[16:39:20] it seems like yes, MediaWiki is doing a sane thing; this is equivalent to merging page histories, and it should be expected
[21:32:46] history split/merge is one of those things that probably shouldn't exist but has ossified into community practice too much by now to be realistically removable
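A rough Python sketch of the pattern Krinkle describes at 16:33:00 and 16:34:06: update by page (always rendering the current revision, so the parser cache can be reused) and let stale queued updates no-op, the way MW refresh-links updates skip work when their queued touch time is older than page_links_updated. All names here are illustrative assumptions, not the real search pipeline.

```python
class SearchIndex:
    """Hypothetical stand-in for a search backend keyed by page_id."""

    def __init__(self):
        self.docs = {}        # page_id -> indexed document
        self.applied_at = {}  # page_id -> touch time of last applied update

    def apply(self, page_id, queued_at, fetch_current):
        # No-op if an equally new or newer update was already applied for
        # this page; the analogue of comparing a queued job's touch time
        # against page_links_updated.
        if self.applied_at.get(page_id, 0.0) >= queued_at:
            return False
        # Always index the page's *current* content rather than whatever
        # rev_id the triggering event happened to mention.
        self.docs[page_id] = fetch_current(page_id)
        self.applied_at[page_id] = queued_at
        return True
```

With this shape, an out-of-order or duplicated event degrades into a cheap no-op instead of a parse of an old revision.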
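And a small sketch of the "fetch failure" routing mentioned at 16:37:52, where events that cannot be applied are diverted to a dead letter topic rather than blocking the pipeline. The topic name and producer interface are made up for illustration; build_update is any callable that turns an event into a document (for example, the earlier sketch with its fetch function bound).

```python
def process_event(event, build_update, index, producer,
                  dlq_topic="search.update.fetch-failure"):  # topic name is invented
    """Apply one update event, diverting failures to a dead letter queue."""
    try:
        doc = build_update(event)
    except Exception as err:
        # e.g. the revision vanished, or its page_id no longer matches the event
        producer.send(dlq_topic, {"event": event, "error": repr(err)})
        return None
    index[doc["page_id"]] = doc
    return doc
```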