[16:30:00] I'm reviewing an issue where, when we extract metadata from a re-uploaded file, we end up sourcing the old metadata rather than the new metadata. I'm assuming this is a master vs slave issue (without particularly strong proof). I guess my question is: the last time I fixed a similar issue by using READ_LATEST, the suggestion was that READ_LATEST is improper and we should
[16:30:02] use ChronologyProtector instead. Can I use ChronologyProtector from jobs?
[17:07:42] ebernhardson: Which job is this? Is the query within the Job or another POST-specific code path?
[17:08:41] The EventBus job runner was deployed with CP turned off. However, there is no reason why it couldn't work within a specific job.
[17:10:02] There are a few jobs that do something similar, although the ones I see in Codesearch do something much simpler, which is to connect to the primary and then wait for replication. Gets the job done I suppose, and takes less than 0.1s in practice.
[17:10:56] But one could write the CP position to a job parameter and ask the LB to waitFor that before calling getConnection. I don't think anything currently does that though.
[17:11:00] Krinkle: These would be the CirrusSearch ElasticaWrite and LinksUpdate jobs.
[17:11:25] Krinkle: ElasticaWrite jobs are created by the LinksUpdate jobs, and the LinksUpdate jobs are triggered by the core LinksUpdate jobs.
[17:11:32] ebernhardson: extract from the media file... That means opening the file from Swift, right?
[17:11:50] Krinkle: it can, although that's not the part I'm having trouble with. It's the metadata that gets stored in the image table.
[17:12:31] The short story here is that some files had 0x0 dimensions, an editor went through and fixed them (not clear how), but the re-uploaded files still get 0x0 dimensions. Forcing a refresh gets the new dimensions, hence the assumption that it's a race condition.
[17:13:21] By 0x0, I mean as stored in the search index, based on content sourced through core FileContentHandler::getDataForSearchIndex.
[17:13:59] By our metrics it's common for jobs to run within a few hundred ms of being enqueued, so that also makes a race condition seem plausible.
[17:14:52] I could set these up to simply wait, although since this runs hundreds of jobs per second I might have to estimate whether that will change the load on those servers.
[17:17:42] Afaik LinksUpdate itself performs the same trick already and waits for replication up to the READ_LATEST state before the bulk of the update/job runs.
[17:20:17] Krinkle: hmm, is that the transaction ticket bit?
[17:20:31] So if I understand correctly, the hypothesis is that on reupload, we save a new file to Swift, we extract (directly or via a job?) new metadata and save it to MySQL. Then we queue a links update, which queues a cirrus update, which reads stale metadata from MySQL.
[17:20:54] Krinkle: that's my current theory, but indeed I've been unable to reproduce it directly.
[17:21:36] ebernhardson: does this information become part of a ContentHandler or ParserOutput object at some point?
[17:22:41] Krinkle: I was digging through how file uploads work late yesterday, but sadly I can't answer that question yet. I think the metadata updates are being done as part of the main upload request (unexpected), but I'm not 100% sure.
[17:23:02] unexpected?
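To make the pattern from 17:10:02 concrete (wait for replication in the job before reading, rather than forcing READ_LATEST), here is a minimal sketch. The job class, job type name, and the image-table read are assumptions for illustration, not CirrusSearch's actual code; the load-balancer calls are the standard MediaWiki APIs discussed above.

```php
<?php
// A minimal sketch, assuming a hypothetical RefreshFileMetadataJob; it only
// illustrates the "wait for replication before reading" pattern.

use MediaWiki\MediaWikiServices;

class RefreshFileMetadataJob extends Job {
	public function __construct( Title $title, array $params = [] ) {
		parent::__construct( 'refreshFileMetadata', $title, $params );
	}

	public function run() {
		$lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();

		// Block until replicas have caught up with the primary (per the log,
		// well under a second in practice), so the read below sees the
		// metadata written by the upload request, without READ_LATEST.
		$lbFactory->waitForReplication( [ 'timeout' => 10 ] );

		$dbr = $lbFactory->getMainLB()->getConnection( DB_REPLICA );
		$row = $dbr->newSelectQueryBuilder()
			->select( [ 'img_width', 'img_height', 'img_metadata' ] )
			->from( 'image' )
			->where( [ 'img_name' => $this->getTitle()->getDBkey() ] )
			->caller( __METHOD__ )
			->fetchRow();

		// ... build and send the search index document from $row ...
		return true;
	}
}
```

The alternative mentioned at 17:10:56, serializing the primary position into a job parameter and passing it to the load balancer's waitFor() before getConnection(), would avoid waiting longer than needed, but per the log nothing currently does that.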
[17:23:29] I suppose I expected metadata extraction to happen later as an async update.
[17:24:10] Metadata generally seems to be extracted and stored via LocalFile::recordUpload3.
[17:25:13] I think it's generally seen as cheap if-and-only-if the file is on the local server in a temp dir, which it would be during upload. It'd also make sure that the next page view is complete right after upload. Not unlike how during an edit we perform LinksUpdate (to save what we already have in memory from the parsed output).
[17:26:40] That makes sense I suppose; the file is already around and we are rarely dealing with giant files.
[17:27:17] I might have to poke around in this some more though... if the core LinksUpdate is already ensuring replicas have caught up, then it's unlikely to be a race condition.
[17:35:32] ebernhardson: RE: ticket, no, waitFor usually happens via waitForPrimaryPos or commitAndWaitForReplication.
[17:35:43] ebernhardson: I think I'm wrong about RefreshLinks waiting before running the update.
[17:36:05] LinksUpdate does this, which is the post-edit deferred update, not the RefreshLinks job.
[17:36:41] RefreshLinks calls into that, but I don't know in which order that goes; it certainly makes it less likely still, indeed.
[17:37:13] https://performance.wikimedia.org/arclamp/svgs/daily/2023-08-14.excimer-wall.RunSingleJob.svgz?x=1096.5&y=805&s=waitFor
[17:47:08] Oh, actually I'm mistaken: while for most pages our cirrus LinksUpdate job comes from the LinksUpdateComplete hook, we have a one-off that attaches to UploadComplete and creates the same jobs (not sure why, aren't all uploads also new revisions?).
[17:48:28] UploadComplete gets fired by UploadBase::performUpload, so most likely in-process of the upload request.
[18:01:26] ebernhardson: $this->jobQueue->push instead of lazyPush() means it happens directly indeed.
[18:02:14] And media reuploads are definitely toward the slower end of web responses, given multi-DC Swift uploads and non-trivial DB writes (the image/oldimage cross-table row move etc.).
[18:03:11] Yeah, seems like at a bare minimum this needs to switch to lazyPush.
[18:03:15] ebernhardson: I think it's true that reuploads insert a revision, but not a content change. It's a dummy revision akin to page protection.
[18:03:32] I don't know off-hand whether those trigger the same LinksUpdate or not.
[18:03:51] Ahh, that could make sense.
[18:10:23] That seems worth fixing if not, though, because the parser can vary on file information and protection status.
[18:10:31] I'm pretty sure at least for protection this is taken care of.
[18:11:59] Although both Upload and ProtectionForm use newNullRevision eventually, and from a quick look in WikiPage it appears the call to getDerivedDataUpdater (which takes care of LinksUpdate) happens at a higher level than the low-level methods that newNullRevision uses, e.g. doUserEditContent, which it seems like protection and upload don't call. So if protect triggers updates, it's likely due to something else it's doing.
[18:12:05] * Krinkle reproduces locally with debug toolbar
[18:24:33] Seems like I might have enough to go on here to get this resolved now, thanks!
[18:25:30] ebernhardson: I continued in #mediawiki-core by accident. https://wm-bot.wmcloud.org/browser/index.php?start=08%2F15%2F2023%2012:00:00&end=08%2F15%2F2023&display=%23mediawiki-core
[18:26:36] I suggest joining, given the perf team no longer exists officially. As a value stream we may keep this channel, TBD.
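To illustrate the lazyPush() switch discussed at 18:01-18:03, here is a minimal sketch of an UploadComplete handler. The handler class, job type name, and job parameters are assumptions for illustration, not taken from CirrusSearch; only the hook point and the push()/lazyPush() distinction come from the log above.

```php
<?php
// A minimal sketch, assuming a hypothetical UploadCompleteHandler and assumed
// job parameters. push() enqueues immediately, in-process of the (already slow)
// upload request; lazyPush() buffers the job and pushes it from a deferred
// update after the response, by which point the upload's DB writes have
// committed.

use MediaWiki\MediaWikiServices;

class UploadCompleteHandler {
	/**
	 * Called via the UploadComplete hook, fired from UploadBase::performUpload().
	 * @param UploadBase $uploadBase
	 */
	public static function onUploadComplete( $uploadBase ) {
		$title = $uploadBase->getTitle();
		if ( !$title ) {
			return;
		}

		$job = new JobSpecification(
			'cirrusSearchLinksUpdate', // assumed job type name
			[ 'addedLinks' => [], 'removedLinks' => [] ],
			[],
			$title
		);

		// lazyPush() instead of push(): defer the enqueue until after the
		// upload request has finished, reducing the window for a job runner
		// to read the image row before replication has caught up.
		MediaWikiServices::getInstance()->getJobQueueGroup()->lazyPush( $job );
	}
}
```

As noted in the log, deferring the enqueue narrows the race window but does not by itself guarantee the replicas are caught up; the wait-for-replication approach sketched earlier covers that side.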
[18:28:39] Ahh, I suppose I knew that, but when I looked up https://www.mediawiki.org/wiki/Wikimedia_Performance_Team I was under the impression that maybe the change hadn't been finalized yet.
[18:31:50] Indeed, for some definitions of finalized it hasn't. It's a slow transition.