[10:07:11] lunch
[12:44:11] dcausse: thank you for the weighted tags review. Would you want two streams (batch + real-time) right from the start or should we start with a single stream for now?
[12:46:46] o/
[12:46:50] pfischer: I'd start with a single stream to cover the remaining writes we perform from CirrusSearch
[12:46:55] o/
[12:47:52] dcausse: alright.
[12:48:25] supporting bulk writes would be for imagesuggestions, but this one is not relying on CirrusSearch so it can be done later, giving us time to ponder whether a separate (rate-limited) stream is necessary
[12:49:56] we seem to have https://grafana-rw.wikimedia.org/d/000000250/elasticsearch-percentiles-beta?orgId=1 which is not working and hasn't changed since 2018, I guess we could delete it
[13:30:32] \o
[13:31:05] o/
[13:33:24] logspam of T370770 since last friday...
[13:33:25] T370770: Error: Call to a member function audienceCan() on null - https://phabricator.wikimedia.org/T370770
[13:34:07] hmm
[13:35:23] not sure how we can see that... we get rev ids from mGoodRevIDs which is supposedly verified with a sql query
[13:35:54] yea, a quick look through makes it seem like an api regression, but would probably have to reproduce to see where exactly
[13:36:56] o/
[13:37:21] could mw hit different mariadb replicas during the same process?
[13:38:29] hmm, sadly our loadbalancing has changed enough over the years that i'm not entirely sure. I feel like it used to choose a replica and stick with it
[13:38:55] could these be related to wikidata and them coming through earlier now?
[13:39:54] that would lean toward hitting different replicas perhaps, although 10s of lag is still more than they usually see
[13:40:44] i guess s4 is wikidata, and it has been showing bits poking up past 10s: https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1&viewPanel=48
[13:41:35] dcausse: ottomata: while consuming the weighted tags I noticed we depend on wiki_id, but that is neither part of the page fragment nor of the meta fragment. There is only meta.domain. Is the mapping from domain to wiki_id ambiguous, or could we use a LUT instead of passing a potentially redundant property?
[13:41:37] yes it's almost all from wikidatawiki indeed and seems to line up with the deployment of the wikidata optimization
[13:42:33] pfischer: we can do a LUT for domain->wikiid, i'm sure i've done that elsewhere in our analytics bits
[13:42:33] pfischer: that's strange, we should def pull the fragment that declares wiki_id imo
[13:42:40] heh :)
[13:43:06] yes we could do domain -> wikiid but it seems easier to make it a required property imo :)
[13:43:12] nope, the wiki_id comes with the page entity and we do not include that
[13:43:13] sure
[13:47:04] as for wikidata, reproducing a problem with replica lag would be tedious... hmm
[13:49:10] yes...
[13:49:56] RevisionStore seems to have a fallback to run with READ_LATEST if it's not found...
[13:54:27] can set up a replica in mwcli, i think i can also force it to lag. might be able to reproduce locally
[13:55:55] docs claim it's a one line sql command to set the source_delay on a replication source
[14:37:08] fwiw, on my local dev with a 5 minute slave lag it can create an entity and immediately fetch the cirrusdoc without (obvious) issues
[14:40:37] oh, actually maybe not...
[14:40:48] but in the bad cases it gets a 'badrevids'
[14:42:20] yes, that's what I would expect to see when being faster than db replication
[14:42:47] yea
[14:50:13] looping a few hundred times doesn't seem to tease out any different responses
[14:52:26] but maybe as you suggested earlier, it requires multiple slaves with different lag
[14:53:03] i guess i can try against testwikidatawiki, but unsure if i should be creating hundreds of duplicate items there
[14:53:24] i guess i can rewrite this to edit, but with a new entity i can spam the same request over and over :P
[15:07:36] yes, seems hard to reproduce :/
[15:08:51] if we're seeing different replicas then why wouldn't the fallback to the master work at https://gerrit.wikimedia.org/g/mediawiki/core/+/cf36ccf3c4529eebbbc9023e4e44d14dc9d2dfa6/includes/Revision/RevisionStore.php#2321
[15:09:32] yea test wikidata doesn't like me, gives 403 forbidden
[15:09:42] gave it a user agent, but it wants something more
[15:09:45] :)
[15:11:12] seems trappy to have services hitting different replicas
[15:13:43] indeed, lucas might be onto something that normally it does, but we found some edge case
[15:15:24] but even so, as you said, RevisionStore should have fallen back to master
[15:21:18] possibly $this->loadBalancer->hasOrMadeRecentPrimaryChanges() is still an approximation
[15:21:44] dcausse: i was just looking at that too :) The default limit is 10s, and lag is sometimes higher
[15:22:54] perhaps we should not trust ApiBase and set badrevids ourselves? or we could forcibly use READ_LATEST, but that seems not great
[15:24:24] well... Timo and Aaron might certainly know what to do here :)
[15:24:28] dcausse: hmm, maybe we could define our own second-layer check on recent primary changes and go to READ_LATEST if it's within a large window?
[15:24:31] yea
[15:25:20] on the other hand, i suppose i would be a bit surprised if wikidata didn't have a db change in the last 10s... but that's only a wild guess
[15:26:26] yes... I don't remember seeing this many replag issues with the wdqs updater, but it's not the same API so hard to compare
[15:27:33] dcausse: hasOrMadeRecentPrimaryChanges refers to changes in the same process, e.g. a POST request saving an edit.
[15:28:01] Generally speaking there is no fallback to primary db, or else bogus urls would add a lot of primary db traffic from otherwise idempotent GET requests.
[15:28:21] generally speaking it should not be possible to know about something in a read context unless it's already available in read context.
[15:28:56] ahh, should have read deeper into the lb code. that makes sense
[15:29:36] Krinkle: thanks!
[15:29:43] in terms of your own GET requests after POST, that's what ChronologyProtector is for, it makes sure to pick a replica that's caught up to your own edits and/or waits until there is one.
[15:30:10] this is actually about receiving revid's from event streams, ApiPageSet declaring them "good" rev ids, but RevisionStore not finding them
[15:30:34] could they be recently deleted?
[15:30:43] i.e. do they exist now?
[15:30:47] yes
[15:30:58] k, gotta go, back in an hour
[15:31:36] yea, repeating api req's from the logstash output finds the requests work now
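(A minimal sketch of that kind of replay, assuming a wikidatawiki request against action=query&prop=cirrusdoc; the endpoint, rev ids and user agent below are placeholders rather than values taken from the log.)

import requests

# Replay one of the failing requests; substitute the rev ids from the logged
# error. If replication has caught up, the ids should no longer appear under
# query.badrevids. A real User-Agent helps avoid the 403s mentioned above.
API = "https://www.wikidata.org/w/api.php"
rev_ids = ["2200000000"]  # placeholder rev ids from the logged request

resp = requests.get(
    API,
    params={
        "action": "query",
        "prop": "cirrusdoc",
        "revids": "|".join(rev_ids),
        "format": "json",
    },
    headers={"User-Agent": "replag-repro (search-platform test script)"},
    timeout=30,
)
bad = resp.json().get("query", {}).get("badrevids", {})
print("badrevids:", sorted(bad) or "none, all rev ids resolve now")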
[15:31:37] but yeah we prefer sleep() in php (via ChronoProt) over READ_LATEST in non-job/non-POST, that's how much we avoid it :)
[15:32:25] so perhaps adding an artificial delay of ~10s on the consumer side may help reduce this
[15:32:45] chronology protector data is no longer available from the event streams
[15:33:56] but that means we should expect null from RevisionStore, so a fix is required anyways in that cirrus api
[15:33:57] an artificial delay would at least make it happen less, but no guarantees
[15:34:04] indeed
[15:34:45] given the logspam I'd be for a small delay to let more replicas have the data
[15:35:06] seems reasonable, probably delay on the updater side?
[15:35:32] yes, not sure which part tho, consumer or producer?
[15:35:33] could sleep on the php side instead, but that would risk consuming a bunch of php-fpm workers with our volume
[15:36:13] yes, probably not ideal in php
[15:37:02] hmm, the producer could specialize to only delaying the ones that skip merge windows, without much duplication
[15:37:09] sure
[15:38:09] * ebernhardson is reminded he needs to look into the parser caching issue, to also consume fewer php-fpm workers
[15:38:49] i thought we had a ticket but not seeing it on the board
[15:40:43] indeed, discussion about this only happened over irc
[15:40:51] i'll make a ticket then
[15:41:00] s/ticket/task/
[16:46:57] lunch, heading into office... back in ~1h
[16:54:48] dinner
[17:32:02] back
[18:08:50] pfischer: we went with wiki_id because it was more canonical. meta.domain semantics are a little undefined, but it is often used as a dns domain name. But domain names can vary even within the same wiki, e.g. en.m.wikipedia.org is still enwiki
[18:09:09] database name generally maps to wiki_id, but not necessarily
[18:25:44] * ebernhardson wasn't aware there were wiki_id <-> dbname mismatches, it may have changed but there was a time when they were used interchangeably
[18:29:55] for the Search Platform team folks, in case Phabricator pings get buried in your email: request in https://phabricator.wikimedia.org/T370661#10008073 for your review
[18:31:36] gehel: 5' for 1:1
[18:58:54] pyspark has the best error messages: py4j.Py4JException: Method or([class java.lang.Integer]) does not exist
[19:00:42] lol
[19:02:39] dr0ptp4kt: initial skim of the decision record looks great, very thorough. I'll do another pass later to see if there's any SRE-specific stuff that should be added, but for the time being I think it does a good job summarizing the major challenges & tradeoffs
[19:03:15] I also took a read over it, seems reasonable and covers what we've talked about
[19:03:27] feels lawyer-ly, whether that's good or bad depends on the recipient :P
[19:05:34] dr0ptp4kt: i don't know if it's worth mentioning, but i see that elastica did finally add elasticsearch 8 support at the end of may 2024, while the elasticsearch 8 server was released in feb 2022
[19:06:10] they also specifically said they will not be adding support for opensearch here: https://github.com/ruflin/Elastica/issues/2012#issuecomment-976469235
[19:10:44] unrelatedly, it turns out the or() failure is because, surprisingly, `a & b > 0` is `(a & b) > 0` and not `a & (b > 0)`
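(For reference, a minimal PySpark illustration of the precedence pitfall behind that or() error; the dataframe and column names here are made up.)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 0), (0, 3)], ["a", "b"])

# Python's bitwise operators bind tighter than comparisons, so this parses as
# (col("a") & col("b")) > 0 rather than col("a") & (col("b") > 0).
ambiguous = F.col("a") & F.col("b") > 0

# With | the stray integer ends up as the argument to the JVM Column.or(),
# which expects a Column, producing:
#   py4j.Py4JException: Method or([class java.lang.Integer]) does not exist
#   F.col("a") > 0 | F.col("b") > 0   # evaluates 0 | F.col("b") first
# Parenthesizing every comparison avoids both surprises.
safe = (F.col("a") > 0) | (F.col("b") > 0)
df.filter(safe).show()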
[19:13:49] which then gives some surprising numbers; if correct, this says for the given hour there were 398k sessions with a fulltext serp, and 371k of those sessions saw at least one result. Perhaps we've only looked at zrr in isolation before and not per-session?
[19:15:32] gives a per-session zrr of ~6.6%
[19:17:15] and abandonment of ~50.9%, half of sessions that see a result don't click through
[19:41:47] ty ryankemper && ebernhardson . i made a couple small edits to capture the additional feedback and to point out the elasticsearch 8 support by elastica more explicitly - https://www.mediawiki.org/w/index.php?title=Draft%3AABaso_%28WMF%29%2FWikimedia_Search_Platform%2FDecision_Records%2FSearch_backend_replacement_technology&diff=6664650&oldid=6664377
[20:27:20] * ebernhardson wonders how we can have ~898k autocomplete submits that don't go direct to a page, but only 674k serp's
[20:27:42] i guess some are the middle ground of "typed a full title manually"
[20:30:50] ebernhardson: I've always looked at ZRR per query. While I'm sure some people give up if their query fails, I assume people are willing to correct their own egregious (i.e., uncorrectable) typos.
[20:31:24] Trey314159: it seems plausible, but i also noticed in this we only see ~1.7 queries per session, and the per-query zrr is ~10.5%
[20:31:37] at least in this data, it could still be wrong :P
[20:31:39] 6.6% per-session ZRR is not outrageous, but higher than I would have thought
[20:31:59] Wait, 10% ZRR per query? Over what wiki(s)?
[20:32:05] Trey314159: all wikis, single hour
[20:32:28] Does that include autocomplete in the denominator? That's quite reasonable
[20:32:33] Trey314159: this is limited to the search satisfaction events though, so data collected via desktop web
[20:32:37] Trey314159: nope, just fulltext serp's
[20:33:09] maybe something is wrong with my aggregation, usually surprising numbers mean look harder :) I've been reviewing for the past couple hours though
[20:34:04] That seems kinda low. I sometimes filter junk from my samples, but that should make my observed ZRR go down. I'd expect ~20%+ with enwiki dominating.
[20:35:32] ...unless there's some shenanigans going on, like Wikidata with 1% ZRR and huge search volume, or something.
[20:56:15] does seem worth splitting by wikis/languages/etc and see what's related. Will poke some more
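(A rough sketch of a per-session aggregation along those lines, split by wiki. The event.searchsatisfaction table, the partition filter, and the field names event.source, event.action, event.hitsReturned and event.searchSessionId are assumptions based on the SearchSatisfaction schema, and the ZRR/abandonment definitions are one plausible reading of the numbers above, not the exact query that was run.)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed table, partition and field names; adjust to the real schema.
events = spark.table("event.searchsatisfaction").where(
    "year = 2024 AND month = 7 AND day = 22 AND hour = 14"
)
fulltext = events.where(F.col("event.source") == "fulltext")

# One row per (wiki, session): did any fulltext SERP in the session return results?
serps = (
    fulltext.where(F.col("event.action") == "searchResultPage")
    .groupBy("wiki", F.col("event.searchSessionId").alias("session"))
    .agg(F.max((F.col("event.hitsReturned") > 0).cast("int")).alias("saw_results"))
)

# Sessions that clicked through to a result at least once.
clicks = (
    fulltext.where(F.col("event.action") == "click")
    .select("wiki", F.col("event.searchSessionId").alias("session"))
    .distinct()
    .withColumn("clicked", F.lit(1))
)

summary = (
    serps.join(clicks, ["wiki", "session"], "left")
    .groupBy("wiki")
    .agg(
        F.count("*").alias("sessions"),
        # share of sessions where no SERP returned any result
        F.avg(1 - F.col("saw_results")).alias("session_zrr"),
        # share of result-seeing sessions with no click
        F.avg(
            F.when(F.col("saw_results") == 1, 1 - F.coalesce(F.col("clicked"), F.lit(0)))
        ).alias("abandonment"),
    )
)
summary.orderBy(F.desc("sessions")).show(50, truncate=False)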