[02:13:25] ebernhardson: if you're still around, https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Weird_and_unhelpful_autocompleted_search_results
[04:21:55] odd, looking
[04:30:52] enwiki_titlesuggest is suspiciously small, 24k docs
[04:45:33] titlesuggest build in eqiad totally failed, started at 2:33:42 and finished at 2:34:20, compared to yesterday, when it started at 3:09:45 and finished at 4:05:44. probably due to it expecting es7 but talking to an es6 cluster (but no one is querying eqiad, not a huge problem)
[04:49:49] codfw is less clear, enwiki was still rebuilding when issues were reported, so that should have still been on the old index before the new one is promoted
[04:50:30] other indices seem plausible, dewiki has finished its rebuild and has what looks like a normal number of docs
[04:56:12] oh! yesterday's enwiki build failed... it seems what's happening is that the build process in wmf.27 is only valid against es6, and wmf.28 is only valid against es7. We should probably have done something weird like stopping the rebuilds and accepting old completion indices during the switchover, but it should all fix itself over the next couple of hours
[04:56:41] (and we should have realized this index build had to be compatible in both directions, not something we thought about...)
[05:01:11] I'll switch $wgCirrusSearchUseCompletionSuggester to build, which will use prefix search everywhere. Around 8 or 9 am UTC rebuilds should be complete everywhere and we can revert it
[05:04:50] (and the logs are now a bit spammed up with index_not_found errors on enwiki's codfw build today because I restarted it when it probably wasn't necessary)
[05:10:37] one potential problem, autocomplete is cached for 3 hours :S so logged-out users can still get bad results for a while
[05:11:19] ebernhardson: thanks, for reference it's now at https://phabricator.wikimedia.org/T317381
[05:17:08] legoktm: thanks for letting us know again, we try to test these things but always miss a few details...
[05:17:19] :))
[05:17:51] I do think it would be good to send out some kind of announcement, however late, that the upgrade happened and that issues are somewhat expected and should be reported in phab
[05:18:09] also it's a pretty big upgrade!! :D
[05:18:56] there was an announcement, a couple actually, but perhaps we should have sent another as the train was finally hitting prod wikis
[05:19:14] i suppose that was to wikitech and not somewhere end-users would see
[05:20:04] gehel can probably do something once EU starts their friday, i'm terrible at comms :P
[05:20:19] oh, there was
[05:20:26] > We are anticipating a 3-week migration process, starting on August 29 2022.
[05:20:44] too long ago :p
[05:20:55] But a "this happened now!" update would be nice :)
[05:21:01] for us it's three weeks migrating two clusters, but for end users it looks like it all happens over 3 days (or actually 1 hour, because group1 and group2 went within a couple hours of each other)
[05:21:40] * legoktm mentions that on WP:VPT
[05:32:53] ok, I threw a note on https://meta.wikimedia.org/wiki/Tech/News/2022/37
[05:33:02] > The search servers have been upgraded to a new major version. If you notice any issues with searching, please report them in Phabricator. [1]
[05:33:11] thanks!
[05:33:16] please edit/improve/etc. as desired :)
[05:33:42] the deadline for changes is sometime Friday in a US timezone
[07:35:17] compsuggest is still building for codfw but the output is a bit weird, "thwiktionary #01134% done..."
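For reference, a minimal sketch of the kind of doc-count sanity check being discussed above (the 24k-doc enwiki_titlesuggest build versus the roughly 9M docs a healthy build has). This is illustrative Python against the Elasticsearch HTTP API, not the actual CirrusSearch maintenance code (which is PHP, UpdateSuggesterIndex.php); the endpoint, index names and threshold are assumptions.

```python
# Illustrative only: compare a freshly built titlesuggest index against the
# content index and refuse to promote it if the count dropped dramatically.
import requests

ES = "https://search.svc.codfw.wmnet:9243"  # assumed cluster endpoint

content_count = requests.get(f"{ES}/enwiki_content/_count").json()["count"]
suggest_count = requests.get(f"{ES}/enwiki_titlesuggest_1662694183/_count").json()["count"]

# A healthy enwiki titlesuggest build has ~9M docs; the broken one had ~24k.
# Keeping the day-old (but working) index would beat promoting a bad build.
if suggest_count < 0.5 * content_count:  # 0.5 is an arbitrary illustrative threshold
    raise RuntimeError(
        f"suspiciously small build: {suggest_count} docs vs {content_count} in content index"
    )
```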
[07:38:51] enwiki has 9M docs so this seems correct, tho it's now wrong in eqiad...
[07:39:20] I thought that wmf28 would have at least been able to build against es6 :/
[07:46:19] building enwiki_titlesuggest in eqiad is way too fast, something's not working well...
[07:53:43] j
[07:54:31] hmm the weird "#011" is due to syslog apparently (https://bugzilla.redhat.com/show_bug.cgi?id=1517455)
[08:18:09] still not quite working in codfw tho: enwiki index: /enwiki_titlesuggest_1662694183/_doc/10573527t caused no such index [enwiki_titlesuggest_1662694183]
[08:23:18] hm... I suspect the script might have run concurrently, it went from 1662604473 trying to create 1662694183 and then failed with no such index enwiki_titlesuggest_1662694183, but now has "1662697947" (which I don't see in the logs)
[08:27:11] I see a tmux session from Erik... might perhaps be that... going to re-run all this
[08:35:07] I've started a rebuild of the compsuggest indices in codfw
[09:17:55] not quite understanding why, when we rebuild the compsuggest indices, it can report "Indexing 222102 documents" even though track_total_hits is not explicitly set
[09:24:01] ah, looks like it must be set to TRACK_TOTAL_HITS_ACCURATE in scroll requests and thus is the default
[09:51:49] sigh... inc_links count is relying on total_hits...
[09:52:03] it might cap at 10000 as well
[10:06:04] it's building enwiki atm. regarding BC for wmf28 being able to build against es6, I'm not convinced it's worthwhile... but I might be wrong
[10:07:20] dump extraction does not allow passing a cluster so I suppose it'll operate on the good one (codfw)
[10:15:44] lunch
[12:15:47] Errand, back in a few
[12:48:31] Hi! Do we have any docs on creating a new git repo? dcausse: Do we already have one for flink-search-update?
[12:49:39] pfischer: yes we do have a project: https://gerrit.wikimedia.org/g/search/cirrus-streaming-updater
[12:58:51] o/
[13:04:04] Starting on the incident report re: autocomplete, will probably need some help w/ the timeline but will do as much as I can
[13:25:15] hm, the way we loop over the Scroll depends on Elastica's ability to get total.hits as an object... I don't see an easy way to deal with this...
[13:27:41] no way for es68 to have the right format with e.g. rest_total_hits_as_int=false
[13:28:09] and Elastica 7.1.5 does not support responses with rest_total_hits_as_int=true
[13:29:25] will add a step to rebuild completion indices in eqiad once we've finished upgrading to 7 there
[13:36:36] Prefix search is different from compsuggest, right?
[13:39:33] inflatador: yes, it uses the main index, compsuggest uses a separate (optimized) index
[13:40:55] dcausse thanks, trying to understand all this... would it be accurate to say we're in a degraded state when using prefix search as opposed to compsuggest?
[13:41:20] inflatador: yes, that's it
[13:42:30] dcausse understood. And we're still waiting for the compsuggest indices to rebuild, or is that finished?
[13:43:22] yes, Erik ran it yesterday but I've seen errors so I'm rerunning it now out of caution
[13:43:46] it's doing "u*" wikis so it's getting close
[13:44:59] Cool, would that be considered an "in-place reindex"?
[13:45:23] (looking at https://wikitech.wikimedia.org/wiki/Search#In_place_reindex )
[13:47:46] inflatador: not really, it's a different script doing this. it doesn't have a well-known name, we generally say "compsuggest index rebuild"; the script is UpdateSuggesterIndex.php
[13:48:48] sigh... there's no doc about it :/
[13:49:04] No worries, I'll get one up.
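To make the total-hits discussion above concrete, here is a hedged Python sketch of the ES6 to ES7 behaviour change being worked around: in ES7 hits.total becomes an object and is only counted accurately up to 10,000 unless track_total_hits is set (which is what the scroll path apparently gets by default, hence the exact "Indexing 222102 documents" message), while rest_total_hits_as_int=true restores the old integer shape that older Elastica expects. The endpoint and index name are placeholders, and this is plain HTTP rather than the PHP/Elastica code path CirrusSearch actually uses.

```python
# Sketch of the hits.total behaviour change discussed above (not CirrusSearch code).
import requests

ES = "http://localhost:9200"   # placeholder endpoint
INDEX = "enwiki_content"       # placeholder index

# ES7 default: hits.total is an object and counting stops at 10,000.
r = requests.get(f"{ES}/{INDEX}/_search", json={"query": {"match_all": {}}})
print(r.json()["hits"]["total"])   # e.g. {"value": 10000, "relation": "gte"}

# track_total_hits asks for an exact count (at some extra cost).
r = requests.get(f"{ES}/{INDEX}/_search",
                 json={"query": {"match_all": {}}, "track_total_hits": True})
print(r.json()["hits"]["total"])   # e.g. {"value": 6500000, "relation": "eq"}

# rest_total_hits_as_int=true gives the ES6-style plain integer (combined with
# track_total_hits here so the integer is the real count), but Elastica 7.1.5
# can't be told to handle that shape, which is the bind described above.
r = requests.get(f"{ES}/{INDEX}/_search?rest_total_hits_as_int=true",
                 json={"query": {"match_all": {}}, "track_total_hits": True})
print(r.json()["hits"]["total"])   # e.g. 6500000
```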
[13:49:41] Y'all have excellent docs overall, don't feel too badly ;)
[13:50:44] it's a cron job running daily (we rarely run it manually)
[13:53:09] I need to familiarize myself with the maintenance scripts anyway, pretty important
[14:17:11] doing zh wikis, should be done in ~10 mins
[14:46:14] re-enabled completion indices, everything seems fine
[14:47:45] tp50 latencies are not great, 40ms compared to the usual ~10ms, but that's hopefully due to the eqiad <> codfw penalty
[14:48:55] we've failed a bunch of prefixsearch requests on the pool counter... should have seen that coming
[14:51:19] perhaps tuning this poolcounter would have helped, but not sure...
[14:55:58] \o
[14:57:07] also doh, i completely forgot about the poolcounter. Indeed prefix has a tiny poolcounter compared to completion
[15:00:46] o/
[15:01:18] no big deal I guess, it might be hardly visible when you type, I suppose?
[15:03:56] hopefully, it's doing around 1k qps with 30/s rejected, so it should be mostly unnoticeable i would hope
[15:05:03] surprised it manages 1k qps with a poolcounter of only 32 allowed in parallel, naively at 50ms per call i would have expected more like 700 as a cap
[15:05:41] oh i guess it's not 32, because it's cross-cluster that gets bumped to 64
[15:07:31] compsuggest seems *way* slower in es7 btw, eqiad reported a p50 of 10ms, after the switchover the p50 was 100ms. but the round trip is <40ms of additional latency
[15:08:24] seeing 40ms on p50s
[15:09:04] hmm, at https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&viewPanel=52&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 ?
[15:09:21] oh, silly me... that's a 7-day graph
[15:09:33] i thought i was looking at 1 day, and the bit in the middle was the hours we were deployed
[15:09:48] yea, it was 40 :) nothing to worry about there
[15:10:19] :)
[15:14:02] these 99th percentile times always make me a bit sad... but still no clue what to do about them
[15:14:24] 400ms+ on comp suggest seems crazy
[15:14:29] yes :(
[15:20:56] I'll skip unmeeting again tonight, another ERC meeting...
[15:23:32] Incident report draft, I'm still working on it but feel free to edit https://wikitech.wikimedia.org/wiki/Incidents/2022-09-09_Elastic_Autocomplete_Missing?venotify=created
[15:28:28] random guesses at what we need: the search reindex process looks at the old index and the new index, and if the counts are way off it declares failure. Here we built an index with 25k docs instead of 9.5M and the automation said "that looks fine, promote it"
[15:28:54] keeping the old index that was a day out of date would have been better than promoting the new index
[15:30:05] cool, added to https://phabricator.wikimedia.org/T313095
[16:04:26] going offline, have a nice weekend
[16:14:46] hmm, mjolnir feature selection failed this week due to encountering NaNs, which we apparently never got with es6 :S should be fun
[17:26:03] maybe we'll get lucky and some feature definition is no longer valid, will have to poke around
[17:41:54] lunch, back in ~45
[18:30:16] ebernhardson: when do we expect https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/830953/ to take effect? is this something that doesn't get picked up until the next train deploy?
[18:30:33] hmm
[18:30:56] ryankemper: yes, that will either require a train deploy or a backport window
[18:31:47] friday deploys are limited to emergency situations, and i don't suppose this is an emergency
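A quick back-of-the-envelope check of the poolcounter arithmetic from earlier (15:05): with N requests allowed in parallel and roughly 50ms per request, the sustainable throughput ceiling is about N divided by the latency. The numbers below are illustrative only; the real limits live in the wmf-config poolcounter settings.

```python
# Rough throughput ceiling implied by a poolcounter concurrency limit.
def qps_ceiling(concurrency: int, latency_seconds: float) -> float:
    return concurrency / latency_seconds

print(qps_ceiling(32, 0.050))  # 640  -> close to the "more like 700" estimate above
print(qps_ceiling(64, 0.050))  # 1280 -> the cross-cluster bump to 64 explains ~1k qps
```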
[18:34:15] ebernhardson: yup, agreed. so probably the monday US backport window
[18:34:28] sorry, been back
[18:34:32] I'll update that village pump thread
[18:37:38] https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand is the long-form description of why elasticsearch made the change
[18:43:18] as an aside, does anyone know how https://wikitech.wikimedia.org/wiki/Deployments gets updated with new weeks? i.e. the page goes till sep 10 currently, I imagine there's an edit coming in a few days that populates the next 2 weeks of events but not sure
[18:45:25] ryankemper: example would be https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=2008443&oldid=2008325 , there is a followup revision that drops an older week. i don't know if they have any kind of script that does it, or if tyler simply prepares the edits
[18:45:55] it seems like the kind of thing where they'd write a script that you give a date and it emits the wikitext for the week
[18:46:23] or maybe they have a template, i guess we wouldn't see from the revision if it uses a template with subst
[18:46:56] ack
[18:47:08] ebernhardson nice article! Do we know if the performance penalty from using "track_total_hits" will be measurable?
[18:47:23] well for now i'll just set a personal reminder to add https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/830953/ to the monday backport deploy on sunday night
[18:48:27] inflatador: hmm, that kind of thing isn't easy to know for sure, other than by running queries both ways and using enough of them to get a representative sample
[18:51:29] a big conceptual difference is that before, if elasticsearch said there were 2.3M results, it meant it really did touch 2.3M individual items in memory (possibly pulling data off disk) and did at least a basic first-pass scoring run against them
[18:51:43] with WAND it's able to skip things without even doing that scoring pass
[18:55:00] in an example query for `a the`, which has 6.2M results on enwiki_content, i see 1008ms when tracking and 780ms when not tracking. But individual queries like this are hard to extrapolate from because of multiple caching layers that might be pre-populated by the first query. A second invocation with track_total_hits came back in 878ms instead of 1008
[18:55:43] so, something, but not earth-shattering for the kinds of queries we issue
[18:57:54] i suppose randomly useful: our large rescore window_size also makes it slightly less efficient, we use a window size of 8k which means elastic will have to score at least 8k docs per shard (*16 for enwiki) regardless
[18:58:59] if i shrink the window sizes down to 128 we see ~360ms when not tracking, and 869ms when tracking
[18:59:35] slightly more earth-shattering :)
[18:59:54] but also the effect is larger here because this one has 6M results, a query with 400k results wouldn't see such a large change
[19:01:50] * ebernhardson wonders if we should re-evaluate window sizes at some point though, previously we set them a bit high because it seemed to help a bit and didn't have a huge perf impact
[19:02:47] It's a cliché for sure, but the more I know, the more I realize I don't!
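For anyone wanting to reproduce the rough timing comparison above, a schematic sketch of where the two knobs sit in a _search request: track_total_hits (exact counting on or off) and the rescore window_size (8k in production, 128 in the experiment). This is plain Python over HTTP and not the real CirrusSearch query; the endpoint, field names and rescore query are placeholders.

```python
# Schematic: the same full-text query with/without exact hit counting and with
# a large vs small rescore window, reporting Elasticsearch's own "took" time.
import requests

ES = "http://localhost:9200"  # placeholder endpoint

def timed_search(track_total_hits, window_size):
    body = {
        "query": {"match": {"text": "a the"}},   # the ~6.2M-hit example from above
        "track_total_hits": track_total_hits,
        "rescore": {
            "window_size": window_size,           # 8000 in prod, 128 in the experiment
            "query": {"rescore_query": {"match_phrase": {"text": "a the"}}},
        },
        "size": 20,
    }
    r = requests.get(f"{ES}/enwiki_content/_search", json=body).json()
    return r["took"], r["hits"]["total"]

print(timed_search(track_total_hits=True, window_size=8000))
print(timed_search(track_total_hits=False, window_size=128))
```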
[19:04:58] i suppose, if it helps explain some: scoring of full text search on large wikis happens in three passes. the first pass calculates the top 8k results per shard, then those 8k results have their score modified by incoming link counts and page popularity to determine the top 450 results per shard, then those top 450 are run through an ML algo (decision tree) which replaces the earlier score and
[19:05:00] determines the final ordering
[19:11:18] in the literature this would be referred to as telescoping, just a fun word :)
[19:11:58] google clearly does some form of telescoping, the quality of results seems to drop off dramatically from the first page to the second, i doubt the first and the second page receive the same scoring treatment
[19:13:22] yeah, that does help. I remember some of this from the offsite
[19:15:14] ryankemper ebernhardson what's your feeling about rolling out https://gerrit.wikimedia.org/r/c/operations/puppet/+/830923/ today? I know it's not at all related to last night's issue, but I feel a bit wary on a Friday afternoon
[19:16:01] since the minimum_master_nodes flag apparently is a no-op in elastic 7 (except during the 6->7 upgrade), probably no harm in waiting
[19:16:22] i suppose this would also add some master nodes, docs claim that's now an entirely painless operation
[19:35:47] ryankemper any other thoughts? Based on e-bernhardson's feedback and what we saw earlier this week, it seems pretty benign. Otherwise we can work on the elastic-at-a-glance script if you like
[19:38:22] (or everyone's favorite, incident reports) ;P
[19:42:12] inflatador: there's enough other stuff to work on with the incident report, etc. that we might as well wait till monday on the master stuff
[19:42:57] inflatador: i can pair starting in 5-10 mins if that works for you, otherwise i'll grab lunch and pair at our normal time
[19:42:59] ryankemper ACK, we can put it off then
[19:43:13] full disclosure though, my cognition is only going to trend downwards as the house gets hotter :P
[19:43:26] ryankemper OK, let's do now then
[19:43:36] inflatador: cool, I'll be there in 5'
[19:45:13] cool, I'm up at https://meet.google.com/cnw-ftpg-pmd . ebernhardson if you are able to join and QC https://wikitech.wikimedia.org/wiki/Incidents/2022-09-09_Elastic_Autocomplete_Missing that would be good, nbd if not though
[20:38:51] inflatador: ebernhardson: oh one tiny thing on our event timeline, the first item says `8 September 2022 : the 08 September UTC late backport window train rolls out`, but I think it was actually the deploy train where the behavior would have switched, right?
[20:39:09] or more accurately, the deploy train goes out, then the following day's (or maybe that night's) titlesuggest index rebuild goes all wacky
[20:39:32] it was actually the previous day's index build that was bad
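Going back to the three-pass "telescoping" scoring described earlier (19:04), here is a hedged sketch of what that roughly looks like as a query body: a base query, a large per-shard rescore window that mixes in incoming links and popularity, and a final learning-to-rank rescore over a smaller window. This is a schematic outline, not the query CirrusSearch actually builds; field names, weights and the model name are invented, and the last pass uses the Elasticsearch LTR plugin's sltr query purely as an illustration.

```python
# Schematic of the three scoring passes described above (per shard):
#   1. base BM25-style match over the full posting lists, keep top window_size
#   2. rescore those with incoming-link / popularity boosts, keep a smaller set
#   3. rescore that smaller set with a learning-to-rank model (decision trees)
telescoped_query = {
    "query": {"match": {"text": "example search terms"}},          # pass 1
    "rescore": [
        {
            "window_size": 8192,                                    # top ~8k per shard
            "query": {
                "rescore_query": {
                    "function_score": {
                        "functions": [
                            {"field_value_factor": {"field": "incoming_links",
                                                    "modifier": "log1p"}},
                            {"field_value_factor": {"field": "popularity_score",
                                                    "modifier": "log1p"}},
                        ]
                    }
                }
            },
        },
        {
            "window_size": 450,                                     # top 450 per shard
            "query": {
                "rescore_query": {"sltr": {"model": "enwiki_model",  # hypothetical model
                                           "params": {"query_string": "example search terms"}}},
                "query_weight": 0.0,                                # replace, don't blend
                "rescore_query_weight": 1.0,
            },
        },
    ],
    "size": 20,
}
```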
[20:39:56] on wednesday wmf.27 was running for enwiki, it tried to build the index on codfw and failed. When we switched to wmf.28 all traffic went to codfw and read the bad index
[20:41:55] that's actually part of why it took 45min to diagnose, i was looking at the thursday build (which was in-progress) and trying to understand how that would have broken things
[20:43:41] (i suppose technically, in utc those were thursday and friday builds, but they were wednesday and thursday for me :P)
[20:59:42] ack, thanks
[21:06:53] * ryankemper just realized the elasticsearch settings check alert hasn't cleared because he ran the script for psi, but it actually needed to be run for the other two clusters
[21:08:01] ryankemper ebernhardson ACK, let me try and clean up the page
[21:08:04] ran it for the other two, should see the alerts clear
[21:12:44] inflatador: sounds good, I'll make another pass on the page after I get some food
[21:14:38] OK, made some changes. Back in ~15m
[21:42:26] back
[22:03:31] I'm out, have a great weekend all