[07:53:35] pfischer: (for when you're back) could you have a quick look at T397147 ?
[07:53:36] T397147: Update failed, bad old data in the store - https://phabricator.wikimedia.org/T397147
[10:34:44] Hi, is there any wiki that is using db-backed search? I want to drop the searchindex table everywhere but wanted to double check if it's used in any wiki T397367
[10:34:44] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367
[11:19:10] * cormacparle waves
[11:19:15] question about prefixsearch
[11:19:35] how does fuzziness work?
[11:19:38] https://hr.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=info|pageprops|pageimages|description&generator=prefixsearch&gpssearch=Svetog%20Rimskog%20Carstv&gpslimit=10&ppprop=disambiguation&redirects=true&pithumbsize=80&pilimit=10&cirrusDumpQuery
[11:20:11] (I might have asked this same question before, but if I have I've forgotten the answer)
[11:21:57] that's a prefixsearch for "Svetog Rimskog Carstv" in Croatian ... it finds nothing because the phrase is inflected (accusative case maybe?) and the relevant page title is "Sveto Rimsko Carstvo"
[11:22:34] I know a lot of our searching does stemming, but maybe prefixsearch does not?
[11:24:11] hmmm I don't get the result I'm looking for in regular search either https://hr.wikipedia.org/w/index.php?search=Svetog+Rimskog+Carstva&title=Posebno%3ATra%C5%BEi&ns0=1
[11:30:29] It's similar in German - searching for "Heiligen Römischen Reiches" (genitive) only gives a result for "Heiliges Römisches Reich" because the individual words in the genitive case are in the text
[11:30:31] https://de.wikipedia.org/w/index.php?search=Heiligen+R%C3%B6mischen+Reiches&title=Spezial%3ASuche&ns0=1&searchToken=4irz89pr0kq2e82sqoaxfa4xn
[11:30:42] maybe I'm misunderstanding how stemming is supposed to work?
[12:19:32] cormacparle: that sounds like a question for ebernhardson or Trey314159. They should be around later.
[12:21:08] Amir1: I'm pretty sure that we don't use DB backed search anywhere, but confirmation from ebernhardson would be best
[12:54:08] \o
[13:48:41] .o/
[13:48:56] Amir1: nothing should be using db backed full text search
[13:49:37] cormacparle: prefix search doesn't do stemming, it's a strict prefix search. That's why we replaced it with the completion suggester for most things
[13:49:59] the completion suggester also does not do stemming, but it finds things within a levenshtein distance of 2
[13:59:48] I feel somewhat stupid to ask, but what is Glent method 1? I'm always confused by those numbers...
[14:00:15] And of course, the answer was 1 click away: https://phabricator.wikimedia.org/T212889
[14:00:44] gehel: i did rename them :) m0 = session similarity (queries that are similar within the same session). m1 = query similarity (queries that are similar across all queries)
[14:00:56] i guess i could simply leave the numbers out
[14:18:55] ryankemper I've depooled/restarted bg on wdqs1012 due to lag alerts. The lag is below alert thresholds now but it isn't dropping. Just a heads-up as I'm gonna be out for a few hours https://grafana.wikimedia.org/goto/SKpo-vPHg?orgId=1
[14:19:39] in fact, it seems like it's no longer reporting its lag metrics...hmmm
[14:28:49] Thanks!
[14:30:01] catching a bus, back in ~45 assuming my hotspot and/or bus wi-fi works
[14:52:53] Time to start the weekend! It's getting too hot to work anyway.
[14:52:56] Have fun!
[14:59:29] ebernhardson: is completion suggester available on all wikis?
[14:59:41] cormacparle: yes, that's what powers normal autocomplete
[15:01:28] and that's the action api with action=opensearch?
[15:02:06] cormacparle: yup.
Although note that the backend impl can change by user preferences as well
[15:02:30] it's completion suggester by default, but Special:Preferences has options to make that stricter
[15:03:13] there is also a rest api: https://en.wikipedia.org/w/rest.php/v1/search/title?q=exaple&limit=10
[15:11:11] hmmm looks to me like levenshtein distance of 2 or less is not gonna work for the thing this user is raising - Croatian and German will often have 3 characters different
[15:11:26] does normal search do stemming?
[15:11:34] yes, but it doesn't do partial words
[15:12:20] it's also about 20-50x more expensive to run a fulltext query than a prefix/completion suggester query, so we try to avoid doing those searches per-keypress as well
[15:14:00] gotcha - I think it might be ok to do fulltext just on the initial suggestion load rather than on keypress (as a first suggestion when I highlight text in VE)
[15:14:15] but it still doesn't seem to be getting the stemming right, see https://hr.wikipedia.org/w/index.php?search=Svetog+Rimskog+Carstva&title=Posebno%3ATraži&ns0=1
[15:15:04] the corresponding page that I'd hope it would find is https://hr.wikipedia.org/wiki/Sveto_Rimsko_Carstvo
[15:17:19] it does stem, "svetog rimskog carstva" becomes "svet" "rimsk" "carstv"
[15:17:36] WFB...working from bus
[15:18:32] ebernhardson: aaah yes! I can see it in the results, but just not first
[15:19:54] any way we could make the levenshtein distance tunable in the query/config? or maybe make it proportional to the length of the query string?
[15:20:10] so a longer query string would be allowed to be wronger
[15:20:11] cormacparle: only by making it smaller, sadly
[15:20:31] cormacparle: the search complexity goes up massively, it only supports a levenshtein of 1 or 2
[15:24:13] ok gotcha
[15:28:28] the distance thing is puzzling me a little bit
[15:28:29] https://hr.wikipedia.org/w/rest.php/v1/search/title?q=Svetog%20Rimskog&limit=10 returns nothing but to me it looks like it's a distance of 2 from "Sveto Rimsko "
[15:29:17] (one insert and one substitution)
[15:31:49] cormacparle: curiously, dumping the search response we can see the search engine returned it but it was elided somewhere else: https://hr.wikipedia.org/w/rest.php/v1/search/title?q=Svetog_Rimskog&limit=10&cirrusDumpResult
[15:31:57] oh, doh no that's me reading poorly :P
[15:32:16] hmmm also https://en.wikipedia.org/w/rest.php/v1/search/title?q=On%20th3%20roed&limit=10 gives me nothing, when afaics that's 2 from "On the road"
[15:34:23] hmm, these should be long enough to trigger the 2 char (iirc less than 6 characters gets 1 distance)
[15:36:29] Several eqiad wdqs servers are hovering right near the lag alert threshold...checking it out now
[15:37:50] cormacparle: hmm, system is certain it's a distance of 3 but i'm not entirely sure why yet.
[15:41:56] curious...we have it at `AUTO:3,6` which should mean, 0-2 characters is exact match, 3-5 is one edit, >5 is two edits, but `on th3 roed` is only converted to `on the road` if i reduce that to `AUTO:3,5`
[15:42:31] but obviously the length here is ~10
[15:45:07] ryankemper I repooled wdqs1012...all the wdqs-main hosts are under the alert threshold. I'll keep an eye out but I don't think we'll need to do anything at this point
[15:45:22] maybe something in the analysis chain...but we should be very light on changes there
[15:45:54] unless it doesn't count the spaces and the 6 is added to the 3?
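A quick way to sanity-check the distances quoted in the messages above is a plain Levenshtein implementation. This is only a minimal sketch: the function name `levenshtein` is made up for illustration, it compares whole strings rather than prefixes, and it is not the code the completion suggester runs, so it can confirm the hand counts but not explain the suggester's behaviour.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Queries from the log, lowercased, against the intended title text.
    print(levenshtein("on th3 roed", "on the road"))       # 2
    print(levenshtein("svetog rimskog", "sveto rimsko"))   # 2
```

Both examples come out at 2 edits, which matches the manual count in the channel and suggests the mismatch lies in how the allowed fuzziness is resolved rather than in the distance itself.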
[15:46:03] * cormacparle tries to think of a longer book title
[15:46:25] cormacparle: yea i'm thinking it's something along those lines, there is an analysis chain there that tokenizes (and optionally removes stop words), so perhaps the spaces don't count
[15:46:48] we do a query with and without stopword removal, so that shouldn't be causing problems
[15:49:58] it really does seem to be only doing a distance of 1
[15:50:04] https://en.wikipedia.org/w/rest.php/v1/search/title?q=The%20Curious%20Incident%20of%20the%20Dog%20in%20the%20Night-Tino&limit=10 gives nothing
[15:50:11] https://en.wikipedia.org/w/rest.php/v1/search/title?q=The%20Curious%20Incident%20of%20the%20Dog%20in%20the%20Night-Timo&limit=10 works
[15:53:08] hmm, indeed same there if i take the bare query and replace the auto:3,6 with auto:3,5 it finds things...maybe opensearch changed the behaviour in there somewhere?
[15:58:11] weird
[15:59:30] i wonder...i think it's refusing to fix a short word multiple times?
[16:01:18] doesn't seem like it
[16:04:54] yeah this doesn't work https://en.wikipedia.org/w/rest.php/v1/search/title?q=The%20Corious%20Incadent%20of%20the%20Dog%20in%20the%20Night-Time&limit=10
[16:05:14] (changed Curious to Corious, and Incident to Incadent)
[16:08:52] i'm still suspicious that the auto fuzziness is now per-term, but i haven't found a smoking gun yet
[16:09:00] we might just want to set it to always 2 instead of auto
[16:09:56] should be easy enough to throw out a test next week and see if users see much change
[16:10:25] that'd be great
[16:10:27] well, maybe week after, i'll have to review and make sure this is changeable from config directly
[16:11:24] heh ok - I'll ping you in 2 weeks to check in maybe
[16:11:47] sure. If i can change from config then ab test is just a deploy, but i might have to update and ship some cirrus changes.
will see :)
[16:11:55] s/deploy/config change/
[16:12:50] 👍
[16:38:18] randomly curious, the length of a string according to completion suggester: return text == null ? 5 : text.codePointCount(0, text.length()); // 5 avg term length in english
[16:49:33] i have a sneaky suspicion that `AUTO` is always seeing that length of 5 for a null term...
[16:50:41] completion.FuzzyOptions looks like it probably resolves the fuzziness without having the query term to determine the length
[17:28:43] indecisive on how to run the test...the fuzziness is controlled by completion profiles which i can add to, but then those temporary profiles would be offered in the api.
[17:29:08] they might end up in Special:Preferences as well, would have to double check...
[19:22:01] Looks like the `retry_on: gateway-error` envoy setting reduced "upstream connect error" logs from Cirrus by about 10x. Would have hoped for them to go away, but from 500 to 50 per 12 hours is still good
[19:26:25] curiously, they seem to primarily come from mw-api-int. Not sure why the proportions wouldn't be more like the overall query proportions (would see errors from api-ext/web)
[19:33:21] oh i'm a dumb dumb...it's because the mw-api-int ones that are failing are mostly get requests fetching a list of ids, which is the saneitizer which also selects a specific backend and doesn't go through dnsdisc
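The `AUTO:3,6` rule described at 15:41:56, combined with the null-length fallback of 5 quoted at 16:38:18, can be sketched as below. This is an illustration of the suspicion raised in the channel, not a confirmed diagnosis and not the search engine's actual code: `auto_fuzziness` and `term_length` are hypothetical names, and the thresholds follow the "0-2 exact, 3-5 one edit, >5 two edits" reading given in the log.

```python
from typing import Optional


def term_length(text: Optional[str]) -> int:
    """Mirror of the quoted snippet: fall back to 5 when no term is available."""
    # Python's len() on a str counts code points, like codePointCount in the log.
    return 5 if text is None else len(text)


def auto_fuzziness(length: int, low: int = 3, high: int = 6) -> int:
    """Resolve AUTO:low,high to a maximum edit distance for a given length."""
    if length < low:
        return 0  # too short, exact match only
    if length < high:
        return 1  # medium length, one edit allowed
    return 2      # long enough, two edits allowed


if __name__ == "__main__":
    # With the real query text, AUTO:3,6 would allow two edits.
    print(auto_fuzziness(term_length("on th3 roed")))        # 2
    # If the term is missing (None), the length falls back to 5: one edit.
    print(auto_fuzziness(term_length(None)))                 # 1
    # Lowering the upper bound to 5 lets the fallback length reach two edits.
    print(auto_fuzziness(term_length(None), low=3, high=5))  # 2
```

Under these assumptions a missing term would give one edit with `AUTO:3,6` and two edits with `AUTO:3,5`, which is consistent with the queries only matching after the setting was lowered.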