[02:42:06] ^ we'll prob need to ask dc-ops for help
[08:02:27] pfischer: are you around for our 1:1?
[08:03:10] gehel: sorry 3 min
[08:03:15] ack
[09:38:30] errand+lunch
[13:12:22] o/
[13:59:54] \o
[14:13:25] o/
[14:22:41] o/
[14:37:19] if someone has time I have https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseLexemeCirrusSearch/+/1126111 that should be ready to go
[15:00:01] awesome, taking a look
[15:01:28] i wonder if we could come up with a shorter name for `lemmaspellingvariant`
[15:01:37] i mean it's fine, but seems verbose to type out
[15:02:22] maybe a simple `lemma` that searches both? i suppose i'm not entirely sure of the use cases
[15:03:28] alternatively, i wonder if there is a use case of selecting the language when searching variants
[15:05:57] ebernhardson: yes... I've been pondering; the reporter suggested "haslemma" but I found it a bit ambiguous and not clear that this keyword could filter on the lemma spelling variant
[15:08:16] i suppose, it doesn't look like we are storing the language in the index, or is that stored elsewhere?
[15:08:36] the test cases at least suggest it's simply the lemma variant strings
[15:08:53] I'm not sure how we would handle that though... hmm
[15:09:19] so https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseLexemeCirrusSearch/+/1126111 is starting to add the data to a new field "lemma_spelling_variants"
[15:09:48] right, but it's storing 'Test lemma' with no mention that it's for en-gb
[15:09:54] oh
[15:10:12] and the ticket is asking about language filtering as well
[15:11:39] we could have a separate field that just has the list of available languages, but i fear that might be unsatisfying to end users: they could find a lexeme that has en-gb, and has 'Test lemma', but they aren't matched. But then lucene isn't great at that kind of match without some weirdness in the backend
[15:11:44] lexeme language filtering is done thanks to the existing inlanguage that now has a config to expand its search space: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1121666
[15:11:58] oh! i'm so behind on these...
[15:12:43] for lemma spelling variant it does need additional data
[15:12:56] so they will be able to filter for lexemes that have a variant in en-gb, and a variant that has 'Test lemma', but no guarantee that it's 'en-gb:Test lemma'
[15:13:18] not sure how important that would be
[15:13:19] I wish I could have added a subfield on the existing lemma field but that's not possible (it would not have been super useful at search time anyways)
[15:13:52] yes... we can't do this kind of structured search...
[15:14:14] from the ticket I got the sense that they just wanted an additional on the spelling variant
[15:14:23] additional filter*
[15:14:28] well, we could with some funkiness. Like we could index as `en-gb:Test lemma`, have a subfield that strips `en-gb:` so we can do plain lemma searches, and have a keyword that constructs the combined term?
[15:14:56] or index as two fields if stripping the en-gb is awkward
[15:15:06] we could I think?
[15:15:31] I think we have this kind of transformation for the wikidata statements IIRC
[15:15:53] yea we have a similar thing for turning P1=Q1 into just P1, similar but different
[15:16:19] I guess I can go back to the ticket and ask for clarification
[15:16:41] yea it really depends on which use cases they want. I think this current patch solves some use cases, not sure if they have the other use case
[15:17:21] we don't appear to have a keyword to filter on the lemma itself
[15:17:28] which is strange...
[15:17:58] I suspect lexemes with multiple lemmas are quite rare?
[15:18:31] i was curious as well... random internet search claims idioms, phrasal verbs, and multi-word expressions are also lemmas. Like "looked up" or "kicked the bucket"
[15:18:37] but i dunno if wikidata lexemes capture that?
[15:18:49] Trey314159: any idea? ^
[15:19:09] this is basically claiming anything with a single "meaning unit" is a lemma
[15:19:57] there are phrasal verbs
[15:20:04] https://www.wikidata.org/wiki/Lexeme:L1353405
[15:21:00] ahh, so indeed they are. And should we match sub-lemmas? Like should `look` find `look after`? I suppose i was pondering how that works with constructing a term like `en-gb:look` in a keyword, sounds like it won't work naively
[15:21:59] yes this becomes trickier...
[15:22:01] ebernhardson: looks like dcausse found an example. Phrasal verbs and idioms that are not compositional (e.g., "look up" doesn't just mean looking upward) should have lexeme entries.
[15:22:45] Trey314159: thanks! somehow i missed all these things about lemmas in school... :P
[15:22:54] lol
[15:23:42] i dunno... i suppose we should ask the user about their use cases? We can maybe come up with something but it will probably require some thinking
[15:25:27] sure
[15:25:48] will ask something, also about the naming of the keyword
[15:32:30] would almost need like a custom edge n-gram or some such, that indexes `en-gb:look after` as `en-gb:look`, `en-gb:after` and `en-gb:look after`
[15:32:49] possible, but probably not worth it unless they really need it
[15:34:27] sure
[15:34:54] could be nice to have this token filter, I remember we discussed similar hacks in other contexts
[15:35:11] yea does seem to have come up before
[15:35:19] was probably to avoid having hundres of language field
[15:35:25] *hundrers
[15:35:29] sigh...
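The "custom edge n-gram or some such" idea from [15:32:30] can be sketched in plain Python. This is a hypothetical helper showing the token expansion such a filter would perform, not an actual Elasticsearch/Lucene token filter:

```python
def expand_prefixed_lemma(value: str) -> list[str]:
    """Expand a 'lang:lemma' string into per-word prefixed tokens plus
    the full prefixed lemma, so a keyword constructing 'en-gb:look'
    would match the multi-word lemma 'en-gb:look after'."""
    lang, sep, lemma = value.partition(":")
    if not sep:
        # No language prefix present; keep the bare value as-is.
        return [value]
    tokens = [f"{lang}:{word}" for word in lemma.split()]
    full = f"{lang}:{lemma}"
    if full not in tokens:
        # Also emit the whole prefixed lemma for exact-phrase filtering.
        tokens.append(full)
    return tokens

# expand_prefixed_lemma("en-gb:look after")
# → ['en-gb:look', 'en-gb:after', 'en-gb:look after']
```

As noted above, whether this is worth implementing as a real token filter depends on whether users actually need sub-lemma matching.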
[15:35:37] hundreds*
[15:35:52] but gets into the question of having per-language analyzers... not sure how that would work
[15:36:13] yes that's where we stopped on this idea I guess :)
[15:37:29] one silly idea: pre-analyze with the _analyze api and index bare strings
[15:37:47] well, the prefixed strings
[15:38:40] i suppose in theory a custom plugin could also do that, but i suspect it would be 5x the work
[15:39:17] :)
[15:41:12] * inflatador is having internet issues today
[15:42:18] calling _analyze could be a bit tricky... might not play well with how we manage mapping updates/reindexes and the like
[15:42:32] oh well, back to my much easier problem of re-working mjolnir cli argument parsing... at least this i think i can finish today, the main question will be how many test-run iterations it takes to correct all the mistakes i made
[15:42:44] :)
[15:43:53] i think we can _analyze against an index alias, but i suppose the way we mangle the names makes it significantly more difficult
[15:45:46] i guess if it was a plugin we could include a mapping of language to analyzer name in the analyzer settings... but again a plugin is going to be complex :P Probably too much without a better use case. Maybe useful if we think it could also solve the hundreds of per-language label and description fields
[15:46:07] but it's a significant undertaking, probably not for a single ticket :)
[15:46:53] and the hundreds of fields hasn't turned out to be too much of a problem after we did the analysis-chain dedup
[15:47:24] maybe the answer is hundreds of lemma fields? Although not super happy about adding more... oh well, probably thinking too deeply about this
[15:48:37] * Trey314159 constantly tries to make the analysis chain dedup less effective by adding more and more custom analysis config....
[15:48:59] there are probably still 100+ languages that dedup into `aa_plain`
[15:49:13] so you have some work to do :P Next up: atj
[15:49:19] indeed... I worry about it a little, but not too much
[15:49:43] 6,200 native speakers of atj... i suspect we will never get there :)
[15:50:46] * ebernhardson is a little sad there is no IPA on the enwiki page, i can't pronounce Atikamekw
[15:54:58] ebernhardson: https://lingualibre.org/wiki/Q81626 :)
[15:55:33] nice!
[15:55:46] lingualibre is pretty nice
[16:11:50] I'm better at reading IPA than transcribing it from a recording, so even hearing it said I'm not 100% sure I got it. Wiktionary has IPA for the name in English (/ətɪkəˈmɛk/), though I think the spelling and the native audio hint that it ends with /kʷ/. That said, lingualibre is indeed nice!
[16:26:59] lunch, back in ~1h
[16:49:13] heading out
[17:22:18] aww... hive metastore 3.0 adds an `information_schema` table where we can ask in a robust and structured manner what the partitioning keys of a table are. Of course we have 2.3.6
[17:22:58] it's parsable from `describe <table>`, but feels imprecise
[17:36:30] back
[18:26:47] There's been no update to https://etherpad.wikimedia.org/p/search-standup for a while. Could you please make sure you let me know in there what's going on? Thanks!
[19:05:01] added what i've been doing this week, or at least what i could think of
[19:05:26] there's also some (minor) SRE support for the opensearch migration, but didn't seem worth adding
[19:11:43] ebernhardson: thanks!
[19:32:10] And I'm off on vacation! Back in a week! Enjoy!
[19:46:12] .o/
[19:47:16] ryankemper been working on the shard checker script. It'll need to be renamed since what we're really doing is printing a list of hosts we can safely reimage, but I'm making progress: https://gitlab.wikimedia.org/repos/search-platform/sre/cirrussearch_shard_checker/-/blob/main/cirrussearch_shard_checker.py?ref_type=heads . If you'll be around today, maybe you can help fix it up?
[20:07:33] sure thing!
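Since Hive 2.3.6 lacks `information_schema`, the "parsable from describe" fallback mentioned at [17:22:58] would look roughly like this. A sketch only, assuming the usual Hive 2.x `DESCRIBE FORMATTED` layout where partition columns follow a `# Partition Information` header and a `# col_name` sub-header:

```python
def parse_partition_keys(describe_output: str) -> list[str]:
    """Extract partition column names from `DESCRIBE FORMATTED <table>`
    output on Hive 2.x. Imprecise by nature: it depends on the textual
    section headers rather than a structured catalog."""
    keys = []
    in_partition_section = False
    for line in describe_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("# Partition Information"):
            in_partition_section = True
            continue
        if in_partition_section:
            if stripped.startswith("# col_name") or not stripped:
                continue  # skip sub-header and blank separator lines
            if stripped.startswith("#"):
                break  # next section (e.g. detailed table info) begins
            keys.append(stripped.split()[0])
    return keys
```

This is exactly the kind of screen-scraping that `information_schema` in metastore 3.0 would make unnecessary.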
[20:09:27] cool, fixing it up now so it can handle all 3 endpoints instead of hardcoding the one
[20:19:01] * ebernhardson spent way too much time reading the gitlab stacked patches issue (https://gitlab.com/gitlab-org/cli/-/issues/7473) and still am not sure what i should use... their existing half-solution seems... well, half-baked
[20:20:17] ryankemper I just started a branch for handling all 3 endpoints. I'm taking a break, feel free to hack on it now, otherwise we can work on it in pairing https://gitlab.wikimedia.org/repos/search-platform/sre/cirrussearch_shard_checker/-/tree/all_endpoints?ref_type=heads
[20:43:45] I guess I should wrap the whole thing in a loop, instead of trying to change every single function to work with multiple endpoints
[21:04:59] yes indeed
[21:05:04] inflatador: https://meet.google.com/ozu-gdro-zxg
[21:05:23] ryankemper OMW, okta just logged me out ;(
[21:05:29] classic okta
[21:36:44] same... and then slack does this weird thing where i get logged out of wikimedia slack and it opens up into a different slack that i'm on
[21:37:03] like, not even refreshing the page, i just tab into it and it's a different slack
[21:59:29] (╯°□°)╯︵ ┻━┻
[22:25:22] cirrussearch2078 is reimaging in the -single window, it is having problems with PXE booting; cirrussearch2097 is in -multi and should be done soon, feel free to restart ferm, make sure cirrussearch2097 joined, and force shard reroute when it's done
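The "wrap the whole thing in a loop" refactor from [20:43:45] could be sketched like this. All names here are hypothetical stand-ins, not the actual cirrussearch_shard_checker API: the idea is to run the existing single-endpoint logic once per cluster endpoint and combine the results, rather than threading multiple endpoints through every helper function.

```python
from typing import Callable, Iterable, Set


def safe_to_reimage(
    endpoints: Iterable[str],
    check_endpoint: Callable[[str], Set[str]],
) -> list[str]:
    """Return hosts every endpoint agrees are safe to reimage.

    `check_endpoint(url)` stands in for the existing per-endpoint check
    and returns the set of hostnames that cluster considers safe. A host
    is only reported if all clusters agree, since reimaging a host that
    holds the last copy of a shard on any one cluster would lose data.
    """
    results = [check_endpoint(url) for url in endpoints]
    if not results:
        return []
    return sorted(set.intersection(*results))
```

With this shape, the existing functions stay single-endpoint and only the top level changes, which matches the suggestion in the chat.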