[13:22:13] \o
[14:55:41] workout, back in ~40
[16:08:54] * ebernhardson wonders if haswbstatement could be a little smarter and issue warnings about querying things that aren't indexed
[16:09:18] that would probably require some extra db roundtrips though
[16:12:52] Trey314159: would icu_folding and remove_empty be doing anything useful for wikidata statement_keywords? In https://gerrit.wikimedia.org/r/1064792 I tried to specialize the default lowercase_keyword analyzer to drop some tokens, but due to order-of-operations it seems to have lost those two token filters
[16:13:31] I suppose icu_folding probably does
[16:15:41] ebernhardson: I'm not sure if icu_folding is appropriate for statement_keywords, bc I'm not sure what it's for, but you do want remove_empty after it if you expect any kind of "interesting" input.. icu_folding can fold some punctuationish tokens to nothing.
[16:18:46] Trey314159: statement_keywords is the P= field that is queried by `haswbstatement:`. tbh I'm not even clear on what all can end up in the field
[16:18:59] The types are https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/2462696db2d5023ef6753aab1bc38219f49a4390/wmf-config/SearchSettingsForWikibase.php#63 but that still doesn't say much to me
[16:20:06] looking
[16:23:19] * ebernhardson tries checking with presto... and gets annoyed how presto just has to have a different syntax than hive/spark
[16:30:46] wow, separately surprised to find that with the hadoop cluster reasonably idle and no explicit limits on a pyspark cli shell... it gave me 3k containers to run one query
[16:31:56] and gives a result I wasn't quite expecting... there are currently ~490M unique statement_keyword values
[16:32:07] lunch, back in ~2h
[16:32:42] I guess if it can be a string, then icu_folding could be useful, though I couldn't find any interesting examples. I found a query with a DOI, but in cases where you could expect a variety of characters, like a name, there are Q-numbers (like, "George" is Q15921732, so you search for Q15921732, not "George").
[16:33:00] Hmm. Can you tell how many values are *not* Q-numbers?
[16:33:27] hmm, yea a regex should be able to match P=[^Q]. sec
[16:37:30] Trey314159: sample of 100 such values: https://phabricator.wikimedia.org/P67739
[16:38:25] it's counting the total number
[16:39:12] separately I'll have to check with david when he gets back what we think about having half a billion unique values here... it's a bit surprising to me :)
[16:40:19] Trey314159: excluding ones that start with the letter Q, 361M
[16:40:45] The fact that they can include external ids and URLs means there's no limit on the number, I guess. In your sample, line 39, "Peraküla Allikajärv" is making the case for ICU folding.
[16:41:00] 361M/490M? Wow!
[16:42:08] Actually, that kinda makes sense if it is 490M unique. "Peraküla Allikajärv" might occur once, while Q### might occur thousands of times.
[16:42:16] yea
[16:44:40] in total there are 1185M statements
[16:45:54] I suppose I should then try and understand what the appropriate way is to extend an analysis chain; I was hoping I could simply copy the lowercase_keyword chain in the hook, but it seems like we call the hook before the chain is fully built
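(A minimal sketch of the analyzer behavior under discussion, using Elasticsearch's _analyze API with a transient custom chain. The localhost URL is a placeholder, the analysis-icu plugin is assumed to be installed, and this mirrors, but is not, the actual CirrusSearch lowercase_keyword config.)

```python
# Sketch: exercise a lowercase_keyword-style chain (keyword tokenizer +
# lowercase + icu_folding) via the _analyze API. Assumes a local cluster
# with the analysis-icu plugin; not the production CirrusSearch config.
import requests

resp = requests.post(
    "http://localhost:9200/_analyze",  # placeholder cluster URL
    json={
        "tokenizer": "keyword",
        "filter": ["lowercase", "icu_folding"],
        "text": "Peraküla Allikajärv",
    },
)
# Expect a single token, "perakula allikajarv". Per the discussion above,
# remove_empty would normally follow icu_folding in the chain to drop any
# tokens that fold down to the empty string.
print(resp.json()["tokens"])
```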
[16:46:07] I don't think your sample of 100 is quite random since there are at least 8 Harry Potter items in a row, so I wouldn't say there are only ~1% "interesting" strings.. but I'd expect a lot of boring ids and urls which tend to not have interesting characters in them. For the interesting ones, ICU folding would be nice to have.. though it is really hard to say how often it'll come up in real life.
[16:46:40] yea it's not really random, they all come from a single partition and it probably concats the statements from the first few Q-items it runs into
[16:47:34] I suppose I can get a properly random sample, moment
[16:47:47] it's weird that the analysis chain exists and is around for any length of time before being finished. It's not something I've run into.. but I've never really worked with hooks or other things trying to jump in and out of the process.
[16:49:53] On the one hand, I don't know if we really need a randomer sample. On the other hand, if there are 3000 hadoop containers with nothing better to do, I would be interested in seeing the list...
[16:50:21] yea it runs surprisingly fast with 3k cores and 5tb of memory :P
[16:54:26] Trey314159: a properly random sample: https://phabricator.wikimedia.org/P67741
[16:58:11] it looks like in 2018 we had 102M distinct values, so it's 5x'd since then
[17:01:44] That's interesting! And, yeah, there are plenty of names with diacritics in there. How often people look for them is another question...
[17:02:18] I'm going to make the wild guess that well more than 90% of these terms have never been searched for. But no way to know which ones :P
[17:03:24] ^yeah, that sounds quite likely
[17:16:55] * ebernhardson decides that with 490M unique values and 12k total P values... it's not worth a custom analysis chain to drop them even if they won't be queried
[20:43:29] ebernhardson: okay if I schedule some time with you to discuss the opensearch migration? this'll be for T370148. I'm in just a couple days next week and will mostly be doing the data platform strategy stuff and catching up on the many backscrolls, so would schedule it for the week following next
[20:43:29] T370148: Create high level plan for the migration of the Elasticsearch search cluster to OpenSearch - https://phabricator.wikimedia.org/T370148
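(For reference, a rough pyspark sketch of the counting and sampling queries described above. The table name, the statement_keywords array column, and the P<id>=value shape are assumptions for illustration, not the actual schema.)

```python
# Rough sketch of the queries discussed above; "discovery.cirrus_index"
# and the statement_keywords array column are hypothetical names.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

kw = (
    spark.table("discovery.cirrus_index")  # hypothetical table
    .select(F.explode("statement_keywords").alias("kw"))
)

print(kw.count())             # total statements (~1185M above)
print(kw.distinct().count())  # unique values (~490M above)

# Values whose target is not a Q-item, i.e. P<id>=<something not Q>
non_q = kw.filter(F.col("kw").rlike(r"^P\d+=[^Q]")).distinct()
print(non_q.count())          # ~361M above

# A properly random sample of 100 such values
for row in non_q.orderBy(F.rand()).limit(100).collect():
    print(row.kw)
```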