[10:19:44] lunch [13:13:19] Greetings [13:33:54] o/ [13:59:01] Still working on the docs from yesterday, I added swiftly info to https://wikitech.wikimedia.org/wiki/Swift/How_To#Fine-grained_object_deletions_with_Swiftly and will link to it on the streaming updater page soon (or whatever you think is best) [14:02:23] thanks! [14:03:42] it's nice to link it from the streaming updater page I think, ideally all alerts should have some runbooks/help section to follow when they're triggered [14:22:44] Yeah, 100% agree. OK, it's linked [14:22:52] Heading to coffee shop, back in ~20 [14:31:46] reminder: today is Wikimedia Connect! [14:36:38] \o [14:55:34] back [14:55:36] o/ [15:07:05] random idea, language team has a 'Quarter Backlog' column, maybe that's a closer fit to how we have longer-term grouping of tasks to complete [15:15:27] dcausse: curious if the 429 idea seems reasonable, issue exceptions from the query rewrite stage (which should be per-shard), exponential backoff on the client side. Thinking to put this all directly in the yarn side with some sort of parameter for how many partitions to use. Not sure yet how to decide which cluster to query [15:16:19] i suppose i have to test to verify rejecting from one shard short-circuits the other shards rather than letting them run [15:16:48] i dunno, maybe keep the msearch daemon... [15:23:59] ebernhardson: oops haven't read backlog about that yet, reading :) [15:25:49] the issue is, as we've mentioned, mjolnir msearch daemon is fully-paused now since its threshold is 100qps and codfw sees 1k-2.2k in active-active. The easy solution is increase qps and decrease parallelism, let it run a long time. Harder is some sort of feedback mechanism that tells the client when to slow down [15:26:04] err, increase qps threshold [15:27:59] do we have to send all searches to the same DC? [15:28:11] or can we spread them a bit [15:28:27] we don't have to, i can't ask kafka but according to tickets we have 25 partitions in kafka. There are 8 consumers per dc so it can spread out [15:28:56] i suppose i can ask kafka somehow .... must be metadata. but it's probably 25 [15:29:29] i was more thinking the kafka-topics command requires talking to zookeeper, which is firewalled away from places i can login to [15:29:40] the 429 idea is to have a small proxy that sits in between analytics and elastic [15:30:10] dcausse: no, it's to reuse the DegradedRouterQuery i wrote a few years ago, it has the ability to re-write queries based on cpu load, query load, or latency percentile of a stat bucket [15:30:13] there would be a request header saying do not run me if load is > X [15:30:23] dcausse: we can have it also throw an EsRejectedExecutionException, which turns into a 429 [15:30:29] ah [15:30:53] i already wrote up some test code to verify that works, it does result in a 429, but i still have to verify if that tells the other shards to early-exit [15:31:19] i suspect no now that i think about it :( [15:31:29] so that means we wrap the search requests using that router [15:31:33] yea [15:31:53] I like the idea, could we rewrite earlier? [15:32:14] and instead of using the load use some other metric [15:32:21] i'm not sure how, rewriting from the coordinator node would be a weak-proxy [15:32:59] if we're using another metric like qps in the enwiki index? [15:33:12] not sure we can access that easily tho [15:33:52] yea not sure how to easily access cross-cluster communication [15:33:58] err, intra-cluster communication [15:34:05] basically how to query other nodes :P [15:34:09] yes...
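A minimal sketch of what the client side of the 429 idea above could look like: back off exponentially whenever the server sheds load with a 429, optionally rotating to the other DC. The endpoint URLs, parameter values, and the run_msearch name are illustrative assumptions, not the actual mjolnir msearch code.

```python
import random
import time

import requests

# Hypothetical cluster endpoints; a real client would take these from config.
CLUSTERS = [
    "https://search.eqiad.example.org:9243",
    "https://search.codfw.example.org:9243",
]

def run_msearch(body, max_retries=8, base_delay=1.0, max_delay=120.0):
    """POST an _msearch body, backing off exponentially when the server
    sheds load with HTTP 429 (EsRejectedExecutionException server-side)."""
    delay = base_delay
    for attempt in range(max_retries):
        # Rotate clusters so a single overloaded DC doesn't block all progress.
        url = CLUSTERS[attempt % len(CLUSTERS)] + "/_msearch"
        resp = requests.post(
            url, data=body, headers={"Content-Type": "application/x-ndjson"})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # 429 means "slow down": sleep with jitter, then double the delay.
        time.sleep(delay + random.uniform(0, delay))
        delay = min(delay * 2, max_delay)
    raise RuntimeError(f"still being throttled after {max_retries} attempts")
```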
[15:34:57] assuming we have something that works, this would be nice, the client would just pause on 429 or even try the other DC [15:36:19] ok, i'll probably poke at it a bit more, but i suppose trying not to today for connect [15:36:48] the other alternative is to keep msearch and increase the threshold basically and still use a single DC? [15:37:40] yea, but that's a bit awkward, codfw high is higher than eqiad low, would probably have it set such that it only queries when codfw is not at peak [15:37:53] :/ [15:38:53] and then it's mostly guessing about how much to set the limits to, i imagine we have to guess and look at load, and leave it somewhere sane-ish. More error prone and manual up-front tuning, but probably works [15:39:03] ok, overall this 429 idea seems nice to me, it sounds flexible and more reactive to load than the current on/off approach of msearch [15:41:22] going offline early, have a nice week-end [15:41:48] take care [16:52:39] Lunchtime, back in ~1 hr [18:18:41] * ebernhardson notes we have ExpectedIndices.php, and CheckIndexes.php. Can't decide on how to pluralize :P [18:24:25] to our credit, the rest of the world can't decide either, although indices seems to be leading since the 60's: https://books.google.com/ngrams/graph?content=indexes%2C+indices&year_start=1800&year_end=2019&corpus=en-2019&smoothing=3 [18:25:10] although lucene has 1335 mentions of indexes, and only 199 for indices [18:44:03] I vote for "indexes" because it avoids the "indicee" problem. If you talk about "indices" long enough, someone will eventually unconsciously try to depluralize it as "indicee". When I taught math, the same problem with "matricee" would come up now and again. That said, I probably use both "indexes" and "indices" regularly, and I actually prefer "matrices"—"Do I contradict myself? / Very well then I contradict myself, / (I am [18:44:03] large, I contain multitudes.)" [19:18:07] back [19:28:29] Hi!  I think this feels like the right place to ask this - [19:31:39] I'm doing a conference presentation on search.  I wrote an app that downloads all the wikipedia articles, cleans them, and indexes them into a solr 9 installation [19:32:23] I've architected a bunch of search implementations in the past, and the point of the conference talk is to encourage people to jump into search engine development [19:33:26] so - is there anyone on this channel who might be able to tell me about any pain points they've had with search?  any major challenges they run into? any features that they think could help improve search results? [19:36:10] Basically, I'm trying to figure out a cool way to implement a more advanced feature of solr to help make the result more relevant than the OOTB BM25 ranking system.  There's a lot of data I see that could potentially help (location info, keywords from the redirect info, etc).  If anyone is interested, I'd love to pick the brain of anyone here who works on [19:36:11] search [19:37:17] the code will be downloadable and available through my github account.  I'll make it open source.  It'll be my one-trick-pony for future presentations [19:38:39] Kristian: ebernhardson and Trey314159 might have ideas to share. dcausse as well, but it is definitely too late for him (he is on CEST). If no one is around on a Friday afternoon, you can try again next Monday. [19:39:05] oh no rush!  I have 3 weeks to get it done [19:39:09] Or join our Search Platform Office Hours!
Next one will be on October 5 and will be announced on https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours [19:39:12] but this helps A LOT [19:39:42] I'll be there - my presentation is on the 12th, so I might try and make a convo earlier [19:39:43] October 5 might be a bit late for your timeline... [19:39:46] haha yeah [19:39:52] at least it's not last minute! [19:40:20] I mean, I can whip up SOMETHING between the 5th and 12th.  It wouldn't be my first conference code download that has shameful hacks [19:41:27] I don't need it to be anything super-fancy, but I'll get more kudos if I can create some sorta training model from existing metadata to use a learn-to-rank feature.  I do see that you also have some click data and location data available, for example.  Both of those can def be used [19:42:46] but i'll focus my questions on what the search team experiences and what they'd like to have... see if that's anything worthwhile to add to the presentation - nothing like trying to solve a real-world problem that can help [19:44:33] A few things that we'd like to do to improve the Search experience (off the top of my head): [19:45:25] sweet!  Already getting good info [19:45:26] 1) better assistance in writing queries, contextual autocomplete [19:46:24] 2) we always want better support for more languages, especially languages that are less represented, less supported by mainstream [19:46:42] re:#1 I've worked on that before - the first version on etsy in 2010, but I too have trouble with that at my work.  I still do [19:46:51] in terms of what has historically made the biggest changes, better tokenization especially in non-latin languages. popularity information was also a big win, but only when it was done using LtR; applying it as a naive boost wasn't nuanced enough, but decision trees made better decisions about when to use popularity [19:47:24] 3) diversity search, exposing searchers to more topics and not just the most obvious one [19:47:26] what search engine do you use? [19:47:39] elasticsearch, we helped write the LtR plugin for elasticsearch so we could do that [19:47:42] we're using Elasticsearch at the moment [19:48:24] I'll let ebernhardson take over, it's way too late to be on IRC here :) [19:48:35] lol, yea you should sleep :P [19:48:41] ahh that's so cool... yeah I've been on solr since 2010 ("they keep sucking me in").  I love ES though, but just don't use it day-to-day now.  That might be too close to a deadline for me to include [19:48:56] it's 10pm, not time to sleep yet, but definitely time to be off IRC [19:49:00] gehel thanks so much.  Totally intrigued by the tech there these days [19:49:09] we initially intended to deploy solr prior to elasticsearch, but operationally it just wasn't there. That was probably ~2011 or so [19:49:17] like, been working on this for about 2 days [19:49:23] solr required too much hand-holding and we needed thousands of indices [19:49:32] (we have ~1k wikis to search) [19:49:38] I did solr in 2010 at etsy, we had the branch version of it - we didn't like the mainline ones and it was a PITA [19:49:54] it was 1.4 back then [19:50:10] It's far better now, but def shows a lot of signs of aging [19:50:24] Hey, Kristian.. I'll be jumping in shortly with some thoughts.. but I need to ponder a bit [19:50:39] sweet!  thanks too!  This is exactly what I wanted [19:50:53] depending on which set of users is involved, the editors want extremely exact matching. We give them regexes but those regularly time out on large wikis.
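For context on the "naive boost" ebernhardson contrasts with LtR above, the naive version is roughly a function_score query that scales BM25 by a popularity field for every query; the field name and weighting below are illustrative, not the production CirrusSearch configuration.

```python
# Naive query-time popularity boost: every document's BM25 score is scaled by
# its popularity, regardless of what the query actually needs.
naive_popularity_boost = {
    "query": {
        "function_score": {
            "query": {"match": {"text": "some search terms"}},
            "field_value_factor": {
                "field": "popularity_score",  # illustrative field name
                "modifier": "log1p",
                "missing": 0,
            },
            "boost_mode": "multiply",
        }
    }
}
# An LtR model (e.g. gradient-boosted decision trees) instead learns *when*
# popularity should matter, so a rare, very specific query isn't drowned out
# by generally popular pages.
```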
[19:50:58] I'm trying to teach myself some of the semantic search aspects of solr right now [19:51:34] trey is one of our regex culprits :P He has a thing that finds mixed-script words to fix [19:51:35] oh.. interesting on the regex part... [19:51:36] (as an example) [19:51:42] I am soooo awful at regex [19:51:53] I've always had a coworker who sat next to me who was better [19:52:00] I'm glad gehel and ebernhardson both mentioned language-related things! [19:52:09] so I would hold my hands up and use that ego trick to have them do it for me [19:52:15] :) [19:53:05] My current employer only has english-based searches.  the only cross-cultural thing I've done in the past was currency conversion stuff in solr.  came up with the design for that in solr, but it's since been taken out:/ . [19:53:24] so about the other languages stuff [19:53:34] are you ever doing any translations? [19:53:36] i suppose other big wins, perhaps obvious, but when we switched from strict prefix matching of titles to a fuzzy levenshtein-distance based search (using FST's, completion suggesters in elasticsearch) that made a huge difference [19:53:48] that's only for autocomplete of titles though [19:54:02] we do transliterations, particularly in chinese but maybe elsewhere. But not translation [19:54:10] got it [19:54:53] we do some language detection, but that is limited to what we call poorly-performing queries, suggesting users try their search (with links) on a language that their query looks like. But language detection of short strings is tricky [19:55:22] can you tell me what sorta LtR features you use to help with your model?  Like, I imagine the click data, IP address data, location data, and a few more pieces of data can be helpful [19:55:53] especially since a lot of articles use other languages for their names.  Like product names, right? [19:56:12] hmm, i exported models for someone a few months ago, you should be able to get the feature lists from here: https://people.wikimedia.org/~ebernhardson/cirrus_models.20220518/ [19:56:13] (first thought on that - could be totally off - but my first guess) [19:56:34] that's so awesome [19:56:51] the features we use are pretty naive, ~250 different features are collected and an algo called MRMR is used to pick 50 for each wiki. We sadly haven't really dove into the details of what gets selected to see what would be better [19:57:33] oh, if I even show one of these features with a solr index it would be all I need. [19:57:42] the set of defined features is here: https://people.wikimedia.org/~ebernhardson/cirrus_models.20220518/ [19:57:51] doh wrong link :P sec [19:57:53] https://github.com/wikimedia/search-MjoLniR/blob/master/mjolnir/featuresets.py#L230 [19:58:10] I got it now to a point where it parses all the data ok - do you mind if I ask - how do you clean the wikimedia content to be in the ES index? [19:59:06] mostly this bit: https://github.com/wikimedia/mediawiki/blob/master/includes/content/WikiTextStructure.php#L153-L180 [19:59:12] Right now I used a project called wikiclean - it worked great from what I can see.  so it's not a pain point for me at all.  But I'm sure I didn't do the best implementation [19:59:43] we take the rendered html content, remove some things based on css selectors, rip out some non-content data into a secondary field (again, css selectors), then strip all the html [19:59:59] non-content is things like thumbnail captions, etc.
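A rough sketch of the cleaning approach described above, here using BeautifulSoup; the CSS selectors and field names are illustrative stand-ins, not the actual lists in WikiTextStructure.php.

```python
from bs4 import BeautifulSoup

# Placeholder selectors; the real lists live in WikiTextStructure.php.
REMOVE_SELECTORS = [".navbox", ".mw-editsection", "#toc"]   # dropped entirely
AUXILIARY_SELECTORS = [".thumbcaption", ".searchaux"]       # kept in a secondary field

def clean_rendered_page(html):
    soup = BeautifulSoup(html, "html.parser")

    # 1. Remove elements we never want indexed.
    for selector in REMOVE_SELECTORS:
        for node in soup.select(selector):
            node.decompose()

    # 2. Rip out "non-content" text (captions, tables, ...) into a secondary field.
    auxiliary_text = []
    for selector in AUXILIARY_SELECTORS:
        for node in soup.select(selector):
            auxiliary_text.append(node.get_text(" ", strip=True))
            node.decompose()

    # 3. Strip all remaining markup for the main text field.
    return {
        "text": soup.get_text(" ", strip=True),
        "auxiliary_text": auxiliary_text,
    }
```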
[20:00:02] tables [20:00:28] you can see the end result here, these are the actual indexed docs (needs a json formatter): https://en.wikipedia.org/wiki/Main_Page?action=cirrusdump [20:00:30] oh wow - I was thinking about doing that, too... that's what I do now [20:00:36] you can use that action on any page [20:01:08] that's really cool [20:01:10] we also have dumps, you can download a direct dump of our elasticsearch indices here, they are formatted such that they can be piped into elasticsearch _bulk endpoint: https://dumps.wikimedia.org/other/cirrussearch/current/ [20:01:23] this is amazing [20:02:15] I'm embarrassed to say, I've not used ES in about 8 years [20:02:24] i've not used solr ever, so we're even ;) [20:02:28] hahahahah [20:02:49] the solr/elasticsearch decision was made ~a year before i joined the team [20:02:52] ES people are nicer - I presented at some ES conference a while ago and they were more than happy to have a solr presentation [20:03:25] i wore an elasticsearch at a solr conference, back before it was renamed to activate. Got a few funny looks :) [20:03:31] an elasticsearch shirt [20:03:58] you may have made the right choice.  it's a hotter search engine now.  Solr isn't getting as many committers these days and everyone raves about those cute charts it makes from that snazzy front end [20:04:21] hahahah activate is ONLY virtual now [20:04:23] yea but solr is where relevance happens, i went to solr conferences because the talk about relevance at elasticsearch conferences was really quite naive [20:04:26] so joke is on them [20:04:28] at least, at the time. [20:04:40] oh, it's pretty deep at the solr stuff [20:04:57] almost everyone with a big solr engine writes a ton of query parsers, security plugins, and tokenizers [20:05:34] there are a lot of great java libraries too, they seem to be easier to use if I had to guess [20:05:58] I mean, it took about 2 nights to get to the point where I felt OK to mention it here [20:06:26] i suppose a random plug, but usually these days i get relevance ideas from https://relevancy.slack.com/ [20:06:41] ahhh cool... [20:07:20] Coming in late and from a random direction.. I think one of our big challenges is understanding what searchers want. We constantly have new users who we can't easily teach or train. We also have bots (API and screen scrapers) using our services; that's generally a good thing, but bots can make ridiculous automated queries in bulk, which can skew our ability to understand what real humans want and need. [20:07:22] Do you ever use location data with one of your feature sets? [20:07:34] One example is our zero-results rate, which we originally wanted to try to decrease dramatically. After looking at the data, though, there are plenty of queries that deserve no results: odd bot-made requests, broken tools sending malformed queries, cut-n-paste errors (like 1000+ characters of text from a book), repetitive scanning tools, and unintentional denial-of-service attacks from overly enthusiastic researchers with a weird [20:07:34] idea and a rack of servers. [20:07:42] Another example: We had to "retrain" our sophisticated users to use "\?" instead of "?" as a wildcard, because naive users ask questions (with question marks) and get unhelpful results when they don't realize they are requesting an extra letter on the last word in their question. [20:08:08] We don't currently use location data. [20:08:25] while i suspect personalization could show improvements, it's hard to apply in our context.
Additionally, at least for our editor cohort, they find it important that two editors with the same search string get the same results [20:09:02] I was going to try that out for fun.  I was thinking of using location data for multiple areas - location from the IP of the searcher, then the location of articles (when available in metadata), editors' locations based on IP... see if something comes out of that [20:09:06] We've thought about it, but we generally shy away from using user-specific data of any kind for privacy reasons.. plus what Erik said about editors wanting the same results all the time. [20:09:31] that makes total sense [20:09:43] but you use click data as well? [20:09:45] i suppose our version of location is 900 language wikis, the french wikipedia has much better articles about france than english :) [20:09:56] indirectly, click data trains the ML models [20:10:08] ahh that's really cool [20:10:12] An interesting idea was for searching on Commons. So when you search for "taxi" in the US, you might get a yellow taxi, but in the UK you might get a black cab.. etc. [20:11:13] so is that using some sorta semantic searching to connect taxi to black cab?  or you think it's mainly location with "cab" being a synonym for "taxi"? [20:12:44] i would probably try to use location boosting, and hope that pictures in commons have location data (many do), but haven't really experimented [20:13:14] Sorry, in my mind "black cab" is a specific thing (in London), and "yellow taxi" is a specific thing (in NYC). But the idea would be to use location data from the user and geotagging on the image, and structured data identifying the images as depicting a taxi/cab and .. voila! [20:13:36] As Erik said, just an idea.. we haven't looked into it yet. [20:14:20] i suppose of note that searching on commons (the media archive) is different. while the large wikis use LtR, commons does query expansion using wikidata (and maybe other stuff). You can see the kind of query that gets generated here and all the expansions, both with text and wikidata statements (the statement_keywords field): [20:14:22] https://commons.wikimedia.org/wiki/Special:MediaSearch?search=example&cirrusDumpQuery [20:14:47] Pxxx=Qxxx are edges of the wikidata graph [20:15:07] or the commons graph...there are actually two [20:15:21] In response to your original question, it's actually hard to come up with big general problems that need big general solutions, because search works pretty well if you have decent language processing for a given language. [20:17:08] interesting - so your team does both commons and wikipedia? [20:17:21] One thing that still kinda grinds my gears is that we don't have a good general purpose tokenizer that works for all languages and scripts. The ICU tokenizer is amazing, and does decent segmentation for lots of languages, but still has some stupid bugs (which I can't convince anyone upstream to care about). [20:17:23] commons search was done by another team that we support [20:17:24] I guess that's a yes looking at the link above [20:17:43] it's on our servers, but they wrote the query expansion and built a UI (our team doesn't have any frontend ppl) [20:17:51] and uses the indices we maintain [20:17:52] I can't imagine an all-language tokenizer ever coming out [20:18:07] I'm really, really bad at front end [20:18:15] vi4lyfe [20:18:19] :) [20:18:55] @tre [20:19:44] Trey314159 you ever use the "this is to help general knowledge of likely a billion people" as a way to get them to care?
I bet that wouldn't work either [20:20:29] I'm shocked at how many languages the ICU tokenizer does handle. Other tokenizers (or stemmers, or other filters) can do really bad things, like throw away all characters not in Latin or Cyrillic (it was an easy fix, though!) [20:21:00] a problem that trey is perhaps alluding to, you will find a hundred different languages on any major wikipedia or wiktionary. Especially in other languages trey is often fighting analysis chains in lucene that never considered what happens when other languages run through them [20:21:09] oh, i've thrown away latin and cyrillic letters - or just converted them to similar looking characters [20:21:23] things like names, poems, etc. [20:21:46] I feel like this conversation could make a really cool article, hearing about all this stuff [20:22:04] Yeah. there are easily 40-50 scripts (and 50+ languages) on any of the top-10 Wikipedias. [20:22:09] you have no idea how much I'm feelin' your pain on these points and how interesting I find the problem [20:22:49] I've written a few blog posts over the years: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Blog_Posts [20:23:20] Other links on that page go to a lot of my other work, mostly language stuff, but some zero-results stuff and other things, too. [20:23:48] this is really cool [20:23:59] I'm on the page now [20:25:56] I'm diving more into the more advanced features of search - specifically learning more about matching based on sentiment instead of BM25/TF-IDF matching.  I've tried before in the past but you really had to go to a university or a meetup to get information.  But now that it's so much more important (and advanced!) I'm diving back into it. [20:25:57] Hence, why I'm really appreciating this information [20:27:14] BTW, the ICU tokenizer bug that gets me is fairly subtle and only causes problems with mixed-script text with tokens that start with numbers.. so it doesn't affect everyone and it's often not a big deal. But it means that in "x 123,456,789 3.14159 2.71828 3a", the last two characters will be tokenized as "3a".. but in "д 123,456,789 3.14159 2.71828 3a", they will be tokenized as "3" and "a". Not intuitive.. possibly easy to fix.. [20:27:14] but the discussion turned into a slap fight so I gave up. I just hate it because it is so *wrong*... [20:27:15] I guess you can say I've done some advanced features, but none of it had to do with training models.  Some minor NLP, but this is going to be a lot to chew on and should prove to be fun. [20:27:57] but you want it tokenized as "3a" then, right? [20:28:16] I do, yes. [20:28:21] (in the business world we say "that's a new feature and we don't support tokenizing on that") [20:29:04] for lucene, I'm sure patching the tokenizer could be possible.  When creating the tokens, there could be some sorta rule engine you could make to handle exceptional cases it can identify [20:30:06] like, I do know I've had it tokenize on numbers in the past where it would make "3" and "a" tokens from the "3a" entry [20:30:20] that is in solr though, that could be from the solr and not lucene code [20:30:38] It may be too late to help your presentation, but you should definitely stop by our office hours (usually the first Wednesday of the month) and we can chat about this stuff or anything else! [20:31:01] I had a similar problem when working on medical equipment - the identifiers were often searched for and had both numbers and letters in them.
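The tokenization difference Trey describes is easy to reproduce against any Elasticsearch node that has the analysis-icu plugin installed, via the _analyze API; the localhost URL just assumes a local test instance.

```python
import requests

ES = "http://localhost:9200"   # assumes a local test node with analysis-icu installed

def icu_tokens(text):
    """Run text through the icu_tokenizer and return the produced tokens."""
    resp = requests.post(f"{ES}/_analyze",
                         json={"tokenizer": "icu_tokenizer", "text": text})
    resp.raise_for_status()
    return [t["token"] for t in resp.json()["tokens"]]

# Same trailing "3a", different leading script:
print(icu_tokens("x 123,456,789 3.14159 2.71828 3a"))
print(icu_tokens("д 123,456,789 3.14159 2.71828 3a"))
# Per the discussion above, the first call keeps "3a" as a single token while
# the second splits it into "3" and "a".
```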
[20:31:08] tbh i never could wrap my head around sentiment based search either :) there are interesting things happening there, especially with dense-vector based representations, but i suspect the latter is much too expensive for our query rate (~500qps full text, up to 1k depending on how hard the bots hit us) [20:31:29] As for "3a", the problem is that changing that initial x to д changes the tokenization of 3a at the end of the string. [20:31:40] ahhhh [20:31:55] ok, I figured it would be a different issue [20:32:31] yeah, I know the presentation will be fine and it won't be for a big audience.  I'll eventually (hopefully I don't get lazy about it) expand it to something cooler [20:33:01] I generally prefer keeping letters and numbers together in tokens (at least in languages with spaces), but I can understand the opposing view. My problem here is the *inconsistency* and long-distance dependency on the cause of the inconsistency. [20:34:15] by default solr does tokenize with the letters and numbers, but there are so many tokenizers to choose from and it's not bad to make your own [20:34:38] so regarding your query rate - [20:34:49] do you separate bot traffic from non-bot? [20:35:03] I've done a lot of analytics, there's def a lot of ways to identify that [20:35:37] not just with the http header too since bots can lie easily about who they are.  that whole http header stuff has a lot of trust behind it [20:36:04] but earlier I saw that some of the training data was thrown off due to bot traffic queries [20:36:43] not too much, there are some naive attempts to separate them out but nothing particularly amazing. Our site usage policy asks them all to put the word 'Bot' in the user-agent but as you can imagine that's of limited use :) Our training data isn't too affected by bots because we only use web search, and bots can easily use our API's instead [20:37:25] ahhhhhh [20:37:40] ok cool.. so most of the bots play nice and just use the APIs [20:37:42] training data is web search + clickthrough, bots tend to use the api and even if they use the web interface they don't tend to click through. The rest of the filtering is mostly volume based, if your IP issues 5k queries a day we throw it out [20:38:17] To nerd out a bit: the ICU tokenizer splits on character set changes, so "xдxдx" would be 5 tokens. I disagree, but I understand. Numbers and punctuation don't have their own character set, so they inherit the character set that sequentially comes before them. [20:38:19] The problem is that they don't reset the character set to "none" when they cross a space or other character guaranteed to split a token. So in the first instance, "123,456,789 3.14159 2.71828 3" is considered all "Latin" because it follows an x, while in the second instance it is all Cyrillic because it follows д—which causes different behavior when it reaches that last "a". [20:38:29] ahhhhh... that's a pretty simple way to filter out IPs that would mostly be effective [20:39:45] Also, depending on what we are gathering data for, we sometimes only take one query per IP per day, so that if a bot slips through it doesn't skew our data too much. [20:39:51] Trey314159 - digesting what you said about ICU tokenizer [20:40:26] No worries.. it's nerdy and weird... it's nice to just vent sometimes.
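A toy version of the volume-based filtering and one-query-per-IP sampling mentioned above; the 5k/day cutoff comes straight from the conversation, while pandas and the column names are just an assumed shape for the log data.

```python
import pandas as pd

DAILY_QUERY_CUTOFF = 5000   # "if your IP issues 5k queries a day we throw it out"

def filter_bot_traffic(df: pd.DataFrame) -> pd.DataFrame:
    """df is assumed to have columns: ip, day, query, clicked."""
    # Drop everything from IPs whose daily volume looks automated.
    per_ip_day = df.groupby(["ip", "day"])["query"].transform("count")
    df = df[per_ip_day <= DAILY_QUERY_CUTOFF]

    # For some analyses, additionally keep only one query per IP per day so a
    # bot that slips through can't dominate the sample.
    return df.groupby(["ip", "day"], as_index=False).first()

# Usage: clean = filter_bot_traffic(raw_search_logs)
```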
[20:40:52] vent away - I may ask a couple 101 q's since I don't do much internationalization work [20:40:58] (or correct my own spelling) [20:41:02] No problem [20:42:59] so getting to the basics - you're talking about when you're tokenizing 2 different character sets.. one latin and one cyrillic (in your example) - and the ICU tokenizer will tokenize the numbers and punctuation based on the character set that comes just before them... but they don't (in the input) reset the character set back to "none" so it causes the [20:43:00] ICU parser to puke [20:44:20] that's what you're sayin, sî? [20:45:00] Not puke like throw an exception - it just produces garbage tokens because of it [20:45:29] Pretty much! [20:45:53] I got it... I wonder if the tika parser ever dealt with this before [20:46:20] It just holds on to the Cyrillicness across all the numbers, spaces, and punctuation and when it reaches the "a" it says, "well, that's different! New token!" [20:46:24] probably not - they probably just choose one language set for all the text [20:47:20] one sec - I'll look at the lucene code for a sec.. just to get deeper into this [20:47:32] see if I can spot the line of code that says "NEW TOKEN!!!" [20:48:17] The Elastic standard tokenizer (also probably Lucene) does not care so much about character sets, so you get effectively unsearchable tokens like "chocоlate".. because the second "o" is actually a Cyrillic character! [20:48:30] But we have a plugin to fix that! [20:48:56] I hate cyrillic chocolate.. i prefer german chocolate [20:49:12] LOL! [20:49:36] Yeah, I'm curious because the tokenizer on lucene 9.3 doesn't seem to be there.. it may have branched into its own project or it was refactored [20:49:41] and it's in a different location [20:49:53] I saw it in lucene 7.3 from a google search [20:50:12] usually with the lucene project documentation if you change the version on the URL string to the latest, voila it shows up [20:52:53] https://lucene.apache.org/core/9_3_0/analysis/icu/index.html [20:53:16] that was way harder to find than I wanted it to be.  But only took a couple minutes.  Better than flipping through a book ;) [20:55:32] do you see any use cases that look like they're good in this javadoc?  I'm still wrapping my head around the examples but don't see anything that hits what you were saying. [20:59:55] i haven't previously looked into the exact details, but i would suspect it's about how the icu CompositeBreakIterator, which decides where to split tokens (i think), uses ScriptIterator to help it decide and works off script boundaries [20:59:58] OH! Another question - when the dumps happen, is there a way to know what the latest completed date is?  Right now my code first downloads the md5sums of the most current backup, but that list is only what's done so far in the latest dump if it's in-progress [21:00:20] The "Text Segmentation" section says it implements Unicode UAX #29 ( http://unicode.org/reports/tr29/ ), which includes the line "Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”)."—which is why I always use "3a" in my example. [21:01:19] how did you work around it [21:03:53] (sidenote - I once tried to get some people to get the prince symbol as a unicode symbol.  it didn't work:( . I think it's a valid character and should be reconsidered.
U+1F934 is a literal prince emoji, I was hoping to get the rock musician prince symbol - likely a trademark though) [21:04:37] Kristian: sadly the dumps don't have a marker, but they go one at a time in alphabetical order, the closest guess as to whether a wiki is done dumping is whether the next wiki has started [21:05:13] I saw that one of the pages has text that says "in progress" [21:05:22] I guess I can scrape for that? [21:05:34] Kristian: additionally the script looks like it doesn't change the `current` symlink until it's done for the week's dump (if you mean https://dumps.wikimedia.org/other/cirrussearch/current/) [21:05:55] The implication from the Lucene guy I talked to is that it splits on scripts first, *then* splits on spaces or whatever, which causes the problem. I could probably figure it out in a day or three, but after our discussion in Jira I don't think they'd accept a patch from me. It was a bit tense and did not end well. [21:05:55] yeah.. I noticed that [21:06:29] do you have the jira ticket URL? [21:08:12] I won't comment - I'm interested to see the lucene guy's reasoning in more detail [21:08:20] We didn't work around the 3a problem. We just accept that some tokens are borked and unfindable if we use the ICU tokenizer. That happens to other tokens for other reasons, too. We fix what we can and (try to) focus on the bigger bang for the buck. (Chinese Wikipedia freely mixed Traditional and Simplified characters, for example—that was a **huge** problem for search until we addressed it.) [21:08:25] URL: https://issues.apache.org/jira/browse/LUCENE-9754 [21:09:22] Robert Muir, I've seen him speak a few times at NY Solr meetups [21:10:06] He wouldn't remember me, but def saw him at least a few times.  One of the times was at a meetup the company I worked for hosted [21:10:30] there aren't as many solr meetups anymore though, those were good places to make those cases face-to-face [21:11:27] https://www.meetup.com/nyc-apache-lucene-solr-meetup/ it has 1400 members, lucidworks sorta hijacked it and covid sorta killed it [21:13:11] Okay, gang.. it is after 5pm on a Friday and my daughter has just arrived home from school for the weekend, so I'm outta here! Kristian, it was fun talking to you. Hope to see you in our office hours on the 5th! [21:13:58] I'll be there.  If I come here before that it might be to help with something in the metadata.  But I'll try to work with what I got so far [21:14:47] thanks a lot! this was a ton of help and I'll point you to the repo closer to the date.  it'll be duct taped together, but it'll run in all java so it should work on most platforms without major issues [21:16:34] have fun! [21:19:32] This has been plenty enough for me to look into my next steps.  Especially sharing the parser / indexing code and the training data.  That's so appreciated.  I'm going to first test location via LtR.  I sorta have no idea if it's going to help anything, but it's going to be a fun ride.  Thanks!  Enjoy your weekends! [21:29:54] out for school run
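Since the cirrussearch dumps came up a few times above, a minimal sketch of streaming one of the dump files into a local index through the _bulk endpoint; the filename, index name, and localhost URL are examples only, though the dump files themselves are already in _bulk format as noted in the conversation.

```python
import gzip
import itertools

import requests

ES = "http://localhost:9200"                                # assumed local test node
DUMP = "simplewiki-20220926-cirrussearch-content.json.gz"   # example filename
CHUNK_LINES = 2000   # action + source lines; keep this even so pairs stay together

def load_dump(path, index):
    """Stream a cirrussearch dump file into Elasticsearch in _bulk-sized chunks."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        while True:
            chunk = list(itertools.islice(f, CHUNK_LINES))
            if not chunk:
                break
            resp = requests.post(
                f"{ES}/{index}/_bulk",
                data="".join(chunk),
                headers={"Content-Type": "application/x-ndjson"},
            )
            resp.raise_for_status()

load_dump(DUMP, "simplewiki_content")
```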