[09:48:17] re intitle regex: a very naive way would be to get rid of the ^ and $ on the ngram approximation but re-apply them on the second pass, not super fun but could work
[09:50:29] might be more optimal (e.g. for the title "abc") to index boundary ngrams with a '/' markup char: ab and bc
[09:51:43] for the Lucene engine I wonder if you can simply revisit the generated automaton and rewrite it to transform '^' into that markup
[09:52:07] there's perhaps no need to interpret it directly
[09:59:57] hm.. perhaps not that simple since you might need to distinguish ^ used in /^foo/ vs /f[o^]/ or even /f\^/ ...
[10:00:51] lunch
[13:15:04] o/
[14:00:02] \o
[14:00:07] o/
[14:00:12] dcausse: indeed i was thinking the same, first pass implement it only in the re-checker
[14:00:24] then consider how necessary it would be to add to indexing as well
[14:00:48] yes makes sense
[14:01:22] i hadn't considered walking the automaton...i suppose it might be possible to simply create a modified automaton from it
[14:03:01] yes but I suspect the hard part is distinguishing the two cases where ^ might appear, the Lucene parser might push them the same way into the output automaton
[14:04:57] I suspect that the simple case where ^ and $ appear at the boundaries is more than enough anyways
[14:06:22] playing with the unified highlighter, it's happily accepting any string as a locale for the boundary scanner
[14:07:08] makes me wonder how to verify that it's actually using the right locale based on the wiki language code...
[14:07:37] yea, i was thinking about it overnight...and while it would be nice to properly support generic regex syntax, the effort involved vs first/last char is significant and adds both maintenance and risk
[14:07:58] yes definitely
[14:08:03] first/last char is some simple string manipulation, and adjusting our .*(regexp).*
[14:08:27] the other involves real parsing, either a pre-parse and stringify, or forking RegExp.java from Lucene, neither seemed great
[14:08:40] yes...
[14:09:42] for the locale, that's curious
[14:19:24] it's using Locale.forLanguageTag() and that one will always create a Locale object
[14:21:31] hmm, yea looking at the doc that's not much of a guarantee. We might have to build up a proper mapping
[14:25:27] indeed...
[14:54:06] sigh... playing with the underlying BreakIterator I can't make any sense of how the language affects the split, only found Thai to have some impact...
[15:00:11] looking at the experimental highlighter to see if we do anything special for finding boundaries
[15:01:33] we use the "scan" fragmenter which is char based, the list of chars is hardcoded in the Java code
[15:02:20] huh, unexpected
[15:04:21] there's a sentence fragmenter but we never enabled it somehow... that one uses a Locale too and is based on similar Java components (BreakIterator)
[15:05:07] that would be new territory for us I guess
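Aside: a minimal standalone Java sketch of the Locale.forLanguageTag() / BreakIterator behaviour discussed above (the class name and sample strings are made up for illustration, this is not CirrusSearch code). It shows that bogus language tags never fail and that sentence BreakIterators silently fall back for unknown locales, which is why a mapping from wiki language codes would need explicit verification.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class LocaleFallbackDemo {
    public static void main(String[] args) {
        // forLanguageTag() never throws: it always hands back some Locale.
        System.out.println(Locale.forLanguageTag("xx").toLanguageTag());        // "xx" — not a real language, accepted anyway
        System.out.println(Locale.forLanguageTag("not a tag").toLanguageTag()); // "und" — ill-formed input collapses to the root locale

        // BreakIterator likewise never complains about an unknown locale; it
        // silently falls back to default rules, so a wrong language code
        // degrades quietly instead of erroring out.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.forLanguageTag("xx"));
        it.setText("First sentence. Second sentence.");
        int sentences = 0;
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            sentences++;
        }
        System.out.println(sentences); // 2
    }
}
```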
[15:06:46] wondering how cautious we should be here... add config options to vary the boundary scanner so that we can possibly fall back to a simpler char-based technique
[15:14:56] i'm really not sure, i feel like highlighting has always been under-analyzed so not sure what's appropriate
[15:15:09] analyzed like, by us understanding what's good/bad, not fragmenting/etc
[15:19:17] yes... not sure how much effort we'll put into this and whether I should opt for a highly configurable system or stick to something close to what we have...
[15:23:21] hmm, i guess it depends how much we want to work with highlighting. If we want to revisit a few times and run some tests, it would be nice to be configurable. But i suppose we don't yet know how much of an effect highlighting might have on metrics
[15:24:30] lol...just got a notification from my internet company. Service went out last night and they were supposed to send me a txt when it was back up. Service came back about 6:30pm, notification came through 9am the next day :P
[15:24:53] 8:30 i guess
[15:25:26] LOL, someone should tell them about not using their own infra for outage notifications ;P
[15:57:28] workout, back in ~40
[16:46:42] back
[16:57:39] cirrus config options like CirrusSearchUseThisNewFeature: bool are annoying... hard to compose anything with those...
[16:58:53] here I want to still be able to use the cirrus highlighter for some specific cases but not everywhere...
[17:02:49] does sound annoying :S
[17:07:50] I have a plausible solution to the rechecker put together, but now pondering the accelerated part. Getting the start/end markers into the automaton is easy...but it's not clear how to ensure those are also part of the indexed trigrams. The ngram tokenizer is of course completely independent of the regexp query, i suspect we simply have to have an option that turns on start/end anchors,
[17:07:52] and document that for it to work you need a specific pattern_replace char filter that runs prior to the ngram tokenization
[17:09:14] (also i need tests, this only works in my head so far :P)
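Aside: a rough sketch of the boundary-marker idea described above. The marker character, class, and helper names here are hypothetical, and this is plain string manipulation rather than the actual pattern_replace char filter / ngram tokenizer configuration; it only illustrates why wrapping a value in markers before trigram extraction lets an anchored prefix be accelerated by a distinct trigram.

```java
import java.util.ArrayList;
import java.util.List;

public class AnchoredTrigramsSketch {
    // Hypothetical boundary marker; the real char filter would need a
    // character guaranteed not to appear in titles.
    private static final char BOUNDARY = '\u0002';

    /** Trigrams of the value with start/end markers applied. */
    static List<String> trigrams(String value) {
        String marked = BOUNDARY + value + BOUNDARY;
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= marked.length(); i++) {
            grams.add(marked.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        // For the title "abc" this yields [␂ab, abc, bc␂]: /^ab.../ can now
        // demand the trigram ␂ab from the index, while unanchored /ab/ still
        // matches via the plain trigrams.
        System.out.println(trigrams("abc"));
    }
}
```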
[17:10:16] migrating the masters in CODFW now, should be done today or tomorrow barring any setbacks
[17:13:06] * inflatador needs to start preparing patches for eqiad
[17:13:15] nice!
[17:15:50] wondering how the pattern_replace char filter will handle array strings like redirects
[17:16:05] hmm, that does seem worth testing
[17:19:57] quasi-related, can't open kibana on relforge (was going to use the dev tools console). Suspect, without verifying yet, that it doesn't like opensearch and wants us to switch to opensearch-dashboards
[17:20:30] haven't tried it since we migrated
[17:22:12] yea, it's being a pain: This version of Kibana (v7.10.2) is incompatible with the following Elasticsearch nodes in your cluster: v1.3.20 @...
[17:22:43] i'm sure there is nothing about the dev tools console that requires compatibility, it's literally a UI for writing manual REST requests :P
[17:23:47] yeah, that sounds about right ;(
[17:24:27] i probably don't need it, i can probably stand up the opensearch-dashboards docker container and point it at our dev container
[17:24:43] or i can just do vim and curl, but sometimes the UI is nice :P
[17:31:53] LMK if you'd like us to install it on relforge or a VM or something...we have beaucoup RAM so that shouldn't be an issue
[17:35:26] inflatador: it would probably be nice to replace kibana with opensearch-dashboards there, it's useful for testing things
[17:36:02] i'm setting it up locally, but it's being tedious. Because we disable the security plugin in our dev image i have to make a custom opensearch-dashboards image that removes the security plugin as well
[17:36:23] or i guess i could use the upstream opensearch image, but i wanted to later run the custom plugin for regex in it
[17:36:47] ebernhardson ACK, feel free to start a task and ping us. It's pretty straightforward, although I have seen some issues with it re-enabling the security stuff after you upgrade the package
[17:37:16] all in all a pretty minor annoyance
[17:38:24] dinner
[17:38:35] sure, i'll make a ticket
[18:09:44] after far too long...can verify pattern_replace works "as expected" to add the markers to arrays
[18:16:03] but...it plays oddly with the NGramAutomaton which also analyzes the ngrams :S
[18:19:16] Going to have to ponder this, I don't really understand why we analyze the individual ngram pieces
[18:21:50] oh, silly me..i just have to define separate search-time and index-time analyzers
[18:24:21] lunch, back in ~40
[19:05:11] * ebernhardson wonders if it matters that the replaced anchors leak into errors like TooComplexToDeterminize...could pass the un-mangled source but doesn't seem super important
[19:18:12] back
[19:29:30] * ebernhardson is mildly surprised to see my tests are working, including with the accelerated trigram bits...although i'm mildly suspicious. More specific testing required
[19:33:16] but if i put two docs in the index with ["abc", "def"] in one and ["abc", "xyz"] in the other, ^abc profiles with next_doc_count=2, and ^def with next_doc_count=1
[19:33:36] on the accelerated bit of the query
[21:02:43] inflatador: few mins late
[21:03:28] ryankemper np, just joined myself
[21:19:31] * ebernhardson doesn't understand why intellij keeps finding duplicate classes in target/* folders, when target is marked as an excluded directory :S
[21:46:56] (╯°□°)╯︵ ┻━┻
[22:14:21] ryankemper forgot to add the net-new eqiad hosts in our patch (ref T391680), if you have time today feel free to make a patch and hit me up for review
[22:14:22] T391680: Bring elastic112[345] into production - https://phabricator.wikimedia.org/T391680
[22:23:12] inflatador: I'll take a look if I can figure out the wdqs ui deployment charts stuff that I'm looking at now
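Aside: a toy recreation of the 19:33 doc-count check above, under the same assumptions as the earlier sketch (a hypothetical boundary marker applied to each array element, as the pattern_replace step would). It is not the real query path, only the set arithmetic the accelerated trigram filter relies on, and the expected counts match the profiled next_doc_count values.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnchoredArrayCheck {
    private static final char B = '\u0002'; // hypothetical boundary marker, as in the earlier sketch

    /** Union of marker-wrapped trigrams over all array elements of a field. */
    static Set<String> trigrams(List<String> values) {
        Set<String> grams = new HashSet<>();
        for (String value : values) {
            String marked = B + value + B; // marker applied to each element separately
            for (int i = 0; i + 3 <= marked.length(); i++) {
                grams.add(marked.substring(i, i + 3));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // The two test documents from the 19:33 check above.
        Set<String> doc1 = trigrams(List.of("abc", "def"));
        Set<String> doc2 = trigrams(List.of("abc", "xyz"));

        // /^abc/ implies the anchored trigram ␂ab: present in both docs
        // (next_doc_count=2); /^def/ implies ␂de: only in doc1 (next_doc_count=1).
        System.out.println(doc1.contains(B + "ab") && doc2.contains(B + "ab")); // true
        System.out.println(doc1.contains(B + "de")); // true
        System.out.println(doc2.contains(B + "de")); // false
    }
}
```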