[07:17:04] errand
[09:10:47] pfischer: the regex thing Erik's working on applies to our custom highlighter
[09:11:44] yes it's to support a new feature to allow matching anchors and make sure the custom highlighter applies the same strategy
[09:14:48] dcausse: thanks!
[09:16:05] I doubt opensearch would accept adding regex support to one of the standard highlighters; if we manage to add everything else we need to opensearch we could possibly get rid of our custom highlighter, and the regex support could possibly be added to a new lean & regex focused highlighter in the extra plugin (where the regex query is written)
[09:20:55] pfischer: do you have access to our sonarcloud account? if yes, is the failure on https://gerrit.wikimedia.org/r/c/wmf-jvm-utils/+/1140518 something you would know how to fix?
[09:21:13] dcausse: looking
[09:21:16] thanks!
[09:37:06] dcausse: The project does indeed not exist among the sonarcloud projects. I thought I had sonar cloud access but obviously I do not. So I pinged pwangai. Let’s see who else might be online now from Testing…
[09:37:27] thanks!
[09:50:12] dcausse: Testing is on it. While putting together the weekly report, I came across a few tickets tagged with WM-1.44 (2025-02-29). One of which is T389053 (rename ORES weighted tags) - What options do I have to see if that change is picked up by the reindex maintenance job?
[09:50:13] T389053: Rename weighted_tags referencing ores in their names - https://phabricator.wikimedia.org/T389053
[09:50:50] pfischer: looking
[09:52:57] pfischer: I think first you have a couple patches to deploy: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1135019 & https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1135010
[09:53:52] once deployed we'll have to wait for the next re-index to happen, we might want to coordinate with Trey for that and try to bundle the reindex with something he has worked on (sudachi for instance)
[09:54:20] for context a reindex is not automated but run by us
[09:54:21] Okay, so reindex would happen manually anyways.
[09:54:26] yes
[09:55:42] and we won't run a full reindex during the opensearch migration so I suspect this'll have to wait for eqiad to fully run opensearch
[09:57:43] Alright. I’ll merge the helm-CR, but that should be independent of the CirrusSearch code changes.
[09:58:42] pfischer: yes I think so, we'll have some mixed names but that's fine, they'll get cleaned after the mw-config patch & the reindex
[10:02:59] lunch
[10:07:29] dcausse: sonar cloud is running again for the jvm-utils. Apparently it got auto-pruned from sonarcloud due to inactivity, sth. that can’t easily be worked around.
[12:21:34] pfischer: np, thanks for raising this!
[12:53:25] o/
[13:59:07] \o
[14:01:45] o/
[14:12:43] o/
[14:32:43] picking up my son, back in ~30
[14:49:42] back
[15:02:02] dcausse: pondering if there should be an "extended syntax" flag for the highlighter? In the query side it seems reasonable to auto-detect the field wrapping, but i wonder if that's a bit more magical at the highlighter side. Also you can't really auto-detect if we expanded char classes like \w
[15:02:37] or, lazy way that seems inappropriate but probably works, the highlighter can use java patterns instead of lucene for highlighting.
Since we are aiming for a subset that might "just work"
[15:03:15] but it seems better for highlighting to be exact, and not probably the same
[15:04:07] ebernhardson: definitely I don't think you can infer that automatically from the highlighter, the "options" array should allow us to pass something without breaking binary transport format
[15:04:41] so I'd be for passing a flag there and re-use the lucene engine if possible
[15:04:43] oh excellent, i hadn't noticed that
[15:20:15] inflatador: hmm, the systemd errors for opensearch-disable-readahead.sh are that the wrong # of arguments are provided, but checking puppet it has all three :S
[15:21:06] but it even has a trailing space before the third argument, suggesting base_data_dir was not set?
[15:22:47] ebernhardson yeah, I reached the same conclusion and noted it in the ticket, but haven't had time to follow up
[15:22:55] i'll have a patch momentarily
[15:23:02] excellent! Ping me when it's ready
[15:24:06] inflatador: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140741 is probably what it needs
[15:26:58] dcausse: for CharSequence, are you thinking to change only the inputs, or also the return values? I'm not too familiar with CharSequence, just see it's an interface String implements
[15:27:55] ebernhardson: yes I believe all the logic can work with CharSequence and it could be up to the caller to decide to materialize the string or not
[15:28:21] not sure if that's going to be helpful but I know some lucene apis do return CharSequence to avoid materializing the string
[15:29:01] ok that seems reasonable
[15:30:34] also minor change, but i think i'm going to change the anchors to use unicode non-characters, \uFDD0 and \uFDD1.
Probably doesn't matter, but the ones i was using are "ok for interchange", while these are supposed to stay strictly internal to an application
[15:31:43] docs also say if we are going to use non-characters we should replace them coming in with \uFFFD, but not sure it's necessary
[15:32:48] makes sense, seems like a good fit for this range
[15:34:21] unsure about replacing input occurrences with uFFFD, perhaps we should do it? but what happens if they're part of the title? hopefully that's not possible?
[15:34:59] just merged puppet patch above and ran puppet on `cirrussearch2100`. That indeed fixed the problem, thanks ebernhardson !
[15:36:08] I can check the dumps for those chars, but indeed unlike the PUA \uE000 which has use cases for some things (which trey found in the wikis), the non-characters should have no use
[15:48:51] hmm, except changing the anchors makes the regexp not match :P Will have to poke and see if lucene is doing something fancy there
[15:49:26] :)
[15:52:40] i guess i never looked...but apparently determinizing and minimizing the automaton is a pretty fancy (=complicated) algo
[15:54:08] small CR to add the new relforge host if anyone has a chance to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140749
[15:54:20] yes... there's some utf16 conversion IIRC
[15:55:25] maybe though i just mucked up changing the CharSequence code, i should have just done one at a time... checkout and repeat :)
[15:59:48] :/
[16:02:21] workout, back in ~40
[16:06:33] yea it was just me doing something wonky, writing and running the tests every few changes and it all works fine
[16:41:39] heading out, have a nice week-end
[16:43:06] Typographic Nerd Fun Fact: homoglyphs show up on-wiki in multiplication/dimensions, such as 2Χ2 (Greek, elwiki), 2Х2 / 2х2 (Cyrillic), 2X2 / 2x2 (Latin), and, depending on your fonts, 2×2 (multiplication sign).
[17:02:27] just restarted blazegraph on wdqs1013.
I depooled it, lag alerts should clear shortly
[17:03:28] Trey314159 nice, I've been meaning to do some homoglyph mischief ;P
[17:52:01] ryankemper if you're around and you have the cycles, may want to work on a patch to remove the old codfw masters. Then we can decom some of the older CODFW hosts T390901
[17:52:02] T390901: Decom elastic2055-2060 - https://phabricator.wikimedia.org/T390901
[18:25:21] lunch, back in ~40
[18:41:13] inflatador: Нοмοglурh Μιѕϲніeϝ might have to be the name of my new band!
[19:13:22] meh...spotbugs wants me to use Function.identity() instead of s -> s, for a UnaryOperator no-op, but i can't find a way where that actually works :S
[19:15:00] sorry, been back awhile
[19:23:44] * ebernhardson was not paying attention...there is a UnaryOperator.identity()
[19:35:45] inflatador: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140778 here's the patch to remove the old elastic hosts
[20:51:00] ryankemper thanks...good catch on that missing chi master BTW, I noticed it but forgot to do anything about it
[20:53:24] this highlighter is just escaping me...pretty sure what it's doing is returning positions to highlight, but it's not clear what the appropriate way to know when to adjust those positions is. Especially because it has modes to highlight forwards or backwards (reversed string)
[20:54:37] i was hoping for a much simpler answer of stripping the anchors post-transform
[20:57:05] just merged ^^, running puppet on cirrussearch2100 now
[22:21:20] well..finally have something that appears to be working. But now to figure out how to write the test cases for all the extra stuff this does, like verifying position gaps with multi-valued fields, multi-fragment, case-insensitive, etc. :S
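
Note on the anchor discussion above (15:30 to 15:36 and 20:53 to 20:54): the idea of wrapping text in the non-characters \uFDD0 and \uFDD1, replacing any incoming occurrences with \uFFFD, and shifting match positions back to the unwrapped text can be sketched roughly as follows. This is only an illustration under assumptions, not the code from the patch: the class and method names (AnchorSketch, sanitize, wrap, toSourceOffsets) are hypothetical, java.util.regex stands in for the Lucene engine that the discussion prefers, and it ignores the reversed-string, multi-valued and multi-fragment cases mentioned in the log.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from the actual highlighter.
final class AnchorSketch {
    static final char START = '\uFDD0';       // non-character used as start anchor
    static final char END = '\uFDD1';         // non-character used as end anchor
    static final char REPLACEMENT = '\uFFFD'; // replacement character

    // Replace anchor characters already present in the input so they
    // cannot be mistaken for the anchors added by wrap().
    static String sanitize(CharSequence input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            sb.append(c == START || c == END ? REPLACEMENT : c);
        }
        return sb.toString();
    }

    // Surround the sanitized text with the start and end anchors.
    static String wrap(CharSequence input) {
        return START + sanitize(input) + END;
    }

    // Map a [start, end) match in the wrapped text back to the source text:
    // shift by one for the leading anchor and clamp so that matched anchor
    // characters are excluded from the highlighted range.
    static int[] toSourceOffsets(int start, int end, int sourceLength) {
        int s = Math.max(0, start - 1);
        int e = Math.min(sourceLength, end - 1);
        return new int[] {s, Math.max(s, e)};
    }

    public static void main(String[] args) {
        String title = "foo bar";
        String wrapped = wrap(title);
        // "^foo" written by hand against the start anchor character.
        Matcher m = Pattern.compile(START + "foo").matcher(wrapped);
        if (m.find()) {
            int[] hit = toSourceOffsets(m.start(), m.end(), title.length());
            System.out.println(title.substring(hit[0], hit[1])); // prints "foo"
        }
    }
}
```

The clamping in toSourceOffsets is the "strip the anchors post-transform" step in its simplest form; a real highlighter would also need to apply it per value and per direction.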