[06:59:48] o/
[07:49:17] o/ - I am not doing so well today, I’ll take a day off, maybe I’ll try something later. dcausse: so now the flink operator is up to date (T398162), it should be safe to continue with the flink app update, right?
[07:49:18] T398162: Flink: Update k8s operator to 1.12.0 - https://phabricator.wikimedia.org/T398162
[07:50:28] pfischer: hey, sad to hear this, take care! re flink-op yes we should be good to go I think I +1ed you patch IIRC
[10:47:09] lunch
[13:01:04] o/
[13:29:53] \o
[13:34:05] o/
[13:42:45] "srwiki New index seems too small compared to the previous index 1220386/0 > 0.1 (old/new > threshold). Aborting. (Use --force to bypass)" :/
[13:44:03] zero docs ?
[13:44:37] yes... this is strange...
[13:45:26] yea i have no ideas :S
[13:46:38] separately cindy keeps failing the test to 'go to random page, type `main p` into search bar, find `Main Page`. But if i ssh/port forward into cindys instance and do the same...its fine
[13:47:03] :/
[13:47:28] could be ui change that broke the browser test?
[13:50:19] yea possibly, i'm currently looking at that and realizing i have no clue where "cdx-search-result-title" comes from
[13:50:45] but a `git log -Ssearch-result-title` isn't finding it...but that does seem like a plausible rason
[13:50:48] *reason
[13:51:34] yes I found those hard to correlate when inspecting the webpage and looking through mw code :/
[13:52:03] sigh getCount() in Elastica is rather suspicious...: return (int) ($data['hits']['total']['value'] ?? 0);
[13:52:11] heh
[13:52:41] I guess it might silently ignore failures :/
[13:59:07] indeed
[14:03:19] separately, what do you think of escaping in regexes? I suppose the two critical questions: 1) Do we need to deal with unknown escapes, or just pass them through literally? like insource:/\y/. 2) Do we care that escapes can turn into regex syntax? For example insource:/\u002e/ (2e=.) could match anything, or it could match a literal period. But to do that we need to turn \u002e into
[14:03:22] "." to create a literal, but not if inside a character class
[14:04:01] and then i guess...i was tempted to put this in cirrus because easy, but it's probably more proper in wmf-jvm-utils
[14:07:21] so now we pass the escape directly to the query? \y is treated by the lucene parser which will interpret that as y right?
[14:08:13] i think lucene treats the \ as literals
[14:08:26] it doesn't do escape sequences at all, so \n searches for the literal sequence \ and n
[14:08:41] but we can pre-expand them and then it searches just run
[14:08:44] s/just run/just fine/
[14:09:28] weird... how do you escape regex syntax with lucene then?
[14:09:30] but i guess the question is, can we be lazy? We can do a very simple string replace \n => expanded \n, \t => expanded \t, \uNNNN -> codepoint NNNN
[14:09:39] dcausse: lucene does literals wrapped in quotes
[14:09:45] so "." searches for a literal dot
[14:09:57] but + for instance?
[14:09:59] \+
[14:10:04] hmm, not sure
[14:11:01] hmm, it does something, because `insource:/\+/` does find a page containing only +
[14:11:08] ok, so more investigation needed :P
[14:11:12] :)
[14:11:40] \y finds y so I it treats escape
[14:12:08] yea sounds like it's treating them as literals
[14:12:26] which I guess is good, that'd be less of a breaking change
[14:12:44] but that would imply if we only replace \r \n \t and \uNNNN it might not need a full parse
[14:13:03] although i probably have to check if they are inside quoted literals or themselves escaped
[15:00:18] dcausse Trey314159 ebernhardson are we good with cancelling retro?
[17:21:56] huh, apparently in java \uNNNN escapes are replaced pre-compilation, while \n is replaced later. This means you can't use \u000A (== \n) in a string
[17:24:28] "\u000A" == "\n" or even "\u000A".equals("\n") == false?
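Editor's sketch of the "lazy" pre-expansion floated at 14:09:30 and 14:12:44: replace only \n, \r, \t and \uNNNN, and pass every other escape through untouched. This is a minimal Python illustration, not the code that went into cirrus or wmf-jvm-utils. The names `expand_escapes`, `_SIMPLE` and `_ESCAPE` are hypothetical. It handles the "themselves escaped" case from 14:13:03 by consuming each backslash together with the character after it, so an escaped backslash followed by `n` is left alone. It does not look inside quoted literals or character classes, and it inserts the expanded character raw, so an expanded \u002e would still act as regex syntax.

```python
import re

# Escapes expanded here; anything else is passed through unchanged.
_SIMPLE = {"n": "\n", "r": "\r", "t": "\t"}

# A backslash followed by either uNNNN or any single character.
# Scanning left to right means "\\" is consumed as one unit, so the
# "n" in "\\n" (escaped backslash, then n) is never seen as an escape.
_ESCAPE = re.compile(r"\\(?:u([0-9a-fA-F]{4})|(.))", re.DOTALL)


def expand_escapes(pattern: str) -> str:
    """Expand \\n, \\r, \\t and \\uNNNN; leave all other escapes as-is."""

    def repl(match: re.Match) -> str:
        hex_digits, other = match.group(1), match.group(2)
        if hex_digits is not None:
            # \uNNNN -> the code point NNNN, inserted raw (assumption:
            # no attempt to keep it a literal in regex terms).
            return chr(int(hex_digits, 16))
        if other in _SIMPLE:
            return _SIMPLE[other]
        # Unknown escapes such as \y or \+, and escaped backslashes,
        # are returned exactly as written.
        return match.group(0)

    return _ESCAPE.sub(repl, pattern)
```

Because unknown escapes come back unchanged, whatever the regex engine does with `\y` or `\+` today stays the same, which matches the "less of a breaking change" point at 14:12:26.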
[17:26:44] dcausse: yup, since it happens pre-compilation the compiler sees a string on two lines
[17:27:19] i was just using it in a test case, easy enough to use \n instead, but is an interesting edge case of java
[17:27:58] https://stackoverflow.com/questions/3866187/why-cant-i-use-u000d-and-u000a-as-cr-and-lf-in-java
[17:32:37] oh interesting
[17:35:19] works with java15 multiline strings :)
[17:35:27] lol, i suppose yea
[17:35:35] intellij complains that you could use the more traditionnal \n :)
[18:01:25] dinner
[18:04:32] If anyone is interested in learning more related to the discussion about how everything turns into a Gaussian if you add enough of them together, there's a 3B1B video on the Central Limit Theorem here: https://www.youtube.com/watch?v=zeJD6dqJ5lo — though that exact result does only apply to repeating the same distribution.
[18:04:49] For the super nerdy you can generalize the CLT to adding different distributions using the Lindeberg CLT ( https://en.wikipedia.org/wiki/Lindeberg%27s_condition )—though I have not yet gotten an intuitive grasp of the extra condition. And, of course, just because something is not *proven* to converge to a Gaussian doesn't mean it won't.
[18:06:29] Just checking, does the name of the weighted_tags stream need updated in docs here? https://wikitech.wikimedia.org/wiki/Search/WeightedTags#CirrusSearch_Update_Pipeline
[18:08:03] ottomata: hmm, yea the producer points at mediawiki.cirrussearch.page_weighted_tags_change.v1
[18:11:22] updated the links, not 100% sure if the example needs an update
[18:11:27] i don't think so, but not 100%
[18:16:05] thank you!
[18:17:32] separately, looking at the retries from the current reindex run, they all timeout waiting for green on a new index
[18:17:58] not sure if we need to increase the timeout, or wonder why they aren't reaching green fast enough (how fast?)
[18:19:40] heh, curiously the reindexer waits for green forever (after it applies the settings change to the new index that should already be green)
[18:20:26] apparently we wait 2 minutes, i guess that can increase. Although 2 minutes feels like it should be enough
[18:20:49] lunch, back in ~40
[18:28:28] hmm, silly question...what do we do with half a surrogate pair? Tempted to naively expand the half pair and get what it gets
[18:37:29] * ebernhardson mutters at java which wont allow me to type a literal string with invalid surrogate pairs.
[18:47:19] Ugh.. I don't remember all the details, but I had some extra complexity in the most recent plugin to deal with surrogate pairs. IIRC, If I got a lower surrogate I'd look for an upper surrogate following and treat them as a unit (for checking case, counting characters, etc.) and if I hit an unmatched upper or lower surrogate I would just let it ride—not my circus, not my monkeys.
[18:48:57] i suppose part of the question is, could there be a valid use case to search for broken surrogate pairs? I suspect they don't actually get into the search index (we had problems with that in the past)
[18:51:58] yea, thats ParserOutputPageProperties::fixAndFlagInvalidUTF8InSource
[19:05:41] back
[19:06:50] I cannot trick Firefox and/or Mediawiki to allow me to search for unmatched low surrogates. I can enter high surrogates, but they get converted to �. If fixAndFlagInvalidUTF8InSource prevents them getting into the index, then I guess it's okay...
[19:06:55] ...but I can imagine weird stuff showing up in PDFs or other files uploaded to Commons, for example.. so I could imagine wanting to search for broken pairs... but it seems very difficult to even make the request, much less worry about it being handled correctly.
[19:09:31] well, with \u encoding they get to skip having to make it work all the way through, but yea i think i'm going to ignore bad surrogate pairs. They wont match anything since we can't index invalid utf8 anyways
[19:10:49] thankfully the regex engine doesn't care, because i'm injecting a `\` before each \u expansion to get the regex engine to treat it as a literal (fixes expansions treated as regex syntax)
[20:40:30] * ebernhardson notes that opensearch does accept `"auto_expand_replicas": "2-2"`, but maybe not the best idea :P
[20:41:40] or maybe cirrus should just construct that from the constant...having to choose between the two values in different places (reindexer, index creator, replica range validator, etc)
[21:06:11] ryankemper do you have anything for pairing? I'm trying to make https://grafana.wikimedia.org/goto/muqNR9uNg?orgId=1 look slightly less terrible ;)
[21:08:05] inflatador: brt
[21:11:30] https://phabricator.wikimedia.org/T392222
[21:11:53] https://grafana.wikimedia.org/goto/rSPaRruNg?orgId=1 ryankemper does this link work?
[21:12:03] yes!
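Editor's sketch for the half-surrogate thread (18:28:28 to 19:09:31). It illustrates two points from the log: how an unmatched surrogate can be spotted in a run of \uNNNN code units, and why such a unit "won't match anything" if the index only holds valid UTF-8. This is a Python illustration under stated assumptions, not the plugin or cirrus code. The names `unpaired_surrogates` and `survives_utf8` are hypothetical, and the input is modelled as a list of UTF-16 code units, which is how Java sees the expanded escapes.

```python
def unpaired_surrogates(units: list[int]) -> list[int]:
    """Return the indexes of UTF-16 code units that are lone surrogates."""
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High (lead) surrogate: valid only if a low one follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            bad.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low (trail) surrogate that was not consumed by a lead.
            bad.append(i)
        i += 1
    return bad


def survives_utf8(text: str) -> bool:
    """True if the text can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Lone surrogates are not encodable in well-formed UTF-8.
        return False
```

A lone `\ud83d` is flagged by `unpaired_surrogates` and fails `survives_utf8`, while the full pair `\ud83d\ude00` passes both. That is consistent with the decision at 19:09:31 to ignore bad pairs rather than reject them.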