[08:31:15] Trey314159: discussed with Effie regarding sigma and apparently it's quite annoying to complete words if a user types ς at the end, so perhaps we should fine-tune the icu_normalizer (only for completion) for Greek and exclude ς & σ, possibly letting them be treated by icu_folding
[09:48:52] dcausse: regarding the spark-kafka-sink: we still have your initial idea of a separate page_weighted_tags_change topic for sudden large volumes. As a slight variation of this we could use explicit partition assignment, too: all revision-based changes go into partition 0, the rest is distributed among partitions 1-4. This way we could set up two consumers in flink, one that feeds deduplication (partition 0) and one that
[09:48:52] consumes partitions 1-4 and bypasses deduplication.
[09:54:50] pfischer: yes, this idea needs to be discussed a bit if we want to pursue it, haven't thought too much about leveraging partitions for this tho
[09:56:01] but it might be orthogonal to rate-limiting, we could ask Andrew if that's necessary or just a way for us to spread the updates over a longer period
[09:57:14] a separate partition/topic for bulk updates would also require a separate topic/partition in the stream that's between the cirrus-streaming-producer and consumer
[10:02:56] lunch
[11:25:34] dcausse: aaah you're right - I misread what the user was asking for. Thank you!
[13:15:20] o/
[14:20:31] dcausse: That's fair for sigma, but I'm a little worried about mobile keyboards. Looks like Android has a separate ς key but iPhone dropped it and just automatically converts it from σ. So in the middle of a word as you type, σ comes out as ς until you type another letter.
[14:20:33] It'll still match σ, but only as a typo, so any other letter will match, too. That'll give some unexpected suggestions as you type. OTOH, it looks like Greece is not a big iPhone market. I'm not sure if any of this will solve Spiros's problem, as LSJ seems to be using the English analysis chain.
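(The icu_normalizer exclusion idea above can be expressed through the ICU plugin's `unicode_set_filter` parameter, which limits which characters the normalizer touches. Below is a minimal sketch of such settings built in Python; the analyzer/char_filter names and the choice of `nfkc_cf` are assumptions for illustration, not the real completion chain.)

```python
# Sketch of completion-analyzer settings that keep ς/σ out of icu_normalizer.
# Assumptions: the analysis-icu plugin is installed, and "[^ςσ]" is the
# UnicodeSet we want -- normalize everything except the two sigmas, then
# let icu_folding deal with them. All names here are hypothetical.

def greek_completion_settings() -> dict:
    """Build index settings where the icu_normalizer char filter skips ς/σ."""
    return {
        "analysis": {
            "char_filter": {
                "greek_icu_normalizer": {
                    "type": "icu_normalizer",
                    "name": "nfkc_cf",
                    # Skip both sigmas so a user-typed final ς mid-word
                    # is still handled by icu_folding, not lowercased away.
                    "unicode_set_filter": "[^ςσ]",
                }
            },
            "analyzer": {
                "greek_completion": {
                    "type": "custom",
                    "char_filter": ["greek_icu_normalizer"],
                    "tokenizer": "standard",
                    # icu_folding picks up the characters excluded above.
                    "filter": ["icu_folding"],
                }
            },
        }
    }


if __name__ == "__main__":
    import json
    print(json.dumps(greek_completion_settings(), indent=2, ensure_ascii=False))
```

(The dict would go into the index `settings` body at creation time; whether completion should use `nfkc_cf` or a different normalization form is exactly the kind of thing the phab ticket would need to pin down.)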
[14:20:59] I'll open a phab ticket
[14:21:41] Trey314159: sure, makes sense to follow up with a phab ticket, thanks
[14:24:19] one thing I noticed is that google autocompletes "ανθρωπο" with "ἄνθρωποσ"... seems weird
[14:26:01] seems like google is normalizing all ς to σ in their query completion (even on www.google.gr)
[14:30:41] ah, not if I set my browser language to Greek
[14:53:03] hm... there's a stalled backfill job in codfw
[14:53:28] last log msg says {"@timestamp":"2024-10-15T10:34:56.354Z","log.level": "INFO","message":"Application completed SUCCESSFULLY"
[14:54:00] the flinkdeployment status is: flink-app-consumer-search-backfill RECONCILING STABLE
[14:54:26] seems like the operator is struggling to understand the job status?
[14:56:33] "not if I set my browser language to greek" ... that's interesting!
[14:59:31] was for wikiids: mnwiki
[14:59:53] Trey314159: did you notice something weird when you re-indexed? something never ending?
[15:14:15] dcausse: no, I didn't notice anything like that. All my command line commands finished okay. I didn't look at all of the underlying jobs, though.
[15:23:56] I see `CirrusSearchSaneitizerFixRateTooHigh` alert, is this related?
[15:24:44] inflatador: no... I don't think so, but the sanitizer is not looking good on cloudelastic :/
[15:24:45] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic
[15:24:53] hmm, that is more like a thing to create tickets for rather than an alert to "fix now"
[15:25:30] true
[15:28:26] perhaps it's worth spending some time investigating the logs we have for failed updates to understand what's going on.
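(Going back to this morning's spark-kafka-sink thread: the explicit partition assignment variation is easy to sketch as a routing function. Everything below is hypothetical — the partition count, the "revision" change kind, and keying the bypass partitions by page id are assumptions, not the real stream config.)

```python
# Sketch of the explicit-partition idea: revision-based changes all land in
# partition 0 (the deduplicated path), everything else is spread across
# partitions 1-4 and bypasses deduplication. Names/counts are made up.

NUM_PARTITIONS = 5  # partition 0 + four bypass partitions


def choose_partition(change_kind: str, page_id: int) -> int:
    """Route a page_weighted_tags_change event to a Kafka partition by kind."""
    if change_kind == "revision":
        return 0  # fed into the deduplicating Flink consumer
    # Everything else bypasses deduplication; key by page so updates for
    # the same page stay ordered within one partition.
    return 1 + (page_id % (NUM_PARTITIONS - 1))


if __name__ == "__main__":
    print(choose_partition("revision", 1234))    # → 0
    print(choose_partition("weighted_tags", 7))  # → 4
```

(In practice this would be a custom partitioner on the producer side, with the two Flink consumers each given an explicit partition set.)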
With the big spikes I wonder if that's a problem with the checker api and not the actual docs
[15:28:30] seems like it's mainly oldVersionInIndex
[15:29:41] maybe we need to log the exact page identifier with the errors, to give a way to cross-check the fetch-failed logs
[15:29:44] we're talking about alerts in retro now... should I make a change to the saneitizer alerts so they create a phab task? It might already be that way, I just saw it in our alerts channel
[15:30:14] yea, that alert is much more on the phab task side of things, it's not something you should or could fix in an hour
[15:30:27] it's more like a systemic issue
[15:33:06] sigh.. 20min and counting, the wdqs updater build is still downloading the flink image from archive.apache.org...
[15:34:41] Should we pitch in and get them a CDN? ;P
[15:35:50] hmm, for cirrus-config-dump, should we go in and prune connection username/password, or simply exclude CirrusSearchClusters and CirrusSearchServers from the output?
[15:35:55] they have one, but only for recent releases, I guess a very effective way to incite people to upgrade their stack :)
[15:37:15] unsure if CirrusSearchClusters and CirrusSearchServers are useful to be viewable from the config, so perhaps we can simply drop them?
[15:37:46] dcausse: i was thinking the same, i've had some use cases in the past for python scripts to connect to the cirrus elasticsearch instances, but i think i primarily did that through ExpectedIndices.php and never the config dump directly
[15:39:05] yes... if we need to do that I guess we'll have to source the connection info from a secure place if user/pass are required anyway
[15:39:48] i'm already thinking how annoying this will be, my curl test line is up to `curl -k -u admin:admin -H'Content-Type: application/json' ....`
[15:40:16] maybe we need a little wrapper script for making manual elasticsearch calls
[15:40:36] or leave the password in bash history :P
[15:41:23] :)
[15:42:47] that sanitizer graph is certainly odd...
all the lines go above 100000 but only oldVersionInIndex is actually above that value...
[15:43:50] dcausse: the lines are stacked, so the top line is the sum
[15:44:04] i thought it was easier to see if they weren't all crossing each other
[15:44:05] ah
[15:44:17] * dcausse feels stupid
[15:44:46] there is indeed no hint that it's stacked, other than the numbers and the lines not adding up
[15:44:56] might be nice if they had a slightly different rendering or something
[15:45:34] yes, the other annoying part is that some lines get overridden by others depending on the ordering
[15:46:57] i can't decide what to do with these gradle plugins... spent like 3 hours yesterday and it's going nowhere. Next is either to try and get a mostly empty "example" plugin compiling with gradle and then migrate that gradle config over, or simply do it in maven like our other plugins
[15:47:14] as much as i don't know maven, i know less gradle :P
[15:47:58] dcausse: also curious what you think we should do about hebmorph, it looks like they stopped updating in general. We need to release hebmorph for lucene 8.10.1, but it being AGPL makes me uncertain if we can just do that?
[15:48:52] ebernhardson: re gradle I've also spent quite some time debugging them, sometimes it was just bugs in the elastic build plugins...
[15:49:41] yea, i've been suspicious of the elastic build plugin, most of my errors are around pulling in some project from netflix called nebula
[15:49:46] which comes from there
[15:49:49] for hebmorph I'm not sure... I did some small version bumps in the past and did not bother too much, I'm quite unfamiliar with agpl, does it prevent you from making changes?
[15:50:48] dcausse: it doesn't prevent it, but if we modify the code we have to provide the code, and there are some requirements around how much we have to communicate that to the users
[15:51:17] i guess i didn't look close enough, but my impression was the 7.10.2 plugin was only updating the plugin, but not the hebmorph library it pulls in
[15:51:31] I did change hebmorph already
[15:51:35] ahh, ok
[15:51:48] https://gitlab.wikimedia.org/repos/search-platform/HebMorph
[15:52:00] i guess we just continue that then, should be fine
[15:53:14] well.. what I did was very minimal: https://gitlab.wikimedia.org/repos/search-platform/HebMorph/-/commit/e2bf372f8dd20709026b44f6cbe0c160a26fef77
[15:54:34] re gradle, IIRC there are ways to disable some of the build steps it does (license checks), I remember Doug or myself disabling some on ltr
[15:54:56] hmm, i suppose i haven't tried turning various things off yet. Will poke the build plugin and see what options we have
[16:13:06] so close: 42 passed, 1 failed, 1 skipped, 44 total (100% completed) in 00:12:40
[16:23:57] hmm, very suspicious: "Cannot search on field [labels.en] since it is not indexed."
[16:24:17] we don't disable indexing there, and it should default to true...
[16:24:30] * ebernhardson hopes this isn't some "too many fields" protection
[17:16:04] dinner
[17:16:33] hmm, maybe the better question is how does that work in the normal cindy :P It turns out wikibase mediainfo doesn't pass stemming info along to LabelsProviderFieldDefinitions, so it creates all languages as un-indexed
[19:00:04] * ebernhardson wonders what we did in the past for keeping a branch up to date with master, i guess merge commits?
[19:49:19] sus: 43 passed, 1 skipped, 44 total (100% completed) in 00:12:28
[19:52:47] not actually sure what the 1 skipped is about :S but the regular voting runs have the same output.
[19:55:47] q
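(For "Cannot search on field [labels.en] since it is not indexed"-style errors, a quick diagnostic is to walk the index mapping and list every field explicitly created with "index": false. The sketch below does that over a mapping dict; the sample mapping is invented — in practice you would feed it the `properties` part of a GET <index>/_mapping response.)

```python
# Hypothetical diagnostic: recursively collect field paths in an
# Elasticsearch mapping whose definition sets "index": false, i.e. fields
# that exist but cannot be searched (index defaults to true when absent).

def unindexed_fields(properties: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of fields mapped with index: false."""
    found = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("index") is False:
            found.append(path)
        if "properties" in spec:  # recurse into object fields
            found.extend(unindexed_fields(spec["properties"], path + "."))
    return found


if __name__ == "__main__":
    # Invented sample roughly matching the labels.en symptom above.
    sample = {
        "labels": {
            "properties": {
                "en": {"type": "text", "index": False},
                "de": {"type": "text"},
            }
        }
    }
    print(unindexed_fields(sample))  # → ['labels.en']
```

(Running this against the index created by the wikibase mediainfo path described above would presumably show every language under labels as un-indexed, confirming the missing stemming info rather than a "too many fields" protection.)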