[12:49:09] \o
[12:50:07] pfischer: hmm, kerberos seems odd. I think with the airflow-devenv it's supposed to pick up your ticket, so something else is happening but i'm not sure what :S
[13:00:38] o/ ebernhardson: hm, airflow-devenv is setting up a kerberos instance during setup and asks me for my password, so it should have everything it needs. The spark-submit definitely works, so I wonder why it doesn’t when run from AirFlow. What would have been your deployment routine before airflow-devenv?
[13:44:59] pfischer: i would run the spark-submit command from fixtures, but manually adjusted to work in context
[13:49:16] probably worth asking, maybe in #talk-to-data-engineering, about the kerberos thing. I'm sure someone knows
[13:53:54] (which was #data-engineering-colab, it was renamed like 30 minutes ago)
[14:18:58] ebernhardson: I already ran spark-submit with the latest hive source-table snapshot and kafka-test as sink. That works. - let’s see if anyone has an idea
[14:19:34] pfischer: i suppose in general thats as far as i take testing, if the fixtures generate and seem reasonable, and the script runs, thats about all the testing i do before it goes to prod. And yes some things get to prod and don't work :P
[15:32:41] cormacparle: I was out yesterday. Do you want me to reply with more detail on the community wishlist question about stemming and completion?
[15:48:58] Trey314159: ebernhardson: Is there anything besides the done/reported tickets you would like me to mention in our weekly report? The etherpad is rather empty. ;-)
[15:55:15] pfischer: nothing too exciting going on. We found and ironed out a few bugs in the Japanese/Sudachi analysis chain and Erik's fixing some regex corner cases, but those are small followups to mostly done projects. Are we reporting on big/interesting stuff or trying to justify our continued existence? If the former, nothing worth making other people read about from me - though I made you read this long reply...
[15:56:38] pfischer: hmm, mostly just moving forward those small bits, trying to run some AB tests but it hasn't quite worked yet
[15:57:00] well, it's turned on but the test treatment isn't applied until we roll forward the javascript fixes
[16:06:24] Trey314159: 😃 Well, the report says ‘Highlights’ but usually it’s everything that was done.
[17:03:51] Trey314159: random thought...I'm looking at the wikidata highlighting problem, and the general problem is that we highlight across \n boundaries.
[17:03:58] see https://www.wikidata.org/w/index.php?search=pageid%3A17802558+sweet+potato&title=Special%3ASearch&profile=advanced&fulltext=1&ns0=1&cirrusDumpResult
[17:04:34] what if we always split the text content on \n before indexing? I was pondering doing that for only wikidata...but maybe it's useful everywhere
[17:05:43] it would mean instead of `"text": "감자\nPotatoes\nTerpomoj"` we would index `"text": ["감자", "Potatoes", "Terpomoj"]`
[17:07:04] iiuc...the two changes would be phrase matching would no longer match across boundaries, and that highlighting would also not cross boundaries
[17:07:14] i guess insource too
[17:07:38] * ebernhardson actually wonders what insource does when it's an array
[17:08:01] ebernhardson: in the example you shared I don't see the highlights across \n.. but in general I *think* it sounds like a good idea. I sometimes see matches across sentence or paragraph boundaries that don't mean what I searched for.
[17:08:53] Trey314159: the wikidata one should have a single long highlighted string that highlights `Potatoes` `The Potato` and `Sweet Potato`, but notably that is a single highlighted string so there is no order preference presented
[17:09:11] There may be someone somewhere who likes the searching for "end of paragraph. Heading Start of new paragraph"... but I can't imagine why...
[17:09:30] PotatoSweet Potato
[17:09:43] right, but there is a newline between Potato and Sweet Potato
[17:09:46] Ugh..
formatting is hard in IRC
[17:10:01] so those are actually two separate things that should be chosen between to highlight
[17:10:16] i suppose the other part of this is php has to post-process that, it splits on \n and then chooses the first line that has highlights
[17:10:40] (because it would be really weird to display the rest of that wikibase source text in the search results)
[17:10:43] Can you see the open and close formatting characters? It's `{Potato}\n{Sweet} {Potato}`... so nothing crosses \n
[17:11:06] Trey314159: right but thats the problem, `Sweet Potato` should be the preferred highlight, over `The Potato` or `Potatoes`
[17:11:22] Trey314159: but because it's a single highlighted string there is no preference, it's just one string with three bits highlighted
[17:13:57] Okay, so it's not that a single highlight itself crosses \n, but how snippets are chosen from a long string with multiple highlights and \n in it. So the theory is that "{Sweet} {Potato}" would be chosen as the snippet over "The {Potato}" if they are individual strings. Sounds reasonable....
[17:14:26] yea, thats the hope at least. I suppose i haven't actually indexed data, but the stats hopefully line up :)
[17:14:54] The opposite problem, though, would be if you searched for `short story potato` and couldn't get one snippet with both `short story` and `potato` in it.
[17:15:02] but yea, the goal is to get the highlighter which has all the stats available to choose between the strings, instead of pushing that to php where we simply take the first line that has the markers
[17:15:26] hmm, indeed.
[17:18:06] Pie in the sky idea: if the highlighter can rank strings as snippet candidates, and you have a lot of short ones, you concatenate them in order until you hit your snippet length quota.. then maybe you get "{Sweet} {Potato} ... Korean-language {short} {story} by Kim Dong-in ... The {Potato} ...". Not sure that's better.. you could end up with multiple copies of almost the same string.
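(Editor's sketch, not part of the log.) The three snippet strategies discussed above can be compared in a few lines of Python. Everything here is an assumption for illustration: the `{...}` markers only mirror the notation used in the chat, the sample text is made up, and the `score` function is a toy ranking, not what the real highlighter or the PHP post-processing computes.

```python
import re

# Hypothetical markers: braces mirror the chat notation, not the real pre/post tags.
HIGHLIGHT = re.compile(r"\{([^{}]+)\}")


def first_highlighted_line(text):
    """Behaviour described in the chat: split on \\n, take the first line with a marker."""
    for line in text.split("\n"):
        if HIGHLIGHT.search(line):
            return line
    return None


def score(line):
    """Toy score (assumption): (number of highlights, fraction of words highlighted)."""
    words = line.split()
    hits = len(HIGHLIGHT.findall(line))
    return (hits, hits / len(words) if words else 0.0)


def best_highlighted_line(lines):
    """Choose between separate strings instead of taking the first match."""
    candidates = [line for line in lines if HIGHLIGHT.search(line)]
    return max(candidates, key=score) if candidates else None


def concat_snippets(lines, quota):
    """'Pie in the sky' variant: join ranked candidates until the length quota is hit."""
    ranked = sorted((l for l in lines if HIGHLIGHT.search(l)), key=score, reverse=True)
    chosen, used = [], 0
    for line in ranked:
        if chosen and used + len(line) > quota:
            break
        chosen.append(line)
        used += len(line)
    return " ... ".join(chosen)


# Made-up sample: one newline-joined field, as it would be indexed today.
text = "{Potatoes}\nThe {Potato}\n{Sweet} {Potato}"
lines = text.split("\n")  # what splitting on \n before indexing would produce

print(first_highlighted_line(text))  # first line with a marker
print(best_highlighted_line(lines))  # line with the most highlighted terms
print(concat_snippets(lines, 30))    # ranked lines joined up to the quota
```

With this sample, taking the first marked line returns `{Potatoes}`, while ranking the separate strings returns `{Sweet} {Potato}`. The concatenation variant also shows the downside raised above: near-duplicate strings can end up side by side in one snippet.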
Hmmm
[17:22:41] breaking on \n is a super easy normalization to add (except i have to double check it doesn't break sup schemas :P), tempted to simply add it and see how it goes. If it's enabled everywhere we can also apply that at reindex time so it happens now instead of after saneitizer has a few months
[17:23:32] otherwise could have wikibase content handler provide the text as an array of strings, but i suspect that will be annoying to thread through
[17:29:16] lunch, back in ~30
[17:46:46] * ebernhardson sighs..indeed changing it requires a schema dance in sup
[19:19:35] been back, but taking a quick walk. Back in ~20
[19:51:47] back
[20:31:16] I'm having "fun" trying to get java installed in a docker container with blubber. "21.85 Errors were encountered while processing: 21.85 openjdk-17-jre-headless:amd64"
[20:31:47] Lemme see if we have any other images trying to install java on bookworm
[20:40:13] there is a thing with man pages or something, iirc
[20:40:54] yep, looks like it `update-alternatives: error: error creating symbolic link '/usr/share/man/man1/java.1.gz.dpkg-tmp': No such file or directory`
[20:41:27] I wonder if there's a flag you can pass to apt to make it not try to install man pages? Or maybe I have to install the man page package?
[20:41:36] you just have to mkdir before installing
[20:42:02] we have a container somewhere with example dockerfile...but i'm not finding it yet :P
[20:42:18] I **have** been looking at https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-opensearch-image/-/tree/main?ref_type=heads
[20:42:52] inflatador: that wont help, that is a redhat image
[20:43:10] yeah, I realized that ;(
[20:43:25] It's OK though, if all I need to do is `mkdir -p`, that's easy enough
[20:51:01] best i can find is this, but yea just mkdir basically: https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/blob/main/docker/Dockerfile?ref_type=heads#L187
[20:54:58] ah OK, let's see if that does the trick
[21:02:19] blubber seems to be ignoring my `command` stanza. I'll try getting rid of the apt part and seeing if that does anything? Open to suggestions. Here's my blubberfile: https://paste.opendev.org/show/828450/
[21:03:31] ah, I guess the container doesn't run as root, it can't write to /usr/share
[21:08:18] hmm, it's still unhappy even when I install the manpages pkg. Hmmph
[21:10:32] maybe I need a separate build target just to mkdir? Guess I'll give that a whirl
[21:19:47] ryankemper you have anything for pairing? I've just been working on ^^ , not having much luck ;(
[21:37:12] I'm giving up for now. Have a great weekend, all
[21:49:35] inflatador: sorry was out with dog. I’m just looking over graph split documentation