[09:25:46] re truncate_norm, my bad sorry... I backported it upstream but obviously it's not available in 1.3 :/
[14:44:31] \o
[14:45:04] o/
[14:45:14] re: truncate, i'm testing in 3.x so switched it to the truncate token filter, but no clue if that actually works
[14:45:26] was pondering though that we will probably have to do something before the next MW release
[14:55:23] iirc truncate is mainly used on titles (leaving aside wikibase); titles should have a max length imposed by the db, so perhaps it's fine to simply drop truncate_norm if it's not available?
[14:58:17] hmm, probably. I would have to review to be sure but yea it's mostly on titles, which are already limited
[15:07:47] actually realized i didn't swap it to the truncate char_filter, i saw that in there... but just removed all the truncates from the mapping since it was easier
[15:44:40] dcausse: i'm still not quite sure... what should i be doing with the highlighting stage?
[15:44:52] do we just go with the OS semantic highlighter?
[15:46:12] ebernhardson: I suppose there will be no highlight, only extracting the matching passage from the inner hit match and populating the SearchResult with this blob of text and the section?
[15:47:15] hmm, yea i suppose we can just return the whole passage text, we have that
[15:47:31] i guess i was thinking we would want some bolding in there
[15:47:50] but with it coming from a totally different model, it kinda breaks the idea that "these bolds are why we chose this excerpt"
[15:47:56] the OS semantic highlighter is missing too much imo, I suspect it's because the model it's using is too far away from the embedding model used
[15:47:58] yes
[15:48:59] all my testing is with a terrible model anyways, using minilm with 384 dims, since it doesn't really matter with the 3 pages in my dev instance
[15:49:22] sure
[15:50:37] I think all this falls into everything we don't do post-knn query but where we'll probably have to invest in something...
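(Editor's note: a minimal sketch of the swap discussed above, replacing the plugin-provided truncate_norm with the stock truncate token filter that ships with OpenSearch. The filter/analyzer names and the 512-character limit are illustrative assumptions, not the real CirrusSearch mapping.)

```python
# Hypothetical analysis settings using the stock "truncate" token filter
# instead of the unavailable truncate_norm. Names and length are made up.
settings = {
    "analysis": {
        "filter": {
            # stock OpenSearch filter: cuts tokens longer than `length` chars
            "title_truncate": {"type": "truncate", "length": 512},
        },
        "analyzer": {
            "title_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "title_truncate"],
            }
        },
    }
}
```

Since titles are already length-limited at the database layer, dropping the filter entirely (as done at 15:07:47) is the other option the chat lands on.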
(reranking and possibly term matching)
[15:51:59] if you find something reasonable we should definitely try, but yes, everything easily available in OS did not seem very appealing
[15:53:17] i wonder a bit what it takes to go from a more generic model to a highlighter model; suspect it's some form of fine-tuning on the existing models, but that sounds like a far too large project
[15:53:54] what I'd be curious to test is https://huggingface.co/Qwen/Qwen3-Reranker-0.6B and see how hard that is to plug into https://docs.opensearch.org/latest/search-plugins/search-pipelines/rerank-processor/
[15:54:44] highlight could be seen as finding the best matching sentence of the best chunk of text
[15:57:34] llama.cpp-server has a rerank mode; wondering if it emits something easily consumable by the OS rerank processor (someone made a gguf version available at https://huggingface.co/Mungert/Qwen3-Reranker-0.6B-GGUF)
[15:58:50] hmm, yea that could be interesting to get into things as well; the idea would be to make it a network call away? Or are there bits already written to use FFM or JNI to access it?
[16:01:29] no, OS does not support such models with last-token pooling, so what worked for me is to spin up a local llama.cpp-server and do remote calls
[16:01:40] ultimately that would have to live in liftwing
[16:02:24] I found llama.cpp quite easy to use with their docker image
[16:02:47] what's harder is crafting the opensearch connector to talk to it...
[16:02:58] ok, i'll take a look at that. I mostly have it "running" locally, although not entirely happy with everything. Various bits of the code are just awkward... but fine
[16:03:09] sure
[16:24:09] heading out, have a nice weekend and safe travels!
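(Editor's note: a hedged sketch of the "remote calls to a local llama.cpp-server" approach described above. llama.cpp's server, when started in rerank mode with a reranker GGUF, exposes a rerank endpoint; the URL, request body, and response field names below are assumptions based on its OpenAI-style API and should be checked against the version actually run.)

```python
import json
import urllib.request

# Assumed endpoint of a locally running llama.cpp-server in rerank mode.
LLAMA_RERANK_URL = "http://localhost:8080/v1/rerank"

def rerank_request(query, documents):
    """Build the JSON body for a rerank call (assumed field names)."""
    return {"query": query, "documents": documents}

def order_by_score(results):
    """Sort rerank results (assumed dicts with 'index' and
    'relevance_score') best-first, returning document indices."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [r["index"] for r in ranked]

def rerank(query, documents):
    """POST to the server and return documents' indices, best match first."""
    req = urllib.request.Request(
        LLAMA_RERANK_URL,
        data=json.dumps(rerank_request(query, documents)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return order_by_score(json.load(resp)["results"])
```

Wiring this into the OpenSearch rerank processor then reduces to the connector work mentioned at 16:02:47.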
[16:24:21] \o
[17:40:52] :S https://github.com/opensearch-project/OpenSearch/issues/17742 - [Feature Request] Add configurability to run ingest pipelines during document update operations
[17:41:58] it looks like it might work with doc_as_upsert (which we usually use), but still... that seems like a silly limitation
[17:42:15] (also we won't use the pipelines in prod, but i'm testing local dev here :P)
[19:28:55] hmm, this also needs some understanding of how to link to the snippet, or at least the appropriate section :S
[19:36:37] a curious paper released by anthropic: "How AI impacts Skill Formation" -- "Our findings suggest that AI-enhanced productivity is not a shortcut to competence and AI assistance should be carefully adopted into workflows to preserve skill formation – particularly in safety-critical domains.": https://arxiv.org/abs/2601.20245
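(Editor's note: a minimal sketch of the doc_as_upsert workaround speculated about at 17:41:58. With `doc_as_upsert`, a missing document turns the `_update` into an index operation, which is the path where a default ingest pipeline may still run; whether it actually does is exactly what the linked issue questions. The field names are illustrative.)

```python
# Hypothetical _update request body: update-if-exists, index-if-missing.
def update_body(fields):
    return {"doc": fields, "doc_as_upsert": True}
```

The body would be sent to `POST /{index}/_update/{id}`; for plain partial updates (no upsert), the ingest pipeline reportedly does not run at all per the issue above.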