[09:25:46] re truncate_norm, my bad sorry... I backported it upstream but obviously it's not available in 1.3 :/
[14:44:31] \o
[14:45:04] o/
[14:45:14] re: truncate, i'm testing in 3.x so switched it to the truncate token filter, but no clue if that actually works
[14:45:26] was pondering though that we will probably have to do something before the next MW release
[14:55:23] iirc truncate is mainly used on titles (leaving aside wikibase); titles should have a max length imposed by the db, so perhaps it's fine to simply drop truncate_norm if it's not available?
[14:58:17] hmm, probably. I would have to review to be sure but yea it's mostly on titles, which are already limited
[15:07:47] actually realized i didn't swap it to the truncate char_filter, i saw that in there... but just removed all the truncates from the mapping since it was easier
[15:44:40] dcausse: i'm still not quite sure... what should i be doing with the highlighting stage?
[15:44:52] do we just go with the OS semantic highlighter?
[15:46:12] ebernhardson: I suppose there will be no highlight, only extracting the matching passage from the inner hit match and populating the SearchResult with this blob of text and the section?
[15:47:15] hmm, yea i suppose we can just return the whole passage text, we have that
[15:47:31] i guess i was thinking we would want some bolding in there
[15:47:50] but with it coming from a totally different model, it kinda breaks the idea that "these bolds are why we chose this excerpt"
[15:47:56] the OS semantic highlighter is missing too much imo, I suspect it's because the model it's using is too far away from the embedding model used
[15:47:58] yes
[15:48:59] all my testing is with a terrible model anyways, using minilm with 384 dims, since it doesn't really matter with the 3 pages in my dev instance
[15:49:22] sure
[15:50:37] I think all this falls into everything we don't do post-knn query but where we'll probably have to invest in something...
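(Editor's note: a minimal sketch of the swap discussed above, replacing the plugin-provided truncate_norm with the stock truncate token filter that ships with OpenSearch. The filter/analyzer names and the 512-character limit are illustrative assumptions, not the real CirrusSearch mapping.)

```python
# Hypothetical analysis settings using the stock "truncate" token filter
# instead of the unavailable truncate_norm. Names and length are made up.
settings = {
    "analysis": {
        "filter": {
            # stock OpenSearch filter: cuts tokens longer than `length` chars
            "title_truncate": {"type": "truncate", "length": 512},
        },
        "analyzer": {
            "title_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "title_truncate"],
            }
        },
    }
}
```

Since titles are already length-limited at the database layer, dropping the filter entirely (as done at 15:07:47) is the other option the chat lands on.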
(reranking and possibly term matching)
[15:51:59] if you find something reasonable we should definitely try, but yes, everything easily available in OS did not seem very appealing
[15:53:17] i wonder a bit what it takes to go from a more generic model to a highlighter model; suspect it's some form of fine-tuning on the existing models, but that sounds like a far too large project
[15:53:54] what I'd be curious to test is https://huggingface.co/Qwen/Qwen3-Reranker-0.6B and see how hard that is to plug into https://docs.opensearch.org/latest/search-plugins/search-pipelines/rerank-processor/
[15:54:44] highlight could be seen as finding the best matching sentence of the best chunk of text
[15:57:34] llama.cpp-server has a rerank mode; wondering if it emits something easily consumable by the OS rerank processor (someone made a gguf version available at https://huggingface.co/Mungert/Qwen3-Reranker-0.6B-GGUF)
[15:58:50] hmm, yea that could be interesting to get into things as well; the idea would be to make it a network call away? Or are there bits already written to use FFM or JNI to access it?
[16:01:29] no, OS does not support such models with last-token pooling, so what worked for me is to spin up a local llama.cpp-server and do remote calls
[16:01:40] ultimately that would have to live in liftwing
[16:02:24] I found llama.cpp quite easy to use with their docker image
[16:02:47] what's harder is crafting the opensearch connector to talk to it...
[16:02:58] ok, i'll take a look at that. I mostly have it "running" locally, although not entirely happy with everything. Various bits of the code are just awkward... but fine
[16:03:09] sure
[16:24:09] heading out, have a nice weekend and safe travels!
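(Editor's note: a hedged sketch of the "remote calls to a local llama.cpp-server" approach described above. llama.cpp's server, when started in rerank mode with a reranker GGUF, exposes a rerank endpoint; the URL, request body, and response field names below are assumptions based on its OpenAI-style API and should be checked against the version actually run.)

```python
import json
import urllib.request

# Assumed endpoint of a locally running llama.cpp-server in rerank mode.
LLAMA_RERANK_URL = "http://localhost:8080/v1/rerank"

def rerank_request(query, documents):
    """Build the JSON body for a rerank call (assumed field names)."""
    return {"query": query, "documents": documents}

def order_by_score(results):
    """Sort rerank results (assumed dicts with 'index' and
    'relevance_score') best-first, returning document indices."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [r["index"] for r in ranked]

def rerank(query, documents):
    """POST to the server and return documents' indices, best match first."""
    req = urllib.request.Request(
        LLAMA_RERANK_URL,
        data=json.dumps(rerank_request(query, documents)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return order_by_score(json.load(resp)["results"])
```

Wiring this into the OpenSearch rerank processor then reduces to the connector work mentioned at 16:02:47.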
[16:24:21] \o
[17:40:52] :S https://github.com/opensearch-project/OpenSearch/issues/17742 - [Feature Request] Add configurability to run ingest pipelines during document update operations
[17:41:58] it looks like it might work with doc_as_upsert (which we usually use), but still... that seems like a silly limitation
[17:42:15] (also we won't use the pipelines in prod, but i'm testing local dev here :P)
[19:28:55] hmm, this also needs some understanding of how to link to the snippet, or at least the appropriate section :S
[19:36:37] a curious paper released by anthropic: "How AI impacts Skill Formation" -- "Our findings suggest that AI-enhanced productivity is not a shortcut to competence and AI assistance should be carefully adopted into workflows to preserve skill formation – particularly in safety-critical domains.": https://arxiv.org/abs/2601.20245
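(Editor's note: a minimal sketch of the doc_as_upsert workaround speculated about at 17:41:58. With `doc_as_upsert`, a missing document turns the `_update` into an index operation, which is the path where a default ingest pipeline may still run; whether it actually does is exactly what the linked issue questions. The field names are illustrative.)

```python
# Hypothetical _update request body: update-if-exists, index-if-missing.
def update_body(fields):
    return {"doc": fields, "doc_as_upsert": True}
```

The body would be sent to `POST /{index}/_update/{id}`; for plain partial updates (no upsert), the ingest pipeline reportedly does not run at all per the issue above.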