[10:21:27] my mapping was wrong indeed, chunk vs text. luckily I could change the mapping online and get highlighting to work, but it's extremely slow :( ~2s for 10 results vs ~0.3s without highlights
[11:09:17] lunch + errand
[15:02:32] \o
[15:02:49] yea, i was seeing similar when trying to highlight opening_text
[15:03:41] o/
[15:22:53] i'm not super clear how much faster it can be... the main thing i found about faster semantic highlighting was that opensearch can shift some of that work to index time (and index size), but opensearch does not
[15:23:13] err, elasticsearch can shift
[15:24:41] oh, haven't seen that about elastic; for opensearch they suggest using a remote highlighter and batching (https://opensearch.org/blog/batch-processing-semantic-highlighting-in-opensearch-3-3/)
[15:25:54] yea, but that's kinda cheating, instead of making it faster they make a remote api call :P
[15:25:55] but the gain is not that huge tbh, 1.937s vs 1.685s apparently
[15:26:00] yes :)
[15:26:46] hard to say how much better it is in elastic, they have some blog post that is basically advertising about why their vector search is better
[15:26:54] which is where they mention the optimization
[15:27:26] it would be a whole other set of vectors to keep in the index as well though
[15:27:54] yes...
[15:29:30] it certainly suggests to me that if we are going to start integrating these expensive things, we need to think about query routing some more: query heuristics that turn on the expensive stuff and can be tuned for how much compute we can spend
[15:29:39] was about to try cross encoders, but given the highlighting speed I'm not totally sure that's worth the effort
[15:29:42] yes totally
[15:30:54] and probably limit to a very low top_k like 3 and feed the rest with lexical results
[15:31:13] yea indeed, this semantic highlighter can't apply to long result lists
[15:51:44] I have an appointment in ~40 min, so won't make the weds mtg
[16:24:32] ^^ heading out for said appointment now, back in ~90 min or so
[17:39:27] dinner
[18:10:27] back
[18:17:23] huh.. turns out that yea, officewiki is different :P ParserOutput::getTOCData returns null :S
[18:17:37] (which is what has the section metadata)
[18:20:52] i suppose we don't really need that, we can select mw section headers by css, but it was convenient to have a more concrete source of truth
[18:30:00] oh, no... i chose a page with fake headers made with bolded text :P
[19:03:52] * ebernhardson ponders the right way to signal WikitextContentHandler to generate the sections... seems like at least initially the code should be optionally run
[19:45:11] ebernhardson: bolded fake headers… reminds me of explaining the benefits of MS Word header styles to my classmates 20 years ago - let's hope that does not happen too often
[20:48:33] lol, of course... was running my section splitting routine on random pages in officewiki. It failed on an SRE meeting notes page with 194 sections :P
[21:10:00] oh, of course. it's not just that they have a bunch of sections, it's that there are sections inside a table, and we moved the table into another field
[21:10:14] * ebernhardson has no clue what to do with that yet :P
[21:15:33] ebernhardson: huh? which page is that?
[21:17:09] pfischer: lots of examples, but this is one random one: https://office.wikimedia.org/wiki/SRE/Meeting_Notes/SRE-2019-08-19
[21:17:54] for the moment i guess i will throw out missing sections from the TOC; i was kinda hoping to keep the requirement that they exist as a sanity check
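For illustration, a rough Python sketch of the "throw out missing TOC sections" fallback described above. It pulls section metadata from the real MediaWiki action API (action=parse, prop=sections) and drops TOC entries whose headings never show up in the splitter's output; the chunk format and the reconcile helper are hypothetical stand-ins for the prototype's section splitter, not actual CirrusSearch code.

```python
import requests

API = "https://office.wikimedia.org/w/api.php"  # any MediaWiki action API endpoint


def fetch_toc_sections(title: str) -> list[dict]:
    """Fetch section (TOC) metadata for a page via the MediaWiki action API."""
    resp = requests.get(API, params={
        "action": "parse",
        "page": title,
        "prop": "sections",
        "format": "json",
        "formatversion": "2",
    })
    resp.raise_for_status()
    return resp.json()["parse"]["sections"]


def reconcile(toc_sections: list[dict], chunks: dict[str, str]) -> list[dict]:
    """Keep only TOC entries whose heading the splitter actually produced.

    `chunks` maps heading text -> extracted section text (hypothetical splitter
    output). Entries with no matching chunk (e.g. headings that lived inside a
    table that was moved to another field) are dropped instead of failing hard.
    """
    kept, dropped = [], []
    for sec in toc_sections:
        heading = sec["line"]  # heading text as rendered in the TOC
        (kept if heading in chunks else dropped).append(sec)
    if dropped:
        # Logging the mismatches keeps them visible as potential cleanup tasks.
        print(f"dropping {len(dropped)} TOC sections with no matching chunk:",
              [s["line"] for s in dropped])
    return kept
```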
[21:31:32] ebernhardson: ouch, that is indeed gruesome. If we log those mismatches, Growth has a new source of clean-up tasks… but yeah, tossing the sections seems reasonable; after all, it's just a prototype.
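And a back-of-the-envelope sketch of the query-routing idea from 15:29-15:31: run the cheap lexical query for the full result list, and only spend the expensive semantic machinery (highlighting, cross-encoder reranking) on a very small top_k, filling the rest with plain lexical hits. The lexical_search and semantic_enrich callables are hypothetical placeholders for whatever the router would call, not an existing API.

```python
from typing import Callable


def route_query(
    query: str,
    lexical_search: Callable[[str, int], list[dict]],          # hypothetical: cheap BM25-style search
    semantic_enrich: Callable[[str, list[dict]], list[dict]],  # hypothetical: highlighter / cross-encoder
    size: int = 20,
    expensive_top_k: int = 3,
    enable_expensive: bool = True,
) -> list[dict]:
    """Spend the expensive semantic work only on a small prefix of the results.

    The heuristic lives in the knobs (expensive_top_k, enable_expensive): they
    decide how much compute a given query is worth, and could be driven by
    per-query features rather than constants.
    """
    hits = lexical_search(query, size)
    if not enable_expensive or not hits:
        return hits
    head = semantic_enrich(query, hits[:expensive_top_k])  # ~2s per 10 docs, so keep k tiny
    return head + hits[expensive_top_k:]
```

The point of the sketch is only the shape of the routing decision; where the thresholds come from (query heuristics, available compute budget) is exactly the open question raised at 15:29:30.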