[09:56:52] distillroberta has a limit of 128 word pieces, the 100 lexical tokens window is, I think, too big
[11:07:10] dcausse: https://phabricator.wikimedia.org/P86397
[11:11:46] That gives us paragraph level and infoboxes, all nicely labeled
[11:15:19] pfischer: thanks! very nice indeed
[11:37:15] dcausse: Would you join the ML<>Research<>DPE meeting this afternoon? I have a conflict and can't make it, but I think someone from Search should be around; there might be questions from Sucheta regarding the Semantic Search prototype.
[11:37:56] pfischer: sure, I planned to attend anyways
[11:38:09] dcausse: Great, thank you!
[14:51:51] Staff meeting or retro in about an hour?
[14:52:25] no strong preference either way
[14:56:32] \o
[14:56:42] dcausse: ouch, didn't realize it was word pieces, was thinking whole words :(
[14:56:53] o/
[14:57:19] i had heard modern tokenization for ml does pieces, but hadn't thought about it
[14:57:50] other models I saw claim a limit of 512 but often say "mainly trained with 256 word pieces"
[14:58:21] and opensearch won't try any magic to aggregate and average multiple vectors if you're past the model limit
[15:01:31] :( yea, kinda expected though
[15:02:07] looks like ozge is suggesting multilingual-e5-large-instruct, not 100% sure but the default max length is 512 tokens
[15:02:17] probably same idea
[15:03:27] yes, this one is quite large compared to others and will have to be packaged to be opensearch compatible, but perhaps it's not too hard and just a matter of writing the proper model config
[15:04:24] should be interesting, sadly i don't expect i'll find the time to look into that
[15:04:35] he also suggested doing the intersection between simple & enwiki to speed things up
[15:04:48] intersection, like both in same index?
[15:04:48] quickly filtered cirrus enwiki dumps and it's 10 times smaller
[15:05:16] no, take the content from enwiki but only for articles present in simplewiki
[15:05:22] ahh, ok
[15:06:23] Trey314159: I asked Andreas to facilitate our retro today and he accepted, so I would be in favour of our retro :-)
[15:06:25] also very tempted to use enterprise dumps as a quick way to have section/paragraph boundaries
[15:06:41] pfischer: cool
[15:06:46] if we have access to the data files, seems fine for moving forward with testing
[15:07:05] i guess i'm not clear on whether that would work in later prod deployments or not
[15:07:13] (suspect not, but who knows)
[15:08:09] using enterprise dumps is just a quick way to experiment with what could possibly be obtained if we had proper section boundaries in cirrus dumps
[15:12:19] i also wonder what they do with edge cases... found silly things like how you can put section markers in the middle of a table
[15:15:53] yes... it's still marked as beta and only works for the content namespace as far as I can see, so most probably highly tuned towards the wikipedia way
[15:16:28] should be open source tho, so perhaps we can steal some ideas, but suspecting that it might not generalize well
[15:17:57] i'm not doing anything fancy now, i guess i'm just wondering how specific i have to be. I mean in theory i can just walk siblings and parents and serialize everything between two elements tree-wise, but i was hoping to just find a common ancestor
[15:18:21] i suppose i'm also assuming (without being sure) that calling getOuterHTML on dozens/hundreds of elements will be much slower
[15:18:41] :/
[15:19:16] (curiously it has to be getOuterHTML, or we would have to check tags, because things like
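
A minimal sketch of the word-piece point above, assuming the HuggingFace transformers package and the stock distilroberta-base tokenizer (the model actually discussed may differ); it shows why a 100-word window can blow past a 128 word-piece limit: sub-word tokenizers split rare words into several pieces.

    from transformers import AutoTokenizer

    # distilroberta-base is a stock HuggingFace checkpoint, used here only
    # to illustrate word-piece counting.
    tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

    text = " ".join(["interoperability"] * 100)  # a 100 lexical-token window
    piece_count = len(tokenizer(text, add_special_tokens=True)["input_ids"])
    print(len(text.split()), piece_count)
    # 100 words, but typically well over 128 word pieces: a single rare
    # word can map to several pieces, overflowing the model limit.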
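For the "writing the proper model config" step, a hedged sketch of registering a custom sentence-transformers model via the OpenSearch ML Commons _register API: the endpoint and the model_config fields do exist in ML Commons, but the values chosen here for multilingual-e5-large-instruct and the artifact URL are assumptions.

    import requests  # assumes a local OpenSearch with the ML Commons plugin

    model = {
        "name": "intfloat/multilingual-e5-large-instruct",
        "version": "1.0.0",
        "model_format": "TORCH_SCRIPT",
        "model_config": {
            "model_type": "xlm-roberta",        # e5 is XLM-RoBERTa based
            "embedding_dimension": 1024,        # dimension of the large variant
            "framework_type": "sentence_transformers",
        },
        # Placeholder artifact; a real registration also needs
        # model_content_hash_value and usually a model_group_id.
        "url": "https://example.org/e5-large-instruct-torchscript.zip",
    }

    resp = requests.post(
        "http://localhost:9200/_plugins/_ml/models/_register",
        json=model,
    )
    print(resp.json())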
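The simple/enwiki intersection could look roughly like this: keep enwiki documents only for titles that also exist in simplewiki. The file names are placeholders, and the alternating header/document layout of cirrus dumps is assumed, not guaranteed.

    import gzip
    import json

    def docs(path):
        # Cirrus dumps pair a bulk-index header line with a document line;
        # this assumes that layout holds throughout the file.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            while True:
                header = f.readline()
                doc_line = f.readline()
                if not doc_line:
                    return
                yield header, doc_line, json.loads(doc_line)

    simple_titles = {
        d["title"] for _, _, d in docs("simplewiki-cirrus.json.gz") if "title" in d
    }

    with open("enwiki-intersection.json", "w", encoding="utf-8") as out:
        for header, doc_line, d in docs("enwiki-cirrus.json.gz"):
            if d.get("title") in simple_titles:
                out.write(header)
                out.write(doc_line)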
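And for the section-boundary walking: a rough sketch of the "common ancestor" idea using lxml (an assumption; the actual code may use a different DOM API). It finds the lowest common ancestor of two boundary elements, which, as noted above, can sit surprisingly high when a section marker lands inside a table.

    from lxml import html

    def ancestors(el):
        chain = []
        while el is not None:
            chain.append(el)
            el = el.getparent()
        return chain

    def lowest_common_ancestor(a, b):
        a_chain = ancestors(a)  # keep proxies alive so identity stays stable
        a_ids = {id(x) for x in a_chain}
        for anc in ancestors(b):
            if id(anc) in a_ids:
                return anc
        return None

    doc = html.document_fromstring(
        "<html><body><h2 id='a'>A</h2><p>one</p>"
        "<table><tr><td><h2 id='b'>B</h2></td></tr></table></body></html>"
    )
    a = doc.get_element_by_id("a")
    b = doc.get_element_by_id("b")
    # Prints 'body' here: with one boundary inside the table the common
    # ancestor jumps to the top, so everything in between would have to be
    # serialized tree-wise anyway.
    print(lowest_common_ancestor(a, b).tag)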