[08:45:21] re cirrussearch highload, started to dig into queries and saw weird patterns again of queries targeting the commons index from non-commons wikis (was from wikidata this time), wondering if we should flag such queries in the metrics to better correlate them with latency issues
[08:47:09] geodata searches are quite popular as well, with pretty bad p50 at >500ms :(
[08:50:48] dcausse: thank you for looking into this. When you say “targeting from non-commons” is that based on a referrer header or how do we know where they come from?
[08:51:37] Did the latency increase only with the traffic or was it slow before?
[08:52:24] pfischer: when your search includes the file namespace (NS_FILE==6) the query will blend the current wiki index + the commons index
[08:53:55] it's not hitting commons.wikimedia.org directly but using the underlying commons index, e.g. (see the path) https://www.wikidata.org/w/index.php?search=file%3Atest&language=en&title=Special%3ASearch&ns0=1&cirrusDumpQuery
[08:57:07] the latency increase is more spikes here and there but seems to have started around nov 25 (https://grafana-rw.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1&from=now-30d&to=now&timezone=utc&var-cluster=elasticsearch&var-exported_cluster=production-search&viewPanel=panel-98)
[09:00:04] searching ns_file from wikidata seems dubious because I don't think you can upload directly to wikidata
[11:00:15] lunch
[14:02:29] FYI: wikimedia/discovery/* repos on Gerrit are now included in Codesearch.
[14:04:26] https://codesearch.wmcloud.org/deployed/
[14:04:36] https://gerrit.wikimedia.org/r/c/labs/codesearch/+/828055
[14:04:55] They're included in the default "Everywhere" profile, and in the "MediaWiki & services at WMF" profile
[14:05:56] List at the bottom of https://codesearch.wmcloud.org/deployed/?action=repos
[14:07:27] example: https://codesearch.wmcloud.org/deployed/?q=org.wikimedia.discovery
[14:23:59] ebernhardson are you OK with me closing out T410681 or are there other things needed?
[14:24:00] T410681: Setup opensearch 3 on relforge servers - https://phabricator.wikimedia.org/T410681
[14:57:36] Krinkle: thanks! I do miss some of those at times
[15:05:58] oh nice, thanks!
[15:06:00] o/
[15:06:15] inflatador: i suppose we can close it out
[15:06:17] if you want to add anything else, let me know or file a phab task
[16:04:16] ryankemper we're in pairing if you wanna join
[16:04:24] err...stand-up that is
[16:23:00] looking at the enwiki w/vectors index, not terrible. At least it works, though the vectors i uploaded from spark don't query properly (yet, not sure why)
[16:24:05] it's pretty huge though, half a TB with no deleted docs..i guess it's perhaps 2x the size of the prod index
[16:33:54] ebernhardson: last time I checked a simplewiki index there I saw that the vectors were not normalized
[16:34:15] and using l2 might not be great in that case
[16:34:52] dcausse: yea that might be the case, i suspect i have to sit down and play with settings until the same numbers come out of both sides
[16:35:34] imported the full enwiki from cirrus dumps using an ingest pipeline and very naive chunking and results are sometimes meh... (ssh -L12222:localhost:12222 stat1009.eqiad.wmnet)
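To make the normalization point at 16:33:54 concrete (a minimal self-contained sketch with made-up vectors, not part of the log): for unit-length vectors, l2 distance and cosine similarity rank documents identically, but once document norms vary the two can disagree, so un-normalized vectors scored with l2 could produce surprising results.

```python
import numpy as np

# Hypothetical check: are the stored vectors unit-length? For normalized
# vectors, l2 and cosine give the same ranking; otherwise they can diverge.
rng = np.random.default_rng(0)
docs = rng.normal(size=(3, 768)) * np.array([[0.5], [1.0], [3.0]])  # varied norms
query = rng.normal(size=768)

norms = np.linalg.norm(docs, axis=1)
print("doc norms:", norms)  # far from 1.0 -> not normalized

l2_rank = np.argsort(np.linalg.norm(docs - query, axis=1))
cos = docs @ query / (norms * np.linalg.norm(query))
cos_rank = np.argsort(-cos)
print("l2 ranking:", l2_rank, "cosine ranking:", cos_rank)  # may differ
```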
[16:39:43] index size is 481.1gb (15 shards, no replication), all fields should be indexed, not only the vectors, and it's already 200GB+ larger than the enwiki index with deleted docs
[16:40:05] nifty demo, i didn't go further than buttons to press in opensearch dashboards :)
[16:40:16] :)
[16:45:02] one thing i wonder about is model versions....versioning in the ml world is very unspecific. There are so many variants of models, shifted between formats, it's not clear if the model i have in sparknlp is even the same (bit-for-bit) model
[16:46:04] yes... not sure how to check without sending a text and comparing actual vectors
[16:46:15] saw some opensearch models in the sparknlp db
[16:47:16] or perhaps we need some tool to package the same model for both sparknlp and opensearch
[16:47:54] for opensearch I read the model config, it says at least if it does pooling, vector normalization and what vector space to use
[16:50:02] what i can confidently say is i have something claiming to be distillroberta on both sides, but the numbers coming out don't match :( i doubt it's just normalization...
[16:50:28] pondering if it's worthwhile to learn about shifting the model formats, to load the data files from the opensearch model into sparknlp
[16:50:42] but we already have a working index, maybe thats enough to experiment with (also, not sure what experiments to run :P)
[16:50:52] yes saw that you imported huggingface/sentence-transformers/all-distilroberta-v1 to opensearch
[16:52:15] perhaps learning about the technical bits, e.g. semantic highlighting, cross encoder reranking, fusion (it has all the fields so could possibly upload the enwiki ltr models and fuse both?)
[16:53:44] Hmm, yea highlighting is a big open question worth looking into. Reranking and fusion also worth considering
[16:56:17] but I think that proper chunking (and proper selection of the content) is going to be key, here I see a lot of pollution from content we don't seem to properly move to the extra_content field
[16:56:54] who wrote The Hitchhiker's Guide to the Galaxy? -> "Guide to the Galaxy. p. 31. Adams, Douglas (1994). The Illustrated Hitchhiker's Guide to the Galaxy. New York, NY: Harmony/Crown. p. 4. ISBN 978-0-517-59924-2." from the Trillian (character) page :/
[16:58:01] which is I think a citation that leaked into the text field
[16:58:53] dinner
[16:59:15] workout, back in ~40
[17:59:08] Walking dog
[18:04:55] hmm, attempting to use semantic highlighter regularly hits the memory circuit breakers :S Limited it to only opening_text, but still regularly hits the breakers. Will have to look into what memory usage to expect
[18:28:34] back
[19:08:51] hmm, semantic highlighter is being difficult...it seems like it always returns the entire opening text, but with one bit highlighted (number_of_fragments seems ignored)
[19:09:41] poking at the code, it does support multiple highlights in a text field, might depend on other things
[19:12:14] could it work on the nested field chunks (passage_chunk_embedding.chunk field)?
[19:13:05] yea..there are no options. only bits passed down from the query to highlighter are field (resolved to text), model_id, query text, and pre/post tags. And indeed, it always returns the full source string.
[19:13:28] dcausse: hmm, maybe. will test
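A sketch of the experiment floated at 19:12/19:13: a nested neural query against the chunk embeddings, with a semantic highlight on the chunk text. Only the passage_chunk_embedding.* field prefix comes from the log; the index name, model ids, and the .knn sub-field are placeholders, and the query/highlight shapes follow the documented OpenSearch examples rather than anything confirmed here.

```python
# Hedged sketch: nested neural query + "semantic" highlighter on the chunk
# text. Index name, model ids, and the .knn sub-field name are assumptions.
from opensearchpy import OpenSearch

client = OpenSearch("https://localhost:9200", verify_certs=False)
resp = client.search(
    index="enwiki_vectors",  # hypothetical index name
    body={
        "_source": False,
        "query": {
            "nested": {
                "score_mode": "max",
                "path": "passage_chunk_embedding",
                "query": {
                    "neural": {
                        "passage_chunk_embedding.knn": {  # assumed vector sub-field
                            "query_text": "who wrote The Hitchhiker's Guide to the Galaxy",
                            "model_id": "<embedding_model_id>",
                        }
                    }
                },
            }
        },
        "highlight": {
            "fields": {"passage_chunk_embedding.text": {"type": "semantic"}},
            "options": {"model_id": "<sentence_highlighting_model_id>"},
        },
    },
)
for hit in resp["hits"]["hits"]:
    print(hit.get("highlight"))
```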
[19:19:28] i think we would have to reindex with `store: true` on the passage_chunk_embedding.text field mapping to highlight like that (maybe; at least as-is it complains about a missing field)
[19:21:35] curiously it's not requiring that of the non-nested query, mapping from the sub-documents into the source doc should be plausible, but maybe it just doesn't do that
[20:38:21] took a look over what we have available to get sections from the wikitext, it looks like parsoid gives each heading a unique id tag, and gives us the list, so we can find the heading in the source document. sadly there isn't a convenient tag wrapping the section, so some heuristics are still required
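A sketch of the reindex idea from 19:19:28, under the usual create-then-reindex flow: only the passage_chunk_embedding.text field name comes from the log; the index names are made up and the knn_vector sub-field and other settings are omitted.

```python
# Hedged sketch of the `store: true` change from 19:19:28. Nested mappings
# can't be changed in place, so the stored field goes into a reindex target.
from opensearchpy import OpenSearch

client = OpenSearch("https://localhost:9200", verify_certs=False)
client.indices.create(
    index="enwiki_vectors_v2",  # hypothetical reindex target
    body={
        "mappings": {
            "properties": {
                "passage_chunk_embedding": {
                    "type": "nested",
                    "properties": {
                        # stored so the highlighter can fetch chunk text
                        # directly rather than digging through _source
                        "text": {"type": "text", "store": True},
                        # knn_vector sub-field omitted for brevity
                    },
                }
            }
        }
    },
)
# then copy the data across:
client.reindex(body={"source": {"index": "enwiki_vectors"},
                     "dest": {"index": "enwiki_vectors_v2"}})
```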
[21:09:24] oh...curious. cirrus is maybe using a version of the output that hasn't been through final processing? public content has `History and etymology` but cirrus sees `...`
[21:10:23] (the functions we call literally say that, too)
[21:38:23] hmm, i have a plausible method that works in small testing. Will have to scale up. But basically we have the heading ids, find the common ancestor between pairs, and serialize each sibling between the two
[21:38:38] (probably needs lots more error handling :P)
[21:45:07] ryankemper I have to pick up my son from school, might be 5-10m late for pairing
[22:15:15] inflatador: I might have to skip pairing, still working on getting this roof leak patched up
[22:15:28] inflatador: kicking off reboots when im back
[22:16:16] ryankemper ACK, I just got back myself but np if you can't pair. I'm deploying https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1213586
[22:40:59] inflatador: back now
[22:50:29] ryankemper ACK, I think I'm gonna tap out for now. The operator needs to be deployed in CODFW; pinged in https://wikimedia.slack.com/archives/C055QGPTC69/p1763651458113119 for Balthazar to take a look at it tomorrow
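For reference, a rough sketch of the common-ancestor approach described at 21:38:23: given two Parsoid heading ids, climb to their common ancestor and serialize every sibling between them. The lxml choice and all names are assumptions, not the actual implementation, and as 21:38:38 says, real output needs much more error handling.

```python
# Rough sketch of the 21:38:23 idea: find the common ancestor of two heading
# ids and serialize each sibling between them. All names are hypothetical.
from lxml import html

def section_between(doc, start_id, end_id):
    start = doc.get_element_by_id(start_id)
    end = doc.get_element_by_id(end_id)
    # nearest ancestor of `end` that is also an ancestor of `start`
    ancestors_of_start = set(start.iterancestors())
    anc = next(a for a in end.iterancestors() if a in ancestors_of_start)
    # climb both headings up to direct children of the common ancestor
    while start.getparent() is not anc:
        start = start.getparent()
    while end.getparent() is not anc:
        end = end.getparent()
    # serialize the start node and every sibling up to (not including) `end`
    parts, node = [], start
    while node is not None and node is not end:
        parts.append(html.tostring(node, encoding="unicode"))
        node = node.getnext()
    return "".join(parts)

doc = html.fromstring(
    "<html><body><h2 id='a'>A</h2><p>text</p><h2 id='b'>B</h2></body></html>"
)
print(section_between(doc, "a", "b"))  # -> "<h2 id='a'>A</h2><p>text</p>"
```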