[08:08:38] o/
[11:26:22] lunch
[13:13:52] o/
[13:31:56] \o
[13:40:55] o/
[13:47:52] .o/
[14:07:17] one thing i wonder... did we get something wrong in the prod deployment? The number of responses that are only "Recensements (*) ou estimations de la population" doesn't seem right: https://fr.wikipedia.org/w/api.php?action=query&format=json&list=search&formatversion=2&srsearch=Quelle%20est%20la%20capitale%20de%20la%20France%3F&cirrusSemanticSearch
[14:08:30] yes, saw this... not quite sure how to explain why the model likes this string so much for that question
[14:09:18] saw that you added the prompt in the model connector as a param; perhaps this does not work as expected?
[14:12:40] dcausse: hmm, maybe. I suppose i could repoint it at relforge and use tcpdump to verify what comes over the wire
[14:13:52] it's also much faster now somehow, 50 qps w/ 40G of indexes on 17G of disk cache. Potentially related to upgrading 3.3 -> 3.5, but not certainly
[14:16:01] oh wow
[14:16:23] that's also w/ 64kb readahead instead of the default 8, but my earlier tests all capped at 35 which seemed like a cpu limit
[14:16:28] 8mb
[14:17:17] the awkward place it gets to now instead is that it will cap out the heap and reject requests if locust sends too many requests.
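The readahead tuning discussed above can be inspected and applied through sysfs (`/sys/block/<dev>/queue/read_ahead_kb`). A minimal sketch, not the actual procedure used here: the device name is a placeholder, the `sysfs_root` parameter exists only to make the helpers testable, and writing the value requires root:

```python
from pathlib import Path


def read_ahead_kb(device: str, sysfs_root: str = "/sys/block") -> int:
    """Return the current readahead for a block device, in KiB."""
    return int(Path(sysfs_root, device, "queue", "read_ahead_kb").read_text())


def set_read_ahead_kb(device: str, kb: int, sysfs_root: str = "/sys/block") -> None:
    """Set readahead in KiB (needs root); comparable to `blockdev --setra`,
    which works in 512-byte sectors rather than KiB."""
    Path(sysfs_root, device, "queue", "read_ahead_kb").write_text(str(kb))
```

Note the setting does not survive a reboot, which is presumably why it had to be re-applied later in the day; persisting it would need a udev rule or similar.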
Didn't decide what to do with that yet
[14:17:55] heap is relatively low at 4g IIRC
[14:17:56] probably limit cirrus-side concurrency, although we have the awkwardness that eqiad and codfw have separate poolcounters (iirc)
[14:17:57] OpenSearch 3 also supports gRPC https://docs.opensearch.org/latest/api-reference/grpc-apis/index/ , might be something to play with if MW supports that
[14:18:24] inflatador: it might be interesting, but i suspect (without looking) that grpc in php is tedious
[14:18:48] 50qps is probably more than enough at this point, and yes, a poolcounter might be enough to avoid entering this failure mode
[14:19:50] other things still pending... we should get envoy set up for the cluster; right now codfw<->eqiad has to do the TLS setup round trips every time instead of holding a connection open. But we might not have the query rate to keep the connection open anyways
[14:20:12] ACK, if no one else is using gRPC in MW, we probably don't wanna be the first
[14:20:40] grpc.io has a php quickstart, so it's at least possible. Something to ponder but probably only saves 10 or 20ms
[14:21:30] every time you ask something related to a city you get a wall of "Recensements (*) ou estimations de la population" :/
[14:23:34] it's the Demographie template
[14:25:10] I wonder if this causes hnsw to go wrong if there are many items with very similar vectors
[14:32:06] hmm, maybe?
[14:32:12] but generally speaking a passage "Recensements (*) ou estimations de la population" does not bear anything useful outside of the following table...
[14:32:40] maybe look for and exclude duplicate paragraphs as template stuff?
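The duplicate-paragraph exclusion suggested above could be a simple counting pass over extracted passages before they are embedded. A hedged sketch, not the eventual implementation: the function name and the threshold default are illustrative, and it only catches exact duplicates, so near-duplicates with slight variations would need normalization or near-duplicate hashing first:

```python
from collections import Counter


def drop_template_passages(passages: list[str], max_repeats: int = 20) -> list[str]:
    """Drop passages whose exact text appears more than max_repeats times
    corpus-wide; anything repeated that often is almost certainly template
    boilerplate (a table header/footer) rather than real article content."""
    counts = Counter(p.strip() for p in passages)
    return [p for p in passages if counts[p.strip()] <= max_repeats]
```

For example, a corpus where the Demographie table header occurs on thousands of city articles would keep the surrounding prose but drop every copy of the header.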
[14:33:08] i dunno what the limit would be, but if you have 20+ paragraphs with exactly the same content it seems a reasonable guess that none are good search results
[14:33:14] yes, possibly, tho sometimes there are slight variations which I'm not sure where they come from
[14:34:26] but indeed a separate pass to attempt to detect/remove duplicates could not hurt... if a passage is repeated many times it's probably a header/footer of some sort
[14:34:56] bm25 would naturally downrank those, but here they tend to be noisy
[14:35:24] yea, makes sense; i hadn't thought of it, but bm25 is indeed downranking duplicate content
[14:38:15] school run, back in a few
[14:59:27] back
[15:51:11] I can't make the search meeting this week, sorry
[15:53:53] school run, will be late for the meeting
[16:03:56] errand, back in ~45
[17:31:20] is there any work in progress in the dse-k8s cluster? we just got paged
[17:31:25] https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:31:53] inflatador: ^
[17:32:03] i'm not aware of anything, but I'm not deeply involved there
[17:33:50] volans: I am not aware of any issues either, but will ping my team in #data-platform-sre Slack
[17:34:18] ack, thx; looked like both dse-k8s-ctrl1001 and dse-k8s-ctrl1002 were unreachable, I'm checking things
[17:34:51] they are reachable again
[17:35:58] Thanks for the ping, I notified everyone in Slack but I'll take a look if no one responds within a few min
[17:36:56] see -sre for more live context
[17:52:01] inflatador: hmm, seems something is up with the semantic-search cluster, fetching the banner just stalls. Logs are complaining about discovery failing at 17:36 (~15 minutes ago), seems plausibly related
[17:52:18] [2026-03-09T17:38:18,778][WARN ][o.o.d.SeedHostsResolver ] [opensearch-semantic-search-masters-0] failed to resolve host [opensearch-semantic-search-discovery]
[17:52:38] suggests in-cluster dns is down?
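The SeedHostsResolver warning above boils down to a failed DNS lookup, which can be reproduced outside the JVM with a plain resolver call. A quick sketch; the service name comes from the log line, and the port is an assumed transport port:

```python
import socket


def can_resolve(host: str, port: int = 9300) -> bool:
    """Return True if DNS can resolve host: roughly the lookup the
    opensearch seed-hosts resolver performs for the discovery service."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        return False
```

Run from a pod inside the cluster, a False for `opensearch-semantic-search-discovery` would point at cluster DNS (or a headless service with no ready endpoints) rather than at opensearch itself, which fits CalicoKubeControllersDown alerting at the same time.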
[17:52:47] * ebernhardson doesn't understand how k8s dns works :P
[17:54:05] (MediawikiContentHistoryReconcileEnrichJobManagerNotRunning is also alerting, in addition to CalicoKubeControllersDown)
[18:28:12] dinner
[18:31:58] ebernhardson: the dse-k8s cluster is back
[18:40:31] thanks!
[19:21:18] inflatador: could i get you to re-apply the 64kb readahead again?
[19:31:42] ebernhardson: on it, 1 sec
[19:33:51] awesome, thanks
[19:34:17] ebernhardson: np, it should be active now
[20:02:59] meh, indeed the instructions in the connector don't work as expected; it gets a null instead of the actual instructions. Will work up a patch for cirrus to provide them from there, seems more robust
[20:03:10] (and it avoids manually encoding json with a string builder)
[20:10:20] * ebernhardson suspects the reduced token count could also explain the increased qps
[20:20:38] yea, the weird results for "capital of France" were indeed the instructions; looks much better now
[20:48:56] qps still looks good; it actually maxes around 70qps w/ 10 users and then starts coming back down
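On the earlier observation that bm25 naturally downranks duplicated content: its IDF term shrinks toward zero as a passage's terms appear in more documents, something the embedding retrieval has no built-in equivalent of. A sketch of the standard BM25 IDF (the form Lucene documents; the exact formula in a given OpenSearch version should be double-checked):

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25 inverse document frequency, log(1 + (N - df + 0.5) / (df + 0.5)):
    a term present in nearly every document contributes almost nothing."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
```

A term unique to one article out of 10,000 scores orders of magnitude higher than one from a template header stamped onto 9,000 of them, which is why the boilerplate passages only became a visible problem on the vector side.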