[15:43:08] final bits from evaluating capacity in semsearch cluster: https://phabricator.wikimedia.org/T414623#11690020
[15:43:46] the summary is i think we could get by with 16 pods / 32g disk cache + 6g heap per pod. Gives ~610g total
[15:44:10] that would also need ~107gb of disk including the 70% fill limit
[15:49:25] 107Gb?
[15:50:05] I guess you mean 1073g? (from the task comment)
[15:50:23] 107g per node, 107*16, 1712g total
[15:50:26] s/node/pod/
[15:50:38] oh ok
[15:50:50] does this include room for a reindex?
[15:52:17] hmm, actually no that's not quite enough :( i guess we need another 700g, which works out to 170g per pod
[15:52:37] sec, updating
[15:53:21] can we have 38G pods? I thought we were limited to max 32
[15:53:55] hmm, i guess i don't know, 32g sounds very small but it depends on what's in the cluster. looking
[15:54:09] i also put options for 24 or 32 nodes, but i suppose i have a preference for fewer larger nodes
[15:54:38] +1 to fewer/larger nodes
[15:54:50] 24 nodes at 27G seems viable too if 38G is an issue
[15:54:53] inflatador: thanks!
[15:57:24] there are indeed a number of dse-k8s nodes with very small amounts (64-128g) of memory showing, not sure what the limits are
[16:01:28] I can talk to the rest of the team, but 38GB pods should be OK. The eqiad workers have between 64GB-1TB
[16:01:32] RAM
[16:03:21] one thing i'm not sure what to do with... the cluster kinda needs warmup after a rolling restart, otherwise filling the disk-cache takes a minute or two. I've been using locust for that, but we might want something that varies it by wiki or some such
[16:03:47] actually, filling the disk cache takes a minute or two with the warmup; not sure how long it would stall if it was just from natural queries
[16:06:16] there's a /_plugins/_knn/warmup/index1 but no clue how efficient that is
[16:08:03] i'm wondering what that does, if it's about loading the faiss index into memory, or if it also accesses the raw vectors.
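(The warmup endpoint discussed above is a plain GET against the k-NN plugin. A minimal sketch of calling it; the host and index names here are placeholders, not the actual cluster endpoints:)

```python
import json
import urllib.request


def knn_warmup_url(host: str, index: str) -> str:
    """Build the k-NN warmup endpoint URL for one index."""
    return f"{host}/_plugins/_knn/warmup/{index}"


def knn_warmup(host: str, index: str) -> dict:
    """GET the warmup endpoint, which loads the native library (faiss)
    indexes for the index's shards into the cache, and return the
    parsed JSON response."""
    with urllib.request.urlopen(knn_warmup_url(host, index)) as resp:
        return json.load(resp)


# Hypothetical usage after a rolling restart:
# knn_warmup("http://localhost:9200", "index1")
```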
I suppose i don't know for sure, but i've been assuming the raw vectors are where it's stalling
[16:10:22] i think it's just the faiss bits, but probably worth trying. From docs: The warmup API operation loads all native library indexes for all shards (primaries and replicas) for the specified indexes into the cache, so there's no penalty for loading native library indexes during initial searches.
[16:12:15] sigh.. does not seem super useful indeed...
[16:14:33] one other randomly awkward bit: you can't change a model/connector that's deployed. So in opensearch_config.py we will probably need two variants of the same model/connector so it can be changed, and then the default search pipeline updated. Otherwise we would have to temporarily undeploy the model
[16:15:41] yes... some orchestration is required on this front too...
[16:16:20] i had pondered making it natively handle that... but it seemed like it would get awkward and we are better off with a/b variants
[16:16:30] but maybe not call them a/b, has other implications :P
[16:20:43] :)
[16:21:59] did we fix opensearch perms so that we're not forced to use the operator user?
[16:22:09] not yet ;(
[16:22:19] ack, no worries
[16:39:21] dcausse: opensearch_config.py can now apply roles/role groups. I've used that to give the anon user access to load models / msearch / etc.
[16:45:46] that's how cirrus queries without auth
[16:46:01] oh ok
[16:47:20] nice!
[16:49:01] T416714 is the task for creating a separate admin user, but it's blocked on T417328
[16:49:01] T416714: OpenSearch on K8s: Create separate admin user for cluster operations - https://phabricator.wikimedia.org/T416714
[16:49:02] T417328: Explore K8s-native OpenSearch user management - https://phabricator.wikimedia.org/T417328
[18:14:51] somehow I can't trigger the semantic builder from Special:Search; the profile is somewhat hidden when running on web?
[18:15:07] dcausse: oh, i should have filed a ticket.
That's the interwiki search
[18:15:14] ah ok
[18:18:11] ebernhardson: did you somehow fix the connector to ship the instruction? I seem to see better results but I don't think https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1249389 got backported?
[18:18:27] dcausse: yes, i moved the text into the function itself rather than trying to use a parameter
[18:18:35] oh ok
[18:18:51] wish there was more visibility, had to set up llama on relforge and tcpdump the incoming connections to see what it actually sends :P
[18:19:10] sigh... yes, this connector thing is a giant mess tbh
[18:21:13] a bit worried that the index does not get updates... I might index a new snapshot before I get this pipeline running
[18:21:27] esp. given the current events
[18:21:59] yea, seems sensible
[18:22:17] i glanced over the patch for building dumps, will get a review in
[18:30:18] thanks!
[18:30:55] but no rush tho, I'm still on opensearch bulk import atm...
[19:17:10] dinner
[20:08:38] meh, it looks like the mjolnir problem is that discovery.query_clicks_daily has nulls for session_id
[20:13:23] They started going missing at discovery.query_clicks_daily/year=2025/month=8/day=28 but not seeing anything special about that date
[21:14:58] * inflatador is wondering why I get `java.io.FileNotFoundException: /usr/share/opensearch/config/opensearch-performance-analyzer/plugin-stats-metadata` on my local minikube, but the image works fine in prod
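(A quick way to pin down where the session_id nulls start is a per-partition null count. A pure-Python sketch of that check; in practice this would run over discovery.query_clicks_daily in Spark, and the field names below are assumptions inferred from the year=/month=/day= partition path mentioned above:)

```python
from collections import Counter


def null_session_counts(rows):
    """Count rows with a null session_id per (year, month, day) partition.

    `rows` are dicts standing in for query_clicks_daily records;
    the year/month/day fields mirror the hive partition columns.
    """
    counts = Counter()
    for row in rows:
        if row.get("session_id") is None:
            counts[(row["year"], row["month"], row["day"])] += 1
    return dict(counts)


# Illustrative sample: nulls begin on 2025-08-28
sample = [
    {"year": 2025, "month": 8, "day": 27, "session_id": "abc"},
    {"year": 2025, "month": 8, "day": 28, "session_id": None},
    {"year": 2025, "month": 8, "day": 28, "session_id": None},
]
null_session_counts(sample)  # → {(2025, 8, 28): 2}
```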