[14:16:02] \o [14:27:08] .o/ [14:28:27] Almost done with the OS 3.5 image (at least I hope so). Running into a weird issue where it can't find the ICU plugin, I can't seem to reproduce it outside the container [14:28:38] huh, weird [14:29:09] It's probably something silly like I set the wrong URL [15:05:49] OK, got it. ebernhardson should I use this version of innerhits or do you want to publish a new one? https://gitlab.wikimedia.org/repos/search-platform/opensearch-innerhits/-/pipelines/166021 [15:06:57] inflatador: hmm, looks like it would miss the getType() fix, i'll get a new release going for 3.5.0-wmf7 [15:10:23] in theory it's only reverting the 3.3.2 patches, probably :) [15:24:51] inflatador: https://gitlab.wikimedia.org/repos/search-platform/opensearch-innerhits/-/packages/1979 [15:30:10] ACK, will take a look [16:25:39] ebernhardson: did you rebase your CRs? [16:26:10] pfischer: i squashed one of the patches into another, getting three patches through the patch deploy window might be a bit tedious so brought it down to 2 [16:26:24] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1229644/7 is abandoned and two others are *not current* [16:26:27] basically merged the smaller patches, so query routing and query building are now in the same patch [16:27:06] pfischer: "not current" means the latest patches aren't the same git hash, if you go to the latest patch it should be correct: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1229645/11 [16:27:35] no code changes, just a squash of the two smaller patches [16:28:49] i suppose i should have asked, wasn't sure if you had started looking at them yet and was pondering how it can take 30+ min to deploy an extension patch during the deploy windows (although sometimes it's better, i know they've done work to improve that but rarely use it) [16:29:21] No worries. I just got an error when I wanted to submit my comments. 
[16:30:22] hmm, i thought we could still submit comments to older and/or abandoned patch sets. interesting [16:31:28] Also forgot to mention at meeting: I have a one-liner to set ureadahead that I'll share, if things are on fire we can always set the values ad-hoc [16:31:55] inflatador: thanks, that’s good to know [16:33:32] it's also not the end of the world, latency is just bad and throughput is ~half. But my initial tests on frwiki came up with ~14qps and 700ms latency with the full readahead, and 32qps and 280ms latency with the reduced readahead. And really the 280 is still high, i'm testing with 10 concurrents but we only have enough cpu for 3-4 concurrents, the extra users queue. Latency is more [16:33:34] like 100-150ms but still 32-35qps if we bring down the concurrent users [16:33:46] 14qps is probably still plenty for the test, but that high latency bugs me [16:35:10] I'll work on the Daemonset approach, maybe it will be easier than I thought [16:35:22] in the meantime I can always run that one-liner after y'all deploy [16:37:02] inflatador: hmm, could you run it now? The other thing i wanted to test was how things go when we also have indexing running, but it didn't seem as meaningful to test that while readaheads were high [16:37:52] ebernhardson will run as soon as I'm out of mtg [16:39:18] currently we have 6 instances running with 21Gi each, =126Gi total. Gives ~100G for disk cache against 95G of indexes, so i suppose i will turn on a replica to get it to 50% index:memory ratio. [16:39:42] inflatador: thanks! Let's go with the 64kb readahead. I suspect we could potentially go lower, but it seems a bit awkward to keep setting different values to test like this [16:39:58] * ebernhardson should probably ask for root access in dse-k8s or some such [16:55:03] separately wondering what the appropriate maxK is for knn search.
I noticed yesterday while checking into how the thing santosh linked in haute-cosine worked that in opensearch with the on_disk configuration we use, it always fetches at least 100 results or 3x (configurable) the requested results from the 1-bit knn to rescore with the full 32bit embeddings [16:55:21] i randomly put 21 as the maxK, initially (matching default page size), but maybe we can increase later [16:56:29] I was wondering where that 21 came from… [16:57:40] pfischer: 20 is the page size, and we handle pagination by always requesting +1, if the extra result is there then pagination is possible (even though in this case we don't allow pagination) [16:57:41] Hm, would increasing maxK simply increase the retrieved result-set we may later re-rank? [17:00:08] pfischer: basically yes, although there are levels to the re-ranking. maxK controls the first stage of re-ranking which is the 1-bit quantized KNN search followed by reranking with the 32-bit embeddings. Re-ranking with the model would be another layer on top that should be separately configurable (although not in the current configuration) [17:00:25] * ebernhardson didn't realize that first stage of reranking existed until yesterday :P [17:01:46] but apparently the "big idea" of on_disk knn in opensearch is that they bring down the in-memory data structure to 1-bit quantized, and then pull the 32-bit vectors from disk only for somewhere between 100 and 10000 docs depending on configuration [17:03:02] it also naively suggests that our reads are going to be roughly 1024 dim * 4 bytes, or ~4kb. 100 doc rescore means pulling 100 4kb blobs from different places in the index, which kinda explains why 8mb readahead is so terrible [17:04:30] it also suggests we might want to continue trying lower readaheads, but it's currently a bit tedious to bounce around between people to get the readaheads set for different testing.
I'm sure the network overhead at some point will dominate over the sizing [17:21:04] ebernhardson you should be all set. I'll create a repo for the playbook soon [17:22:14] inflatador: thanks! [17:29:32] playbook is up at https://gitlab.wikimedia.org/repos/search-platform/sre/ansible-playbooks/dse-k8s/-/blob/main/opensearch-ureadahead.yml?ref_type=heads [19:00:17] inflatador: can we try 16kb read_ahead? [19:02:13] w/ full cluster and 64kb seeing ~2.9GB/s of ceph traffic which might be what we get, but feels high [19:16:06] or maybe go all the way to 4, since we think that's the minimum viable readahead [19:16:18] (which at that point wouldn't be readahead, just fetch the thing that was requested) [19:28:04] ebernhardson OK, it's at 16 now [19:28:30] thanks! now to remember to close/reopen [19:34:11] seeing < 500MB/s now at 35qps (seems to be ~max cpu usage). latency went down a bit too, it suggests this system hates readahead :P [19:36:01] Nice [19:36:40] it also maybe really needs some sort of warmup, because the first few requests after close/reopen take ~6s [19:39:35] Just published the 3.5 image if you want to redeploy [19:40:01] ok, i'll give it another ~5 min to finish the current test and then redeploy [19:41:05] i also wonder what limits we should be setting in cirrus, this sees max throughput at ~5 concurrent users, but not sure how that applies once we add more nodes and enwiki. 
Guessing that enwiki with lots-o-shards will have lower, but non-enwiki will continue scaling...maybe just leave it at 10 [19:41:44] after 5 concurrent users we see the same throughput but more latency (as the requests wait in a queue) [19:42:15] but i suppose on the upside, it implies we can scale throughput mostly through cpu (maybe) [19:49:26] We do have more headroom with CPU [19:54:17] * inflatador just reminded himself about https://en.wikipedia.org/wiki/Max_Headroom_signal_hijacking [19:56:14] i think at the moment we don't have need for higher concurrency, what we promised (iirc) is 5qps, so even the current setup is over-provisioned there. But i feel more comfortable knowing it can scale up as they ask for more in that aspect. [19:56:31] (well, i guess that also depends on query embedding scaling) [19:56:57] inflatador: patch to update docker image: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1247669 [20:01:22] heh, claude suggested another hack. LD_PRELOAD a .so that intercepts mmap requests and sets the madvise() [20:04:42] ebernhardson just responded on the CR, I was seeing a different image hash [20:04:58] hmm, that's what i got after pulling the image and running docker inspect on it. looking [20:05:59] oh, maybe they are two different things? I grabbed the value from `Id`, but your value matches `RepoDigests` [20:06:13] * inflatador just realized that too [20:06:31] let me try my docker pull again, you could be right [20:06:31] i was totally guessing, i guess i should figure out how to resolve that [20:07:26] no, I think it has to be my way?
`docker pull docker-registry.wikimedia.org/repos/data-engineering/opensearch:3@sha256:a40b9331a3229496c7788920d6874add1d15800df86228668c3dc13f55f274fe [20:07:26] Error response from daemon: received unexpected HTTP status: 500 Internal Server Error` [20:07:55] apparently `Id` only references your local docker daemon, and RepoDigests is the correct value [20:08:36] FWIW, I actually could see the same id with `docker image inspect` but I couldn't pull with it [20:12:20] updated [20:23:12] +2/merged [20:24:18] thanks! [20:24:57] also going to try a very dumb idea...drop a custom librandom_madvise.so into the data directories (so it survives reboots) and then LD_PRELOAD that. If it actually works to disable readahead could get it into the image, but for now just trying to see what works [20:25:30] librandom_madvise.so redefines mmap and injects the necessary madvise code while wrapping the underlying mmap [20:25:31] Nice, keep me posted [21:02:27] sigh..i broke something :P [2026-03-03T21:02:23,643][INFO ][o.o.s.c.ConfigurationRepository] [opensearch-semantic-search-masters-2] Wait for cluster to be available ... [21:02:50] claims not a quorum, but it has 5 nodes running [21:03:11] oh, maybe this from another node: Caused by: java.io.IOException: No space left on device [21:03:13] sigh [21:05:35] * ebernhardson should have probably paid more attention than just restarting when cluster was green [21:55:55] ebernhardson FYI You will have to completely wipe out the cluster per https://w.wiki/H$ru if you want to be sure readahead goes back to the default value.
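The librandom_madvise.so idea could look roughly like this. This is a sketch, not the actual shim: unconditionally advising every file-backed mapping, the build command, and the lack of error handling are all assumptions, and a real version would likely filter by path or flags:

```c
/* Sketch of the LD_PRELOAD hack described above: wrap mmap() and
 * immediately madvise(MADV_RANDOM) new file-backed mappings so the
 * kernel skips readahead on them.
 * Assumed build: gcc -shared -fPIC -o librandom_madvise.so shim.c -ldl
 * Assumed use:   LD_PRELOAD=/path/librandom_madvise.so <server ...>  */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

typedef void *(*mmap_fn)(void *, size_t, int, int, int, off_t);

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset) {
    static mmap_fn real_mmap;
    if (!real_mmap)  /* resolve libc's mmap the first time through */
        real_mmap = (mmap_fn)dlsym(RTLD_NEXT, "mmap");

    void *p = real_mmap(addr, length, prot, flags, fd, offset);
    if (p != MAP_FAILED && fd >= 0)
        madvise(p, length, MADV_RANDOM);  /* disable readahead here */
    return p;
}
```

MADV_RANDOM tells the kernel to expect random access and skip readahead for that mapping, which is exactly the per-mapping version of the read_ahead_kb knob being tuned by hand above.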
As long as the pod stays on the same worker it will keep the readahead value [22:01:14] yea i might have to delete it anyways, i can't convince -2 to come back up :S [22:01:51] one curiosity, i would have expected this to work but it has a dns failure: kubectl exec opensearch-semantic-search-masters-1 -- curl -sk https://opensearch-semantic-search-masters-0:9200/ [22:03:21] but this works, points at -0: kubectl exec opensearch-semantic-search-masters-1 -- curl -sk https://10.67.28.157:9200 [22:03:33] but running the same thing on masters-2, which refuses to come up, just stalls [22:03:42] inflatador: any idea how i might have borked the network to make that happen? [22:05:07] ebernhardson no, I didn't test the new image much [22:05:43] let me try deploying on the opensearch-test NS to see if I can reproduce [22:06:04] inflatador: shouldn't be the image, this should be underlying k8s config. maybe calico or some such, basically -2 is firewalled off from -0 [22:06:31] i don't have netcat, so have been using curl to force a network connection [22:07:01] i guess i could just delete the cluster, i asked david where the notebook was for that since i expected to break things :P [22:07:30] but i'd like to build enough knowledge to fix the clusters without a delete/reindex, that seems painful when we are in a more production-ish stance [22:10:27] yeah, there are some sharp edges we need to work on for sure [22:12:10] the other odd thing is multiple pods did come up. 0, 1 and 3 all came up. 2 refused to come up, and 4 hasn't tried (i'm guessing operator is waiting for -2 to work) [22:14:56] maybe try deleting -2? [22:15:22] inflatador: indeed, tried a couple times. Can do it again though, sec [22:16:59] inflatador: you mentioned it always loads on the same node, maybe something awkward about the host? Can you cordon the node so it doesn't get assigned new pods while i delete it?
[22:17:17] ebernhardson in a deploy now but can help in ~45m or so [22:17:20] maybe i can cordon a node, i guess i didn't actually try [22:17:22] sure [22:20:12] ahh probably not, likely requires the admin-dse-k8s-eqiad.config which is root only (understandably) [22:23:08] ebernhardson deleting the PVC might force it to try a different node. But it might not trigger a new PVC creation so you'd run the risk of hosing the cluster [22:24:04] hmm, if it only drops that single node's data it's fine. Can't hurt much i suppose [22:30:31] gotta be something awkward with networking..but i can't guess what. This works on the other nodes but not -2: kubectl exec opensearch-semantic-search-masters-2 -- curl -vsk --connect-timeout 3 https://inference.discovery.wmnet:30443/ [22:31:23] (by works i just mean connects, not responds with anything useful) [22:47:32] back [22:49:07] ebernhardson confirmed, looks like dse-k8s-worker1028.eqiad.wmnet isn't fully ready to host, will cordon [22:54:45] These hosts were just added to the cluster, ref T418582 [22:54:46] T418582: Add dse-k8s-worker102[4-8] to the dse-k8s-eqiad cluster - https://phabricator.wikimedia.org/T418582 [23:02:49] OK, it's cordoned. If you delete `opensearch-semantic-search-masters-2` it should go to a working host now [23:04:12] lol, i should have asked earlier. spent at least an hour digging around before suspecting the node was broken [23:04:34] actually 2 hours, based on how long the other pods have been up [23:08:06] If you still have problems let us know, I left a note in Slack so the EU folks should be able to follow up in their morning [23:08:07] yup, it's coming up now. And -4 is coming up now too [23:08:59] 1024 and 1025 (which were provisioned at the same time as 1028) appear to be working [23:10:40] Headed out, see ya tomorrow [23:13:22] thanks!
I'll head out soon too...just want to see if this LD_PRELOAD thing actually worked after spending this long on it [23:24:22] * ebernhardson realizes now it might be hard to tell if it was already set to 64...hmm [23:24:50] actually we set it to 16 earlier [23:33:08] hmm, first thoughts: much worse. At least, it's taking far longer to get to a steady state. usually slow requests last ~30s after startup, but it's taking much longer now [23:35:25] (or not applied at all and that's 8mb readaheads again). the /sys/.../read_ahead_kb things all report 8, so maybe it's just not working