[07:03:24] sigh, readahead issue again...
[07:04:01] opensearch-madvise runs only every 10 or 20 mins but with shard relocation it's probably not frequent enough
[07:14:45] supposedly lucene should set madv_random with lucene 9.12.3 & java 21... but perhaps the jvm arg is not being passed correctly?
[07:59:22] https://lucene.apache.org/core/9_12_3/core/org/apache/lucene/store/MMapDirectory.html claims we need to pass --enable-native-access=module OR all for it to work, but our jvm.options is kind of obsolete apparently
[08:32:51] hm, not clear why read ahead settings are different across nodes: cloudelastic1007/1009 still reports 8386560 and 11 & 12 report 6144
[08:35:46] re lucene and madvise, I don't see evidence that it needs --enable-native-access, at least elastic jvm.options do not have it, they even enable more aggressive read-aheads by default (https://github.com/elastic/elasticsearch/blob/main/distribution/src/config/jvm.options#L65)
[08:36:35] we should definitely review our jvm.options, seems like we've been carrying around the same set of options for years
[08:38:21] anyways, won't touch anything for now, it's backfilling at least... metrics report around 4 days to fully absorb the backlog
[13:18:31] \o
[13:21:22] yea it's amazing how many times readahead comes up. Maybe we need a checklist for IO saturation that starts with "double check the readahead" :P
[13:25:33] also i don't know why i didn't realize this before... but the regex acceleration we have is based off google code, and they published both blogs and code about how it works: https://swtch.com/~rsc/regexp/regexp4.html
[13:25:55] so i guess i will have to dig through that to understand where our impl is going wrong
[13:25:57] o/
[13:27:19] re readaheads, I'm still puzzled by the madvise utility: did it just help to apply the read-aheads quicker, or is it necessary to have MADV_RANDOM on all lucene files?
[13:27:45] from what I can see, by default lucene applies MADV_RANDOM only on a subset of file types
[13:27:59] dcausse: MADV_RANDOM basically turns off readaheads, it's a kernel hint that we expect to use random access
[13:28:54] but do we know how it performs with reasonable readaheads and default lucene behavior for madvise?
[13:28:55] it works for mmap because the readahead gets copied into a dedicated per-mmap data structure when mmap is first called, my understanding is madv_random reaches into that mmap data structure and changes the readahead
[13:29:02] hm
[13:29:44] because I just stumbled on https://github.com/apache/lucene/issues/14408 where they discuss that MADV_RANDOM might also come with some drawbacks
[13:30:43] i'm reading
[13:33:05] the vector stuff i've definitely seen, semantic search ran terribly when i completely disabled read ahead
[13:35:50] and https://github.com/elastic/elasticsearch/blob/main/distribution/src/config/jvm.options#L66 where elastic seems to explicitly disable madv_random
[13:35:50] kinda sad, the ticket reads as giving up :( They need different results under different circumstances, declared it very complicated, then someone decided giving the user two options and no fine-grained control solves the problem of being complicated? (not really...)
[13:36:01] yes...
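A minimal sketch of the MADV_RANDOM hint discussed above (13:27-13:29), assuming the plain mmap/madvise syscalls; this is not the opensearch-madvise tool itself (its internals aren't shown in the log), and the file path is a made-up example:

```python
# Map a Lucene segment file and tell the kernel we expect random access, which
# effectively disables readahead for this mapping only. The hint lives in the
# per-mapping state created at mmap() time, so the process that owns the
# mapping can flip it at runtime without remounting or restarting anything.
import mmap
import os

PATH = "/srv/opensearch/data/nodes/0/indices/example/0/index/_0.cfs"  # hypothetical path

fd = os.open(PATH, os.O_RDONLY)
try:
    size = os.fstat(fd).st_size
    with mmap.mmap(fd, size, prot=mmap.PROT_READ) as mm:
        mm.madvise(mmap.MADV_RANDOM)  # hint: random access, don't read ahead
        _ = mm.read(4096)             # subsequent faults should pull single pages
finally:
    os.close(fd)
```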
[13:36:44] i do wonder if part of it is that our use case is not the primary use case, but i never know
[13:37:06] for example the amazon person on the ticket was part of amazon product search, so also doing the same as us
[13:37:10] similar at least
[13:37:55] maybe we can't get rid of our tool :S might have to add something about file extensions to skip, like .vec
[13:39:10] so mainly I'm just wondering: is opensearch-madvise just a workaround for bogus default readaheads or strictly required for us? could the perf be better/acceptable if the default readahead was sane and without this utility running?
[13:39:53] dcausse: it's absolutely a workaround for over-aggressive readaheads. IIRC when i benchmarked way back in the day it just kept getting better when i lowered the system readahead, and better still with madv_random
[13:40:00] but that was many lucene/elasticsearch versions ago
[13:40:13] i'm trying to remember why we didn't only change the kernel readahead
[13:40:49] a file type exception would require giving up on compound files (.cfs)
[13:41:08] i didn't realize it had that :S hmm
[13:41:44] in theory... as long as the readahead is low enough that IO doesn't explode, that's probably where we need to be. Not sure how to be certain though
[13:42:39] for cloudelastic it might not be obvious to see, with sane read-aheads it might just work OK with default lucene madvise, but for prod it's harder to anticipate
[13:42:49] i wonder if i was doing the MADV_RANDOM thing simply because I was familiar with the idea, and you can reach in and change that at runtime, whereas the mmaps usually require a full cluster restart (which was a multi-day effort back then)
[13:43:44] yes it's kind of powerful to be able to alter runtime perf preferences on mapped files
[13:44:43] oh, the other thing to keep a check on when we do prod: cloudelastic spent hours shuffling shards around. I found evidence that lucene/opensearch changed behaviour around shards. They now also balance primary count across the cluster, and a few other things got integrated into the shard movement bits
[13:45:02] I suppose my worry is cloudelastic was only 6 hosts, if we do it on 50+ that might be a ton of network traffic
[13:45:13] sigh...
[13:45:16] but it might have instead been related to shards dropping out of the cluster
[13:46:30] ah, heavy shuffling might also explain some things: opensearch-madvise is a timer, but the bogus read-aheads might have had time to do their bad things on moved shards before the timer kicked in
[13:46:52] yes, also combined with the default query pattern of "run 9k shard queries"
[13:47:09] (i forget how many, it's probably not actually 9k, but it's a lot :P)
[13:47:10] yes...
[13:48:32] additional question is how could we have an 8386560 readahead in the first place, that's 4GB... something's definitely broken in our tooling?
[13:54:39] yea... it seems like it was trying to be set to 8MB, but it was sectors. I don't understand where that came from either, it's a crazy number
[13:57:24] yes, found nowhere where it's explicitly set, wondering if there's some odd calculation that went completely off the rails because of the raid setup
[14:03:00] i found some evidence that the kernel scales up readahead on raid 0 based on the number of disks, but still, that's a *3 multiple. Doesn't explain enough
[14:04:02] yes... also Ben found that when fixing the udev rule, 32 was applied instead of the expected 16... why *2? so weird...
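A sketch of the file-extension exception floated at 13:37/13:40 above: hint most Lucene segment files but leave vector data alone, since semantic search reportedly ran terribly with readahead fully off. The extension list is a guess, not what opensearch-madvise actually does, and as noted above it can't help when vectors live inside a compound .cfs file:

```python
# Enumerate segment files that should get the random-access hint, skipping
# vector data. SKIP_SUFFIXES is a best-effort guess at vector-related
# extensions in recent Lucene versions, not a confirmed list.
from pathlib import Path

SKIP_SUFFIXES = {".vec", ".vem", ".vex"}  # hypothetical: vector data / metadata / graph

def files_to_hint(index_dir: str):
    """Yield files that should get MADV_RANDOM; vector files keep the default."""
    for f in sorted(Path(index_dir).glob("*")):
        if f.suffix in SKIP_SUFFIXES:
            continue
        yield f

if __name__ == "__main__":
    for f in files_to_hint("/srv/opensearch/data/nodes/0/indices/example/0/index"):
        print(f)
```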
[14:12:21] don't quite understand, the kernel does `io_opt = (mddev->chunk_sectors << 9) * (mddev->raid_disks)`, and then `ra_pages = max(io_opt * 2 / PAGE_SIZE, VM_READAHEAD_PAGES)`. My reading is it's basically pulling some state from the disk, multiplying and adding across disks
[14:13:09] that led to: https://lkml.org/lkml/2025/7/29/1184 "md/raid0,raid4,raid5,raid6,raid10: fix bogus io_opt value" in jul 2025. We are on the 6.12.74 release from feb 26 though... unclear what kernel versions that patch is in or if it's relevant
[14:15:27] the patch description lines up though, mentioned `/sys/block/md0/queue/optimal_io_size` having a crazy high value. On 1008 that is set to 16773120
[14:16:08] on cloudelastic1011, where we didn't have problems, that reports 0
[14:18:27] seems related at least
[14:23:10] claude suspected it was firmware versions, checking all the instances: the nodes with massive readahead have fw=DZ02, the nodes with "normal" readahead have fw=0141
[14:23:15] ssd firmware
[14:24:43] with other consequences, but seems like large values in /sys/block/md0/queue/optimal_io_size caused other issues (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1207150)
[14:24:59] they also report different models, the "good" instances report "model=INTEL SSDSC2KB01", the "bad" instances report "model=HFS1T9G3H2X069N"
[14:25:27] the model string is apparently a DELL disk
[14:25:49] hmm
[14:26:56] from filipo: 'Or an "optimal i/o size" of ~4GB (!), which is clearly wrong and confuses lvm. The raid component devices report 16MB:'
[14:33:17] inflatador: if you get a chance, can you have cumin run this across the fleet? I suspect it doesn't apply to the main clusters, but wanted to check: https://phabricator.wikimedia.org/P92330
[14:33:25] probably just one cluster is fine
[14:39:22] ebernhardson sorry, catching up
[14:41:14] inflatador: the main theory is 1008/1010 have different disks than the other hosts, and a bug in the linux kernel with those disks' firmware is the source of the readahead
[14:43:37] 1007 & 1009 also had bogus readaheads and they run HFS1T9G3H2X069N as well
[14:43:53] hmm, curious they didn't blow up like the other two
[14:44:21] suspecting they got luckier with opensearch-madvise?
[14:44:50] hmm, yea that's possible. 1008 was constantly on the recoveries list. With them dropping shards then re-initializing them, seems plausible
[14:45:22] maybe once a node or two slowed down enough it just wasn't pushing the others anymore because the whole cycle was slow
[14:45:31] sure
[14:45:42] 1009 did blow up for a while if you look at the dashboard
[14:48:37] feels like closing in on a cause at least :)
[14:49:08] As far as firmware and partition alignment, why would we be affected on OpenSearch 2.x/Trixie but not on OpenSearch 1.x/Bullseye? Is it just the shard shuffling behavior, changes to mmap handling in Lucene, something else? Sorry if I missed something that was already said
[14:50:17] inflatador: the theory is the kernel decides something called `io_opt` based on information the firmware reports to the kernel, but the kernel must be getting some information back that is incorrect because it results in a very large io_opt value. That io_opt value is then used to determine the default readahead
[14:51:24] when io_opt is 0, like the intel ssds, then we get the kernel default readahead. But with the dell ssds it's reporting a high value, which results in an overly massive readahead. I think it's the same problem as https://lore.kernel.org/all/ywsfp3lqnijgig6yrlv2ztxram6ohf5z4yfeebswjkvp2dzisd@f5ikoyo3sfq5/
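Illustrative arithmetic for the two kernel expressions quoted at 14:12:21, showing how a bogus optimal_io_size snowballs into a huge default readahead. The io_opt values are taken from the conversation (16773120 on md0 of 1008, "~4GB" as quoted from filipo); treating them as inputs to this exact formula is an assumption, not a measurement:

```python
# ra_pages = max(io_opt * 2 / PAGE_SIZE, VM_READAHEAD_PAGES), per the quoted kernel code.
PAGE_SIZE = 4096
VM_READAHEAD_PAGES = 128 * 1024 // PAGE_SIZE  # kernel default of 128 KiB => 32 pages

def ra_pages(io_opt: int) -> int:
    return max(io_opt * 2 // PAGE_SIZE, VM_READAHEAD_PAGES)

for label, io_opt in [
    ("io_opt = 0 (intel ssds, falls back to kernel default)", 0),
    ("io_opt = 16773120 (~16MB, md0 on cloudelastic1008)", 16_773_120),
    ("io_opt ~ 4GB (the 'clearly wrong' value filipo quoted)", 4_293_918_720),
]:
    pages = ra_pages(io_opt)
    print(f"{label}: {pages} pages = {pages * PAGE_SIZE // 1024} KiB readahead")

# Output: 32 pages = 128 KiB; 8190 pages = 32760 KiB; 2096640 pages = 8386560 KiB.
# The last figure lands in the same ballpark as the 8386560 readahead seen on the
# bad nodes, whatever unit that sysfs number was actually reported in.
```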
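The P92330 paste itself isn't visible in the log, so the following is only a sketch of the kind of per-host check being requested at 14:33: dump optimal_io_size, read_ahead_kb and the disk model for every block device, so the bogus values stand out when run fleet-wide via cumin:

```python
# Print queue limits and model string for each block device on this host.
from pathlib import Path

def read_or_na(p: Path) -> str:
    return p.read_text().strip() if p.exists() else "n/a"

for dev in sorted(Path("/sys/block").iterdir()):
    print(
        dev.name,
        "optimal_io_size=" + read_or_na(dev / "queue" / "optimal_io_size"),
        "read_ahead_kb=" + read_or_na(dev / "queue" / "read_ahead_kb"),
        "model=" + read_or_na(dev / "device" / "model"),  # absent for md devices
    )
```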
[14:52:30] smells like we might need to update a bunch of prod hosts' firmware soon, oh joy
[14:52:52] i think it would be the SSD firmware, but hard to be completely certain since the disks that work and fail are completely different models
[14:53:48] also a similar problem fillipo found: https://phabricator.wikimedia.org/T407586
[14:53:52] in trixie
[14:54:00] (from david earlier)
[15:11:20] will ship the updater with the opensearch 2 client to prod, hoping it'll just work, I don't really want to maintain 2 versions while we upgrade the cluster...
[15:11:31] +1
[15:14:23] Any objections to inviting Niklas to the Weds mtg tomorrow to talk ttmserver migration/translatewiki?
[15:14:31] fine by me
[15:20:12] sure
[15:20:42] OK, done
[15:20:56] Not sure if he will actually show, he mentioned in Slack that he might have to skip part of another mtg
[15:21:55] we are flexible on timing, although i suspect nik is in EU. He could come an hour later and we will still be there
[15:22:00] but later isn't always better
[16:10:57] meh, asked claude to compare the google codesearch trigram acceleration implementation to ours, see if it finds anything suspicious. It spun for 20 minutes and ran out of quota :P
[16:11:57] i was going to anyways, but i guess that means i need to figure out how that codebase works
[16:13:13] :)
[16:19:25] q: I'd like to be able to do a `morelike` search on a draft article but returning articles in the main namespace. is this possible? I've got it working to return similar articles from the draft namespace but it seems constrained to that (also I'm not sure why I need to use `morelikethis` in the below example).
[16:19:27] example: https://en.wikipedia.org/w/index.php?search=morelikethis%3A%22Draft%3AArmor_Wars_(film)%22&title=Special%3ASearch&profile=advanced&fulltext=1&ns0=1&ns118=1
[16:20:53] isaacj: hmm, morelike and morelikethis are the same thing, the main difference is that morelike is legacy and greedily eats the entire string. With morelike, the query `morelike:Foo intitle:bar` would try to find the title `Foo intitle:bar`
[16:20:55] the other part, lemme check
[16:22:31] hm.. morelike might not work crossindex, the rewrite part might happen too late?
[16:24:01] oh, are different namespaces also partitioned separately (and not just wikis, which I was aware of)?
[16:24:16] isaacj: ahh, the morelike vs morelikethis for your query is because morelike also doesn't support quotes. Basically morelike couldn't be fit into our more normalized query syntax, but also couldn't be removed due to usage, so it's just kind of an awkward thing that doesn't fit the normal expectations of the query language
[16:24:55] yes, there is an index for content, and then an index for everything else. The justification was that everything else has way more text than the actual content, and it would throw off all the term statistics. Having a separate content index ensures the term stats represent the corpus
[16:25:05] excellent - one mystery solved! I had always been confused about morelike vs. morelikethis but that helps understand how to select between them
[16:25:48] and yes i think david is right, when we issue the query we tell the search engine "more like page 12345", but page 12345 isn't in the content index
[16:25:48] we used to have a separate method of pulling the text from mw and shipping morelike with the text, but I think we dropped that feature, we could possibly re-add it
[16:26:42] seems reasonable to detect the page is not in the index, we still have the code for fetching content (enabled by a dev flag)
[16:27:10] isaacj: do you want to file a ticket? No promises on when, but it shouldn't be too hard
[16:29:10] ahh drat, and I see that Draft isn't considered a content namespace either. dcausse: that would allow for cross-index searches then, like what I want?
[16:29:41] if so then yes, happy to file a ticket (and no urgency -- this was a prototype to help find improvements that could be made to Draft pages by finding similar ones that are published articles to pull recommendations from)
[16:30:48] isaacj: yes, morelike is basically two steps: fetch the content, and then run a query with this content to find similar candidates. Fetching the content is currently delegated to elastic but could also be done with an extra step in MW, allowing cross-index morelikes
[16:31:14] :thumbs up: I'll file and share back here then. thanks!
[16:31:21] sure
[16:36:00] oh actually, more_like_this seems to support an "_index" param, might just do what we need?
[16:40:35] sounds plausible, ya
[16:40:52] that makes it way easier
[16:41:08] Here's a ChatGPT summary of the cloudelastic issue based off the IRC scrollback, ebernhardson dcausse does it look plausible to you? https://docs.google.com/document/d/1kM1aTar35pOq3mizCGQGhW5kK7BnjOIQjKiUVSjcNVE/edit?tab=t.0
[16:41:50] I'm mainly trying to figure out how we prevent this from happening in prod. Sounds like FW upgrades are in order, anything else?
[16:42:04] also I haven't forgotten about the cumin request, trying https://phabricator.wikimedia.org/P92330#374898 shortly
[16:46:37] yes, confirmed with a local query, and it should be fairly easy to enable cross-index morelikes. We could even consider this a bug I think; for Draft:Armor Wars (film) it likes "James Rhodes (Marvel Cinematic Universe)"
[16:48:37] inflatador: hmm, like 99% is good. I would put shard shuffling on the consequences side, rather than the cause side, but hard to say for sure
[16:48:54] https://phabricator.wikimedia.org/T425442
[16:49:53] inflatador: I'd add that profile::opensearch::cirrus::storage_device was misconfigured, causing the udev rule to not apply the custom read ahead for cirrus
[16:50:21] well... if it had worked we would not have discovered these insane readahead values on these disks
[16:50:24] I knew that was fragile when adding it :( Should have returned to it at some point
[16:50:26] isaacj: thanks!
[16:50:58] happily - thank you!
[16:52:08] dcausse ACK, I see from Slack that the udev rule is fixed. But you're saying that udev rule wouldn't have been enough even if it was correctly written?
[16:53:29] inflatador: sorry, no, I'm saying that if it had worked we would never have discovered this bogus readahead value
[16:54:31] this codesearch impl is way different from ours... it does not walk an automaton. Instead it walks a syntax tree
[16:54:42] dcausse ACK, thanks for clearing it up.
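A sketch of what the "_index" trick confirmed at 16:36/16:46 could look like: a more_like_this query run against the content index whose "like" document is fetched from the general (non-content) index. The more_like_this "like" documents do accept _index/_id, but the index names, field names, page id and host below are placeholders, not the exact local query from the log:

```python
# Find content-index pages similar to a draft page that lives in the general index.
import requests

query = {
    "size": 5,
    "query": {
        "more_like_this": {
            "fields": ["title", "opening_text", "text"],
            "like": [
                # the draft page lives in the *_general index, but we search *_content
                {"_index": "enwiki_general", "_id": "67890123"}
            ],
            "min_term_freq": 1,
            "min_doc_freq": 2,
        }
    },
}

resp = requests.post("http://localhost:9200/enwiki_content/_search", json=query, timeout=10)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```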
[16:54:43] not sure that's critically important to know in the end, but good to remember that the default readahead may go insane for some reason
[16:55:03] Yeah, I need to make that more clear in our docs ;)
[16:55:16] Here are the full results of the cumin command: https://phabricator.wikimedia.org/P92330#374916
[16:57:07] inflatador: thanks! With every single node reporting the same optimal_io_size, and including (i think, didn't strictly compare the numbers) some of the same disks as cloudelastic, it does seem like the kernel upgrade / debian upgrade are a likely cause
[16:58:13] inflatador: also related to profile::opensearch::cirrus::storage_device, but when checking a couple cirrussearch* hosts it was unfortunately on the wrong device
[16:58:36] perhaps a udev rule can work on the mount point instead of the device?
[17:00:05] Yeah, I can follow up on that too. As far as firmware, I haven't checked yet but I wonder if it's the same drives as T394348 and we just missed some
[17:00:05] T394348: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348
[17:00:57] dinner
[17:03:03] seems plausible, the description of the problem being solved doesn't line up (it says it's a drives-die-early problem) but it could have the fix
[17:22:11] Nope, looks like those hosts have fw like `DL7C` vs our `DZxx` hosts in cloudelastic. Not that that rules out needing a FW update
[17:42:34] The more i look at the google codesearch trigram method vs ours... it seems like walking the syntax tree is way simpler than walking the lucene automata. The lucene automata turns 'abc*(def)*ghi' into a complex thing, because the automata has to loop back, but the syntax version just walks the syntax
[18:07:35] but what i can say for certain is that implementation does not short-circuit the regex walking like we do :(
[18:24:26] looks like postgres has an implementation that walks an NFA though, might be relevant
[18:25:44] it's a little annoying to read though, because postgres regexes work on "colors" rather than character classes. They essentially assign sets of (potentially non-contiguous) characters to colors, then all matching works on colors
[18:26:40] they also have multiple transitions out of a state, rather than lucene, which makes intermediate states
[18:37:37] wonder if it's true, gemini claims that the regex with colors represents a logarithmic, or sometimes even linear, reduction in memory footprint of the state transition table. It's specifically about dealing with utf-8
[18:38:58] nifty idea though, it basically turns `\dabc` into the same thing as matching `abcd`, since it first maps all numbers to a single color
[18:42:28] Trey314159: appreciate your comment on https://gerrit.wikimedia.org/r/c/wikimedia/portals/+/1282941/1, thank you :) just acknowledging that i have seen it and will get back.
[19:49:06] A_smart_kitten: Sounds good!
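A toy illustration of the "colors" idea described at 18:25/18:38 above, as understood from this chat rather than from postgres's actual code: partition the characters into colors so that `\d` collapses to a single symbol, and key the transition table by color instead of by character:

```python
# Match the pattern \dabc using a color-keyed transition table: 4 states and
# 4 transitions, regardless of how many characters \d expands to. Digits share
# one color, each literal gets its own, and unmapped characters simply have no
# transition (so the match fails).
import string

def build_colors(literals: str) -> dict:
    colors = {c: 0 for c in string.digits}       # \d -> one shared color (0)
    for i, ch in enumerate(literals, start=1):
        colors[ch] = i                           # each literal -> its own color
    return colors

COLORS = build_colors("abc")
TRANSITIONS = {(0, 0): 1, (1, COLORS["a"]): 2, (2, COLORS["b"]): 3, (3, COLORS["c"]): 4}
ACCEPT = 4

def matches(text: str) -> bool:
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, COLORS.get(ch)), -1)
        if state == -1:
            return False
    return state == ACCEPT

print(matches("7abc"), matches("xabc"))  # True False
```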