[07:15:30] pfischer: thanks!
[08:04:44] the network issue seems to be due to cloudelastic1012 being misconfigured (T425300)
[08:04:45] T425300: lvs on cloudelastic1012 is misconfigured - https://phabricator.wikimedia.org/T425300
[08:05:25] but more annoying is that cloudelastic is now red (T425301)
[08:05:25] T425301: The cloudelastic chi cluster is red - https://phabricator.wikimedia.org/T425301
[09:27:37] errand
[10:20:16] lunch
[12:14:11] sigh, a bit pessimistic that the red indices will recover from disk... unless the data is on the last node I'll restart... :(
[12:15:23] will be 3 indices to recover from dumps, and obviously not the smallest ones: commonswiki_file, dewiki_content and commonswiki_content
[12:49:47] dcausse: Hrmpf, do we have an estimate of how long it will take?
[12:50:17] pfischer: not really... several hours for sure
[12:50:33] restarting the last node...
[12:53:50] Trey314159: Would you mind if we start our 1:1 earlier today? I have a conflicting meeting.
[12:58:29] phew, they're back... they needed a '/_cluster/reroute?retry_failed=true'
[12:58:40] ok, resuming writes
[13:14:05] not sure what to expect when monitoring the growth of /proc/$opensearch_pid/maps... it's at 25k now and keeps growing, still far from the new 1M limit
[13:15:24] o/
[13:19:11] o/
[13:24:11] afraid https://github.com/opensearch-project/OpenSearch/issues/19482 won't help with our case... can't remember if we export segment stats regularly to prometheus, looking
[13:25:58] \o
[13:27:56] nice find on lvs! that would certainly explain it.
[13:28:52] t.opranks actually found it, just saw it on another channel
[13:29:06] I depooled the node in the meantime
[13:29:30] but more concerning is this possible map leak issue...
[13:30:45] bah, the cluster is red again...
[13:30:55] hmm, i'm not sure what to expect there either
[13:34:45] i guess i can poke the regex problem again. i have a "fix", but after thinking it all through, the fix is not fixing the root cause, it's a bandaid. I think the fundamental algorithm involved has a problem where the expression is too complex for the algo but it doesn't notice
[13:35:02] but it's worked this long... maybe a bandaid is good enough
[13:36:02] like, there is a point in the algo where, if we have walked too many transitions, it just moves everything we are supposed to walk into the accept states (and all accept states are converted to trigrams and booleaned together)
[13:51:47] but it also has to have some termination point, because if you just walk the automaton, an expression like `abc.*xyz` never terminates; the .* just keeps going until we hit too many states.
[15:00:55] I am still stuck in the Enterprise exchange meeting, will be late for Triage
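A minimal sketch of the two commands referenced above, the reroute that brought the indices back (12:58) and a way to watch the mmap growth (13:14); the endpoint and pgrep pattern are assumptions:

```bash
# '/_cluster/reroute?retry_failed=true': ask the master to retry allocating
# shards that previously hit the max allocation-retry limit.
curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true' \
  | jq '.acknowledged'

# watch /proc/<pid>/maps growth against the new 1M vm.max_map_count limit;
# the pgrep pattern is a guess at the process name.
watch -n 60 'wc -l < /proc/$(pgrep -of opensearch)/maps'
```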
[16:06:45] * ebernhardson isn't a fan of modern linux mounts, there was a time when i could type `mount`, see only the disks, and know what's going on. Now there are 40 other random things there :P
[16:08:02] md2 (at least on 1009) is raid 0 (striped), which implies we should be able to get 500*3 as the actual max
[16:08:51] cluster overview shows a couple of the servers hovering at 500MB/s writes for a bit, which seems suspicious, being close to the disks' documented 540MB/s throughput, but with raid 0 that shouldn't be as meaningful
[16:10:04] other random guesses: we could push up the refresh interval, but i don't expect much from that
[16:19:26] dinner
[16:22:12] hmm, opensearch 2 changed deletes_pct_allowed from 33% to 20%, i wonder if that is triggering more merges than normal
[16:22:24] but i've changed that on prod clusters before to little fanfare
[16:29:14] one possibility (probably not for testing today): opensearch 2.x now has segment-based replication, which would cut indexing and merging load (in theory)
[16:30:48] it also looks like lucene 9 turned on merge_on_flush by default, seems like it could cause IO load
[16:40:12] my best guess though is it has to be merges, < 1k iops and 500MB/s means the writes are large, and i just can't see small segment merges doing that
[16:52:09] sigh. yea, on the instances themselves i'm seeing 1008 (while recovering shards... i don't know why) maxing everything. iostat reports 550MB/s for each of the three disks, 1.6GB/s total
[16:52:15] it's only 1008 though
[16:56:49] reduced recovery.max_bytes_per_sec on cloudelastic chi from 768mb to 256mb
[17:08:38] cluster is staying green, although it keeps shuffling indices around. dashboard shows the updater pushing 400 ops/sec. Will keep an eye on it
[17:09:14] it did drop like 6 shards when i first started up the updater though, no clue why :S Issued the same reroute?retry_failed=true while it was yellow and it recovered
[17:10:04] cool, still working on that privatemounts patch. First I need to figure out why we're not getting the opensearch2 template on cloudelastic ;)
[17:24:23] oh, we are... just need to explicitly disable the security plugin in hiera
[18:00:40] lol, cloudelastic is still shuffling shards around :P I wonder if they changed how they decide where shards go
[18:05:15] restarting psi on `cloudelastic1012` to see if the service can start with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282399 applied
[18:06:34] the answer is yes. I'm headed to lunch but will make a patch for removing the privatemount stuff when I get back.
[18:21:35] hmm, regarding routing, it looks like 2.x does have some new things around how it balances shards around the cluster. In particular they now enable primary-shard balancing, where it tries to keep an equal number of primaries on each instance
[18:21:44] not sure if that's why cloudelastic keeps shuffling, but it would make sense
[18:41:47] back
[19:08:01] ebernhardson how can I tell whether the cloudelastic streaming updater is running, is it just whether or not it's deployed in wikikube?
[19:08:17] inflatador: look at the 'actions processed rate' on
[19:08:20] https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater?orgId=1&from=now-15m&to=now&timezone=utc&var-k8sds=000000017&var-opsds=000000006&var-site=eqiad&var-app=flink-app-consumer-cloudelastic&var-operator_name=rerender_fetch_failure_sink:_Writer&var-fetch_operator_name=fetch&var-fetch_operator_name=rerender_fetch&var-quartile=0.5&var-fetch_by=authority&var-top_k=1
[19:08:26] oh, that's a terrible url... but the 'Cirrus Streaming Updater' dashboard
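The recovery throttle being turned here (768mb down to 256mb, later 128mb and 20mb) goes through the cluster settings API; a sketch, with the endpoint as an assumption:

```bash
# cluster-wide cap on shard-recovery traffic; dynamic, takes effect
# immediately for new recovery chunks
curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"indices.recovery.max_bytes_per_sec": "256mb"}}'
```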
[19:11:29] ebernhardson ACK, so it looks to be on then. And the "Consumer (Flink) Kafka Lag" is the backpressure? Just asking because I'm about to disable the privatemounts stuff and wanna make sure I know how to measure the difference
[19:12:35] yes, the lag is from being turned off for multiple days, and will probably take multiple days to clear. But as long as it's going down it's probably ok
[19:14:38] cool, I'll also be looking at I/O PSI https://grafana.wikimedia.org/goto/cfl2pi7vjnpj4c?orgId=1
[19:15:48] school run, back in ~30
[19:32:41] I don't get it. Apparently the hiera change from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282399 didn't do anything, even though PCC said it would. I guess I should've looked at the actual template after Puppet ran
[19:58:41] back
[20:05:19] cloudelastic is STILL shuffling shards around... we are going to have to expect that on the prod clusters too
[20:05:44] ebernhardson I'm roll-restarting the cluster now, so that explains at least some of it
[20:06:01] oh, ok, that might partially explain it
[20:15:13] cloudelastic-chi just went red :S
[20:16:16] issued a reroute force and it's now recovering more... but we need to understand that better :S
[20:16:31] I restarted 1008, it should recover shortly
[20:16:41] yea, but it still shouldn't go red
[20:16:53] that means all 3 copies are missing
[20:17:19] But yes, I agree. since we started the whole yubikey thing I can't use ansible, and it makes it a lot clumsier to coordinate this stuff, esp since our puppet code and cookbooks seem to be broken ATM
[20:18:02] I need to start a ticket for the cookbooks, apparently OpenSearch 2 doesn't like 'True', you have to give it 'true' in the API
[20:18:56] I think what's happening is that the node holding the primary is so bogged down it doesn't respond in time, and the rest of the cluster decides the shard is gone
[20:19:05] hmm, seems possible
[20:19:56] that's what I was seeing in the logs last week, errors kinda like "I have version 6 of this shard, but I've seen version 7 before so I won't promote this one"
[20:23:45] so far the privatemount stuff doesn't seem to have made a difference, but I'm not going to restart the main cluster again until it's out of the red. Will add my notes to the ticket
[20:26:51] hmm, active_shards_percent was going up but now it went back down again :S
[20:27:02] but there are a lot of transfers active as well
[20:27:39] i think the problem is it's just too much IO still, i see two of the hosts hitting 1.5GB/s on the disk
[20:28:07] reduced indices.recovery.max_bytes_per_sec to 128mb on cloudelastic-chi
[20:28:13] (from 256, was 768 this morning)
[20:28:35] ACK, I stopped the small clusters on 1008
[20:29:17] not sure how long that will take to have an effect, not seeing any immediate change :S
[20:29:24] 1008 is the one that's completely stalled out on I/O
[20:29:33] yea, 1008 and 1010 are both seeing 1.5GB/s
[20:29:45] 3 disks, rated at 540MB/s, so i'd say they are maxed :)
[20:30:02] (via iostat)
[20:31:22] going to try 20mb/s, not for long but just to make sure it does something...
[20:32:09] still 1.5GB/s :S
[20:32:15] give it a few minutes i guess..
[20:33:02] Still trying to understand what changed
[20:35:21] i'm confused too :( I was thinking maybe merges are dominating, but there is only 1 merge running in the cluster
[20:35:51] been 4-5m since the shift to 20mb/s, disk is still at 1.5GB/s
[20:35:59] i guess i should be cross-referencing network traffic...
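The 'True' vs 'true' complaint at 20:18 looks like strict boolean parsing in the settings API; a hedged illustration (the endpoint and the particular setting are assumptions, chosen only because it is a dynamic boolean):

```bash
# rejected: strict boolean parsing won't coerce the string "True", and the
# error reads roughly "Failed to parse value [True] as only [true] or
# [false] are allowed"
curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"action.destructive_requires_name": "True"}}'

# accepted: a real JSON boolean (lowercase "true" as a string also parses)
curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"action.destructive_requires_name": true}}'
```

This is why a cookbook that string-formats a Python `True` into a request body breaks against OpenSearch 2 even though older versions tolerated it.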
[20:36:58] network isn't actually that high, 1008 is almost idle on the network :(
[20:38:47] 1008 and 1010 are stalling out on I/O, I wonder if they have more commonswiki shards than the others?
[20:40:24] 1008 just stopped using its disks much, i don't know why :P
[20:40:53] figures, it was the same moment i started up pidstat to verify where it was coming from...
[20:41:54] the cluster thinks both of them have no commonswiki shards, probably because they're so unresponsive
[20:42:17] 1008 just started up the disk access again, so whatever it was stopped for a minute or so
[20:43:27] iostat is showing ~250ms of latency just accessing a disk
[20:43:42] (assuming that's due to the queue depth, 1800 pending requests)
[20:43:42] pid 1748493 (main OpenSearch process for chi) seems to be the culprit based on atop
[20:44:37] it's almost all reads, minimal writes. So that also aligns with it not being merges. i guess it could be too many recoveries that start from there? But seems unlikely with the (current) 20mb/s cap
[20:45:19] inflatador: as a random guess, can we stop the 8* nginx instances and see if stopping the query load makes a difference? When most queries access every single index, maybe that makes it worse
[20:45:39] our updates flow to 9* ports, reads from cloud are on 8*
[20:45:50] ebernhardson sure, you want me to stop everywhere or just on 1008?
[20:46:19] inflatador: has to be all of them
[20:46:48] Got it. On chi or all clusters?
[20:48:37] inflatador: chi should be sufficient, the others are small enough
[20:49:30] read rate going down
[20:49:46] well, maybe i spoke too soon, a few good 10s slices but unsure
[20:53:32] well, that broke nginx completely. Cumin seems to have problems with escaping commands
[20:55:56] poking through _tasks to see if anything looks suspicious... but it's long
[20:56:29] annoyingly everything is in nanoseconds, so i have no clue glancing at things if something ran for 10s or 100s....
[20:57:19] `unlink /etc/nginx/sites-available/cloudelastic-chi-eqiad-ro` doesn't seem to work from cumin, and entering SSH pass + pin 6 times is not great ;(
[20:58:56] oops, that's on me
[20:59:27] working up a jq bit to parse _tasks output, hopefully find something curious
[21:00:42] OK, reads are off on chi
[21:00:57] hmm, looking at things running >10s, it's all saneitizer doing direct id lookups, should be cheap
[21:01:05] they probably take 10s because of how slow IO is
[21:05:13] recovery claims nothing is running now, still maxing IO :S
[21:06:21] 1010 is back to low IO, i wish i could understand what changes when it goes low for 30s+
[21:07:17] and now it's back
[21:08:17] I'm reverting https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282409 as it doesn't seem to have an effect and we won't be able to restart the cluster 'till it's back cc ryankemper
[21:09:03] hmm, `sudo blockdev --getra /dev/md2` : 8386560
[21:09:06] on 1010
[21:09:43] trying `sudo blockdev --setra 16 /dev/md2`, see if it helps anything
[21:10:26] applied on 1010 and io dropped, 1008 is not responding yet
[21:10:39] inflatador: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1282417
[21:11:04] (not 100% certain on the 1010 drop though, it keeps going in and out of massive io)
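One possible shape for the "jq bit" mentioned at 20:59, converting the nanosecond field that made _tasks unreadable at a glance (20:56) into seconds; the endpoint is an assumption:

```bash
# list tasks running longer than 10s, longest first, seconds + action + description
curl -s 'http://localhost:9200/_tasks?detailed' |
  jq -r '.nodes[].tasks[]
         | select(.running_time_in_nanos > 10e9)
         | [(.running_time_in_nanos / 1e9 | floor), .action, .description]
         | @tsv' |
  sort -rn | head -20
```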
[21:11:39] ebernhardson is that for readahead? You might try setting it on the block devices directly instead of the MD
[21:11:57] not that I really know if that's better
[21:12:13] yea, it's the readahead, just pulling on random threads that cause IO
[21:14:28] another thread I haven't pulled on is memory usage, I see memory stalls as well https://grafana.wikimedia.org/goto/bfl306qztxhj4b?orgId=1
[21:14:58] oh right, we have to restart the instance to get new readaheads
[21:15:02] going to restart chi on 1008
[21:15:13] 👍
[21:18:05] IO is currently low on 1008, but waiting on it to pick up all the pieces still
[21:18:28] i kinda overdid it... applied 16-sector readahead to sd[abc] and md2
[21:18:38] wasn't sure which would work :P
[21:18:51] pulling recovery speed back up to 128
[21:21:20] interestingly, it looks like 1008 was memory-starved as well: https://grafana.wikimedia.org/goto/afl30st3rcikgf?orgId=1
[21:21:46] I think it was probably I/O bound first though
[21:21:50] I'm always sus of other stats when iowait is high... perhaps i'm leaning too much on decades-old experience, but linux has always behaved terribly for me when iowait is high
[21:22:03] like all kinds of things get blocked out
[21:22:12] yep
[21:22:46] 1008 is keeping very low io, but i'd like it to come back from red so i can restart 1010 as well
[21:23:02] 1010 is currently showing low io as well, not sure why though, it wasn't restarted (but i did change the readaheads)
[21:25:23] yeah, looks like we don't have any more red indices
[21:25:41] ahh yup, there it went, back to yellow
[21:27:52] also, i just checked the other hosts, the ones that weren't blowing up reported 6144 to --getra, 1008 reported 8386560
[21:28:05] if this was readahead again... it's amazing how many times we can be burned by the same problem and not realize it :P
[21:30:41] 1010 is showing some elevated io (we didn't restart it yet), and 1008 is staying happy. Might be the thing... once it goes green i'll restart 1010
[21:36:54] Does that mean the `opensearch-madvise` binary needs updating, or were some hosts just not restarted after it was applied?
[21:37:07] oh, i didn't even check if that was present, hmm
[21:38:22] Moritz helped me cut a new version of the package for Trixie
[21:39:52] i see it installed, and i see the systemd timers. journalctl has no entries
[21:40:08] basically `opensearch-disable-readahead-cloudelastic-chi-eqiad.timer` is empty, hmm
[21:41:10] manually invoking it, output looks reasonable
[21:41:28] the timer is empty on a prod host too, but the service should have journal entries
[21:41:40] like `journalctl -u opensearch-disable-readahead-production-search-eqiad.service` should show something if it's working
[21:42:18] running it manually against the chi pid and data path, it isn't clear if much changed. It would be easier if 1010 was back to 1.5GB/s; it's really unclear why 1010 brought down its io when i restarted 1008
[21:44:18] i guess 1010 was already being intermittent though, sometimes not having such a problem, 1008 was the main one that kept getting stuck
[21:45:13] also, commonswiki shards are up to 70G now, i think it was 50G before?
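For reference, a sketch of the readahead survey and change from 21:09-21:28 (device names are assumptions for these hosts; --getra/--setra values are in 512-byte sectors):

```bash
# survey current readahead on the md array and its member disks
for dev in md2 sda sdb sdc; do
  echo "$dev: $(sudo blockdev --getra /dev/$dev)"
done

# shrink readahead to 16 sectors (8 KiB) on all of them
for dev in md2 sda sdb sdc; do
  sudo blockdev --setra 16 "/dev/$dev"
done
```

As noted at 21:14:58, files lucene already has mmap'd keep the old readahead until the service restarts and re-maps them.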
[21:45:29] Looking back a week, it seems like other hosts also had prolonged periods of I/O stalls https://grafana.wikimedia.org/goto/efl32xy31knb4a?orgId=1
[21:46:07] hmm, goto doesn't seem to work from grafana-rw, it sends me to the homepage
[21:47:04] $@%$#% grafana links
[21:47:19] * ebernhardson kinda wishes he had a unified bash history across all instances
[21:47:28] sounds terrible to maintain, but it would be nice :P
[21:47:34] * inflatador needs a greasemonkey script for grafana links
[21:48:17] I also need to fix that dashboard so it has sane defaults
[21:48:26] pushed recovery max_bytes_per_sec back to 256mb, not sure what's best there, but i want this to hurry up and get back to green, and network/io both seem low
[21:48:51] https://grafana.wikimedia.org/goto/ffl339k39dekgd?orgId=1 does that link work better?
[21:49:08] yup, that sent me to i/o pressure stall information
[21:49:23] when did we upgrade versions?
[21:50:25] Looks like the first host was reimaged on 4/20
[21:50:29] T422860 has more details
[21:50:29] T422860: Migrate Cloudelastic to OpenSearch 2.x - https://phabricator.wikimedia.org/T422860
[21:51:03] ok, so all the io stalls are post-upgrade
[21:51:06] and post-reimaging?
[21:51:09] must be
[21:51:10] yep
[21:51:59] if this does work... going to have to try and understand what happened with the madvise, maybe it's conflicting with something lucene is doing (lucene can also madvise in 2.19 if we have java 21, and we do)
[21:52:16] but with this on a loop, it should just be overriding what lucene is doing
[21:52:41] We had a decent-sized cluster in K8s for Semantic Search, but I have no idea how comparable it is to production
[21:53:50] the biggest difference is probably the update rate
[21:54:11] although historically we've had very few problems with indexing pressure
[21:54:35] If we wanted to recreate the search indices from scratch, is that a MW script or an airflow job?
[21:54:39] back to green, restarting 1010
[21:55:30] now my isp decided it hates me... 64 bytes from 1.1.1.1: icmp_seq=6 ttl=54 time=2666 ms
[21:58:14] inflatador: i suppose, let's turn nginx back on and see if it continues being happy?
[22:01:02] ebernhardson OK, gimme a sec. Also, my question about recreating indices from scratch is more about the TTMserver migration. I imagine they would have to be recreated from the Mediawiki SQL data?
[22:01:16] inflatador: yea, but iirc it's not too bad
[22:02:07] 5.5M docs and 5GB primary store, so i guess not super lightweight, but pretty sure it's not too hard to reinitialize
[22:05:34] seeing some significant IO increases, assume that's turning the requests back on? It's not maxed out though, might need a few minutes to populate the 150GB disk cache properly
[22:10:01] back down to what seems plausibly normal, tempted to declare root cause found
[22:14:10] The root cause being the `opensearch-madvise` service wasn't started properly?
[22:14:29] well, readaheads in general. I'm not sure why opensearch-madvise didn't work
[22:14:41] but i think the general perf problem is likely over-aggressive readahead saturating IO
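A sketch of the "did opensearch-madvise ever actually run" check discussed at 21:39-21:41 and again at 22:14 (unit names are copied from the log; an empty journal for the service, not just the timer, is the suspicious part):

```bash
# are the timers loaded, and when did they last/next fire?
systemctl list-timers 'opensearch-disable-readahead-*'

# the service should log something on every run if it's working
journalctl -u opensearch-disable-readahead-cloudelastic-chi-eqiad.service -n 20
```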
[22:15:31] running out of time today, but also not sure how to validate those. In the past i used the kernel page-types tool to identify the % of pages that were pulled in for readahead but never accessed
[22:15:47] i'm not sure why, but our kernel builds don't include those tools (never did, we had to get a one-off build from moritz back in the day)
[22:16:46] i'm sure there is a reason, maybe we are the only people that ever found it useful :P
[22:18:27] I'll post an update on T424852. What do you think did the trick? `sudo blockdev --setra 16 /dev/md2` and similar commands, or was it restarting `opensearch-madvise`? "I don't know" is a fine answer too ;)
[22:18:27] T424852: Investigate performance issues in cloudelastic - https://phabricator.wikimedia.org/T424852
[22:19:31] inflatador: the blockdev command sets the readahead, but because lucene uses mmap, the readahead settings are only checked when a file is first mmap'd; restarting causes all the files to be reopened and pick up the new readahead
[22:26:35] I'm out of time for today, but we can try a rolling restart tomorrow to see if things stay calm
[22:31:32] sure, everything seems in a decent place now
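A couple of cheap sanity checks for that rolling restart, assuming the readahead theory holds (device names are assumptions; this is not the page-types measurement mentioned above, just the signals already used in this debugging session):

```bash
# per-device read throughput in MB/s; sustained ~500MB/s per disk
# was the failure signature
iostat -xm 10 sda sdb sdc md2

# kernel I/O pressure-stall info; a rising avg10 under "some"/"full"
# means tasks are stalling on IO again
cat /proc/pressure/io
```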