[07:06:48] gehel: not sure they're entirely comparable but looking at 2 days of data before and after the split, p75 dropped from 48ms to 25ms (https://docs.google.com/document/d/1VKoy7n1rZ2IqbkmcwaUuNpvwklOLZoKVTgfsbtha6RI/edit?usp=sharing)
[08:29:03] sigh... relforge1009 crashed again, puppet might have reset my LD_LIBRARY_PATH hack and somehow it got restarted without it...
[08:29:31] dcausse: how can that happen?
[08:30:19] pfischer: I think on every puppet run it will erase the changes I made locally
[08:30:36] I was hoping that opensearch would not restart in this case
[08:31:07] Ah, forgot about those changes being local. Because the patch has not been merged yet?
[08:33:25] yes, exactly, it's there: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135430
[08:34:14] looking at the logs I'm not sure I understand, might just be me not applying my hack properly...
[08:35:55] vectors were imported on relforge@frwiki, doing itwiki now
[08:42:23] inflatador: I looked closer into glent and sadly I'm not sure it's easy to disable the creation of those indices only in codfw...
[08:43:49] we could disable the search-loader in codfw but that might have side-effects we don't want (i.e. stop some updates we'd like to keep)
[08:46:25] dcausse: the patch is pretty much straightforward, but I can only +1. Is there anybody else to ask, maybe Balthazar?
[08:46:55] pfischer: yes, we'd need an SRE for puppet
[08:47:19] I can try to bother Balthazar :)
[09:35:18] errand+lunch
[13:14:38] dcausse re: glent that's OK, we can work around it
[13:15:49] o/
[13:15:57] inflatador: thanks!
[13:16:14] if this ever becomes a big deal we'll find a solution
[13:17:39] nah, hopefully we'll be done soon
[13:18:24] dcausse: I see an email about your oversight permission on test.wikidata.org. Do I need to do anything? Do you still need those permissions?
[13:19:06] gehel: saw it too, thanks, no I don't need those anymore
[13:59:57] banning nodes doesn't seem to be working as expected in CODFW... I suspect because banning is 'best effort' and we're short on nodes?
[14:02:21] inflatador: yep, if elastic does not have a way to relocate the shards, they will stay in place
[14:02:58] "reroute explain" might give you additional context: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html
[14:12:34] ACK, will check. In terms of not losing data, we are OK as long as there's only one replica missing from a shard, right? As in, we can reimage the host even if it has the primary, because there are 2 replicas (except for TTM and apifeatureusage)?
[14:24:09] inflatador: yes, if the shards remaining on the banned host are replicas you should be good
[14:24:55] TTM should definitely have 2 replicas, this index is quite important
[14:27:01] dcausse what are the ttm indices named? There's one called 'ttmserver' and it only has 1 replica
[14:31:24] dcausse to clarify, I was thinking it was OK to reimage a host, even if it has a primary shard, as long as there's at least one replica, since the replica would be promoted
[14:35:21] inflatador: should be ttmserver & ttmserver-test, I'll bump the replicas to 2 if that's OK
[14:35:52] dcausse sure!
[14:36:48] inflatador: re primary shards on the banned host, this might be risky, have you encountered this scenario?
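A minimal sketch of the ban-and-explain workflow discussed above, assuming the standard cluster-settings and allocation-explain APIs. The node-name patterns and the shard picked are illustrative, not the exact commands used here; the ban cookbook/playbook may do this differently:

```bash
# Sketch only: node names below are placeholders, not the actual ban list.
ES=https://search.svc.codfw.wmnet:9243   # chi cluster endpoint per the log

# Ban hosts by node name. This is best effort: if the cluster has nowhere to
# relocate a shard, it stays on the banned node.
curl -s -XPUT "$ES/_cluster/settings" -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._name": "elastic2075*,cirrussearch2091*"
  }
}'

# Ask the cluster why a given shard copy is (or is not) allowed to move.
curl -s -XGET "$ES/_cluster/allocation/explain" -H 'Content-Type: application/json' -d '{
  "index": "arwiktionary_content_1728085605",
  "shard": 0,
  "primary": true
}' | jq .
```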
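For the ttmserver replica bump, something along these lines would do it; the endpoint is a placeholder for whichever cluster hosts the translate-memory indices, and the exact call used isn't shown in the log:

```bash
ES=https://search.svc.codfw.wmnet:9243   # placeholder: use the cluster hosting ttmserver

# Check current replica counts (rep column), then bump both indices to 2.
curl -s "$ES/_cat/indices/ttmserver*?v&h=index,pri,rep,health"

for idx in ttmserver ttmserver-test; do
  curl -s -XPUT "$ES/$idx/_settings" \
    -H 'Content-Type: application/json' \
    -d '{"index": {"number_of_replicas": 2}}'
done
```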
[14:41:31] dcausse yeah, so far it has been OK, similar situation as if we lost a node under normal circumstances
[14:41:45] but I would prefer to avoid it if possible... that's why I'm looking closely at the ban stuff
[14:42:36] elastic2075 is one example of a host that's banned but still has primary shards
[14:43:05] the other possibility is that my ban playbook doesn't work, but it works in cloudelastic and relforge so I don't think that's it
[14:48:39] the exclude._name has many nodes unrelated to the chi cluster, shouldn't matter though
[14:49:42] yeah, I went with a shotgun approach to make the logic simpler... the ban cookbook does the same thing and it works as well
[14:50:41] with a primary shard on a banned host it's a lot trickier to check if it's safe to remove the host...
[14:52:13] I was thinking it was OK since replicas would get auto-promoted. But would it be dangerous if the replicas were out of date or something?
[14:52:44] inflatador: are you running yellow or green during the migration?
[14:53:20] if you're yellow I'm not sure how you can determine if that primary shard is replicated or not
[14:53:27] yellow at the moment in chi and psi. There are 4 unassigned shards, all replicas
[14:53:55] unassigned shards must be replicas or we're in trouble :)
[14:54:56] ;) for those shards, I've been checking that we have at least one other replica
[16:00:59] Supposedly OS can't replicate to ES, but I'm seeing OS primaries w/ ES replicas, for example arwiktionary_content_1728085605 shard 0 in psi. I wonder if there is a way to tell if those replicas are actually healthy?
[16:01:50] Not that it really matters since we aren't going to reimage the cirrussearch hosts, just kinda curious
[16:03:40] workout, back in ~40
[16:07:07] ryankemper just a heads-up, I'm reimaging cirrussearch2075 ATM
[16:12:19] I suppose the cluster could decide to allocate a primary to OS while still preserving existing replicas on other ES nodes; the issue will arise when it tries to move that replica somewhere else
[16:47:16] inflatador: I wrote a crappy bash script that checks if primary shards on banned hosts have at least a replica elsewhere on a non-banned host, it's in deployment.eqiad.wmnet:/home/dcausse/are_we_safe.sh
[16:47:53] it detects the tasks index, which seems to be hosted on a banned opensearch host
[16:54:01] actually extended that to all shards, not only primaries, because why not :)
[16:59:20] dcausse thanks, I was gonna make one of those too
[16:59:36] back
[17:01:51] There must be a way to check a shard's lucene version, that might help us understand the OS primary/ES replica thing too
[17:14:54] might be rather low-level if it exists, something at the lucene segment level, not sure that's exposed
[17:15:20] there's something: https://search.svc.codfw.wmnet:9243/$index_name/_segments
[17:19:01] interesting, `curl https://search.svc.codfw.wmnet:9643/arwiktionary_content_1728085605/_segments?pretty` gives me a couple of different lucene versions, 8.7.0 and 8.10.1
[17:21:26] 8.10.1 must be on the opensearch nodes
[17:22:22] could be that low-traffic indices might not get any changes, so no new 8.10.1 segments get created, allowing replication to happen in the os->es direction
[17:25:01] dinner
[17:25:20] ryankemper cirrussearch2075 done, moving to cirrussearch2111
[17:27:24] this is our last row A non-master host, so we can try the cookbook again and probably roll with whatever row it chooses
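Not the are_we_safe.sh script referenced above, but a sketch of the same idea: for every shard copy on a banned node, confirm at least one other started copy lives on a non-banned node. Endpoint and banned-node pattern are placeholders:

```bash
#!/bin/bash
# Sketch of an "are we safe to remove the banned hosts" check.
ES=https://search.svc.codfw.wmnet:9243          # placeholder endpoint
BANNED='elastic2075|cirrussearch2091'           # placeholder node-name regex

curl -s "$ES/_cat/shards?h=index,shard,prirep,state,node" |
  awk -v banned="$BANNED" '
    $4 == "STARTED" {
      key = $1 ":" $2
      if ($5 ~ banned) on_banned[key] = 1   # a copy sits on a banned node
      else             safe[key] = 1        # a copy sits on a non-banned node
    }
    END {
      # Flag shards whose only started copies are on banned nodes.
      for (k in on_banned)
        if (!(k in safe)) print "NOT SAFE:", k
    }'
```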
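To compare Lucene versions per shard copy as discussed above, a jq pass over the _segments output could look roughly like this. The response layout assumed here is the usual indices → shards → copies → segments nesting, and routing.node is a node id rather than a hostname:

```bash
curl -s "https://search.svc.codfw.wmnet:9643/arwiktionary_content_1728085605/_segments" |
  jq -r '.indices[].shards[][]
         | [ (if .routing.primary then "primary" else "replica" end),
             .routing.node,
             ([ .segments[].version ] | unique | join(",")) ]
         | @tsv'
```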
[17:59:54] looks like 2111 might give us problems... seems to be hanging at PXE. Will try upgrading firmware
[19:09:12] firmware upgrade didn't work, but using TFTP did. I guess EFI isn't a magic bullet for those R450 hosts
[20:08:50] ryankemper I was wrong, we have 1 more row A... reimaging 2091 now
[20:11:56] if you wanna try running the rolling operation and just see what row it picks, we can start getting a puppet patch together. Cmd: `test-cookbook -c 1135133 --no-sal-logging sre.elasticsearch.rolling-operation search_codfw "reimage next row" --reimage --task-id T388610 --nodes-per-run 3 --allow-yellow --wait-for-confirmation --start-datetime 2025-04-01T17:08:19`
[20:11:56] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[21:05:34] oh boy, Comm Error backplane 0 on cirrussearch2091