[07:06:48] gehel: not sure they're entirely comparable but looking at 2 days of data before and after the split, p75 dropped from 48ms to 25ms (https://docs.google.com/document/d/1VKoy7n1rZ2IqbkmcwaUuNpvwklOLZoKVTgfsbtha6RI/edit?usp=sharing)
[08:29:03] sigh... relforge1009 crashed again, puppet might have reset my LD_LIBRARY_PATH hack and somehow it got restarted without it...
[08:29:31] dcausse: how can that happen?
[08:30:19] pfischer: I think on every puppet run it will erase the changes I made locally
[08:30:36] I was hoping that opensearch would not restart in this case
[08:31:07] Ah, forgot about those changes being local. Because the patch has not been merged yet?
[08:33:25] yes, exactly, it's there: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135430
[08:34:14] looking at the logs I'm not sure I understand, might just be me not applying my hack properly...
[08:35:55] vectors were imported on relforge@frwiki, doing itwiki now
[08:42:23] inflatador: I looked closer into glent and sadly I'm not sure it's easy to disable the creation of those indices only in codfw...
[08:43:49] we could disable the search-loader in codfw but that might have side-effects we don't want (i.e. stop some updates we'd like to keep)
[08:46:25] dcausse: the patch is pretty much straightforward, but I can only +1. Is there anybody else to ask, maybe Balthazar?
[08:46:55] pfischer: yes, we'd need an SRE for puppet
[08:47:19] I can try to bother Balthazar :)
[09:35:18] errand+lunch
[13:14:38] dcausse re: glent that's OK, we can work around it
[13:15:49] o/
[13:15:57] inflatador: thanks!
[13:16:14] if this ever becomes a big deal we'll find a solution
[13:17:39] nah, hopefully we'll be done soon
[13:18:24] dcausse: I see an email about your oversight permission on test.wikidata.org. Do I need to do anything? Do you still need those permissions?
[13:19:06] gehel: saw it too, thanks, no I don't need those anymore
[13:59:57] banning nodes doesn't seem to be working as expected in CODFW... I suspect because banning is 'best effort' and we're short on nodes?
[14:02:21] inflatador: yep, if elastic does not have a way to relocate the shards, they will stay in place
[14:02:58] "reroute explain" might give you additional context: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html
[14:12:34] ACK, will check. In terms of not losing data, we are OK as long as there's only one replica missing from a shard, right? As in, we can reimage the host even if it has the primary, because there are 2 replicas (except for TTM and apifeatureusage)?
[14:24:09] inflatador: yes, if the shards remaining on the banned host are replicas you should be good
[14:24:55] TTM should definitely have 2 replicas, this index is quite important
[14:27:01] dcausse what are the ttm indices named? There's one called 'ttmserver' and it only has 1 replica
[14:31:24] dcausse to clarify, I was thinking it was OK to reimage a host, even if it has a primary shard, as long as there's at least one replica, since the replica would be promoted
[14:35:21] inflatador: should be ttmserver & ttmserver-test, I'll bump the replicas to 2 if that's OK
[14:35:52] dcausse sure!
[14:36:48] inflatador: re primary shards on the banned host, this might be risky, have you encountered this scenario?
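A minimal sketch of the ban-and-explain workflow discussed above, assuming the standard cluster-settings and allocation-explain APIs. The node-name patterns and the shard picked are illustrative, not the exact commands used here; the ban cookbook/playbook may do this differently:

```bash
# Sketch only: node names below are placeholders, not the actual ban list.
ES=https://search.svc.codfw.wmnet:9243   # chi cluster endpoint per the log

# Ban hosts by node name. This is best effort: if the cluster has nowhere to
# relocate a shard, it stays on the banned node.
curl -s -XPUT "$ES/_cluster/settings" -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._name": "elastic2075*,cirrussearch2091*"
  }
}'

# Ask the cluster why a given shard copy is (or is not) allowed to move.
curl -s -XGET "$ES/_cluster/allocation/explain" -H 'Content-Type: application/json' -d '{
  "index": "arwiktionary_content_1728085605",
  "shard": 0,
  "primary": true
}' | jq .
```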
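For the ttmserver replica bump, something along these lines would do it; the endpoint is a placeholder for whichever cluster hosts the translate-memory indices, and the exact call used isn't shown in the log:

```bash
ES=https://search.svc.codfw.wmnet:9243   # placeholder: use the cluster hosting ttmserver

# Check current replica counts (rep column), then bump both indices to 2.
curl -s "$ES/_cat/indices/ttmserver*?v&h=index,pri,rep,health"

for idx in ttmserver ttmserver-test; do
  curl -s -XPUT "$ES/$idx/_settings" \
    -H 'Content-Type: application/json' \
    -d '{"index": {"number_of_replicas": 2}}'
done
```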
[14:41:31] dcausse yeah, so far it has been OK, similar situation as if we lost a node under normal circumstances
[14:41:45] but I would prefer to avoid it if possible... that's why I'm looking closely at the ban stuff
[14:42:36] elastic2075 is one example of a host that's banned but still has primary shards
[14:43:05] the other possibility is that my ban playbook doesn't work, but it works in cloudelastic and relforge so I don't think that's it
[14:48:39] the exclude._name has many nodes unrelated to the chi cluster, shouldn't matter though
[14:49:42] yeah, I went with a shotgun approach to make the logic simpler... the ban cookbook does the same thing and it works as well
[14:50:41] with a primary shard on a banned host it's a lot trickier to check if it's safe to remove the host...
[14:52:13] I was thinking it was OK since replicas would get auto-promoted. But would it be dangerous if the replicas were out of date or something?
[14:52:44] inflatador: are you running yellow or green during the migration?
[14:53:20] if you're yellow I'm not sure how you can determine if that primary shard is replicated or not
[14:53:27] yellow at the moment in chi and psi. There are 4 unassigned shards, all replicas
[14:53:55] unassigned shards must be replicas or we're in trouble :)
[14:54:56] ;) for those shards, I've been checking that we have at least one other replica
[16:00:59] Supposedly OS can't replicate to ES, but I'm seeing OS primaries w/ ES replicas, for example arwiktionary_content_1728085605 shard 0 in psi. I wonder if there is a way to tell if those replicas are actually healthy?
[16:01:50] Not that it really matters since we aren't going to reimage the cirrussearch hosts, just kinda curious
[16:03:40] workout, back in ~40
[16:07:07] ryankemper just a heads-up, I'm reimaging cirrussearch2075 ATM
[16:12:19] I suppose the cluster could decide to allocate a primary to OS while still preserving existing replicas on other ES nodes; the issue will arise when it tries to move that replica somewhere else
[16:47:16] inflatador: I wrote a crappy bash script that checks if primary shards on banned hosts have at least a replica elsewhere on a non-banned host, it's in deployment.eqiad.wmnet:/home/dcausse/are_we_safe.sh
[16:47:53] it detects the tasks index, which seems to be hosted on a banned opensearch host
[16:54:01] actually extended that to all shards, not only primaries, because why not :)
[16:59:20] dcausse thanks, I was gonna make one of those too
[16:59:36] back
[17:01:51] There must be a way to check a shard's lucene version, that might help us understand the OS primary/ES replica thing too
[17:14:54] might be rather low-level if it exists, something at the lucene segment level, not sure that's exposed
[17:15:20] there's something: https://search.svc.codfw.wmnet:9243/$index_name/_segments
[17:19:01] interesting, `curl https://search.svc.codfw.wmnet:9643/arwiktionary_content_1728085605/_segments?pretty` gives me a couple of different lucene versions, 8.7.0 and 8.10.1
[17:21:26] 8.10.1 must be on the opensearch nodes
[17:22:22] could be that low-traffic indices might not get any changes, so no new 8.10.1 segments get created, allowing replication to happen in the os->es direction
[17:25:01] dinner
[17:25:20] ryankemper cirrussearch2075 done, moving to cirrussearch2111
[17:27:24] this is our last row A non-master host, so we can try the cookbook again and probably roll with whatever row it chooses
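Not the are_we_safe.sh script referenced above, but a sketch of the same idea: for every shard copy on a banned node, confirm at least one other started copy lives on a non-banned node. Endpoint and banned-node pattern are placeholders:

```bash
#!/bin/bash
# Sketch of an "are we safe to remove the banned hosts" check.
ES=https://search.svc.codfw.wmnet:9243          # placeholder endpoint
BANNED='elastic2075|cirrussearch2091'           # placeholder node-name regex

curl -s "$ES/_cat/shards?h=index,shard,prirep,state,node" |
  awk -v banned="$BANNED" '
    $4 == "STARTED" {
      key = $1 ":" $2
      if ($5 ~ banned) on_banned[key] = 1   # a copy sits on a banned node
      else             safe[key] = 1        # a copy sits on a non-banned node
    }
    END {
      # Flag shards whose only started copies are on banned nodes.
      for (k in on_banned)
        if (!(k in safe)) print "NOT SAFE:", k
    }'
```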
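To compare Lucene versions per shard copy as discussed above, a jq pass over the _segments output could look roughly like this. The response layout assumed here is the usual indices → shards → copies → segments nesting, and routing.node is a node id rather than a hostname:

```bash
curl -s "https://search.svc.codfw.wmnet:9643/arwiktionary_content_1728085605/_segments" |
  jq -r '.indices[].shards[][]
         | [ (if .routing.primary then "primary" else "replica" end),
             .routing.node,
             ([ .segments[].version ] | unique | join(",")) ]
         | @tsv'
```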
[17:59:54] looks like 2111 might give us problems... seems to be hanging at PXE. Will try upgrading firmware
[19:09:12] firmware upgrade didn't work, but using TFTP did. I guess EFI isn't a magic bullet for those R450 hosts
[20:08:50] ryankemper I was wrong, we have 1 more row A... reimaging 2091 now
[20:11:56] if you wanna try running the rolling operation and just see what row it picks, we can start getting a puppet patch together. Cmd: `test-cookbook -c 1135133 --no-sal-logging sre.elasticsearch.rolling-operation search_codfw "reimage next row" --reimage --task-id T388610 --nodes-per-run 3 --allow-yellow --wait-for-confirmation --start-datetime 2025-04-01T17:08:19`
[20:11:56] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[21:05:34] oh boy, Comm Error backplane 0 on cirrussearch2091