[09:40:39] errand+lunch
[13:12:02] Looks like we don't have anything to report this week... https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2025-04-11
[13:38:07] o/
[13:38:25] o/
[14:58:10] starting a new batch of CODFW row D hosts
[15:25:15] ryankemper I got cirrussearch2114 to reimage without doing anything differently. One thing I noticed is that the cookbook doesn't seem to find a successful puppet run until I manually log in to the new host and run it
[15:31:50] Time to start the weekend! Enjoy!
[15:32:30] * inflatador is wondering if there's a way to manually set a replica as UNASSIGNED/ALLOCATION FAILED. It would speed up the reimages, since the cookbook (correctly) blocks on shards that are initializing
[15:32:37] .o/
[15:39:35] hmm, as soon as I wrote that, one of the shards actually moved
[16:17:06] well, now we're running into the bug where `--allow-yellow` waits for yellow status with no moving shards and doesn't accept green as a valid status... it looks like if the first cluster picked is green, it will spin forever
[16:18:24] we're back to green on chi and omega, I guess I'll manually pick a psi host to reimage and hope that gets us back to green
[16:20:05] `sudo cumin 'P{P:netbox::host%location ~ "D.*codfw"} and A:cirrussearch-codfw-psi'` tells me I should pick elastic2085
[16:51:13] having issues with a mislabeled ethernet port in CODFW, DC ops is helping
[17:45:11] OK, 2085 is done and it's joined the cluster. That should unblock the next batch of row D hosts
[18:47:07] ryankemper This is pretty slow going. I'm gonna temporarily bump up cluster.routing.allocation.node_concurrent_recoveries from 4 to 8; that should help new nodes receive shards a bit faster. LMK if you prefer not to do this
[19:17:53] inflatador: sure
[19:20:54] ryankemper cool, just bumped it up. I can see more shards going to the last reimaged host (cirrussearch2085). We'll see if it makes a difference
[19:27:57] We've got 4 shards in `INITIALIZING ALLOCATION_FAILED`. I thought the cookbook would be OK with that so long as we did --allow-yellow, but apparently not ;(
[19:28:12] `Error while waiting for yellow with no initializing or relocating shards`
[19:28:28] Oh, I guess we have to wait for those to go to FAILED state
[19:40:58] ryankemper elastic2105 is safe to rename/reimage if you wanna do a one-off. I'm doing 2104. I'm looking at the unassigned shards to figure out which are safe
[19:42:02] cmds like `sudo cookbook sre.hosts.rename -t T388610 elastic2085 cirrussearch2085` for rename and `sudo cookbook sre.hosts.reimage --os bullseye --move-vlan --new cirrussearch1234 -t T388610` for reimage
[19:42:02] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[19:43:56] inflatador: sure, I'll kick off 2105
[19:48:40] Thanks, hopefully we can get a few more done before the weekend ;)
[19:54:41] well damn, I accidentally renamed elastic2104 to cirrussearch2014 ... trying to rename again, but the cookbook says hosts not found
[19:56:30] gonna try putting the new hostname in Puppet to see if it makes a difference
[20:00:47] inflatador: missing preceding / in your patch
[20:01:22] ryankemper ACK, fixing now
[20:04:29] So, I was looking at a query in Portuguese that was searched on French Wikipedia. I tested the query on English Wikipedia, and the top two snippets are in Spanish. My brain hurts.
[20:05:08] Too much Romance can hurt your brain?
[20:07:23] It's le whiplash... each time I primed my poor little brain for one Romance language, it got slapped upside the head with a different one!
[20:20:03] I dunno what to do with this cirrussearch2014 thing... the rename cookbook still fails at `self.remote_host = spicerack.remote().query(self.old_fqdn)`
[20:20:37] ryankemper what do you think? Should I give up and move to the next host, or make some patches so we can use the host with its incorrect name?
[20:23:40] inflatador: what's wrong exactly, it won't let you rename it?
[20:23:57] if so, we need to reimage it to get puppet running, and then we can rename it again
[20:25:13] ryankemper good idea, let me try that
[20:25:20] and yes, it won't let me rename it
[20:25:45] yeah, it needs to be in puppet for that query to work
[20:26:12] it's in the puppet repo, but I guess it needs to actually be in PuppetDB?
[20:32:37] yup
[20:32:41] cause of the `spicerack.remote().query(self.old_fqdn)` call
[20:42:06] we're getting a few crashloop errors for the logstash process on elastic1053, checking it now
[20:44:39] hmm, got the same alert for 1057. Both cleared when I restarted logstash... not sure what happened there
[20:59:36] damn, looks like I have to add 2014 to regex.yaml as well... one sec
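The 15:32:30 wish to manually push a replica into an UNASSIGNED state has a rough stock equivalent in the `_cluster/reroute` API's `cancel` command, which aborts an in-flight replica recovery so the shard falls back to unassigned and gets reallocated later. A minimal sketch, not what was run here; the index name, shard number, node name, and `localhost:9200` endpoint are all hypothetical placeholders:

```shell
# Hypothetical values throughout -- substitute real index/shard/node names.
REROUTE_BODY='{
  "commands": [
    {
      "cancel": {
        "index": "some_index",
        "shard": 0,
        "node": "some-node-name",
        "allow_primary": false
      }
    }
  ]
}'
# Sanity-check the payload locally before sending it anywhere.
echo "$REROUTE_BODY" | python3 -m json.tool > /dev/null && echo "reroute payload ok"
# Against a real cluster (left commented out in this sketch):
# curl -s -XPOST 'http://localhost:9200/_cluster/reroute' \
#   -H 'Content-Type: application/json' -d "$REROUTE_BODY"
```

Note `allow_primary: false`: cancelling a primary recovery is a data-loss risk, so keeping it false restricts the command to replicas, which is all the reimage flow needs.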
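The 18:47:07 recovery-speed bump is a transient cluster setting. A sketch of the call, assuming a plain Elasticsearch/OpenSearch settings API; the `localhost:9200` endpoint is a placeholder, not the production address:

```shell
# Transient so it resets on full cluster restart; placeholder endpoint.
SETTINGS_BODY='{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 8
  }
}'
# Validate the payload locally before applying it.
echo "$SETTINGS_BODY" | python3 -m json.tool > /dev/null && echo "settings payload ok"
# Apply it (commented out in this sketch):
# curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
#   -H 'Content-Type: application/json' -d "$SETTINGS_BODY"
```

Using `transient` rather than `persistent` fits a temporary bump like this one: there's no leftover setting to remember to revert once the reimages are done.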
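The shard-state checking at 19:27:57 and 19:40:58 (which shards are `INITIALIZING` or `UNASSIGNED`, and why) maps onto the `_cat/shards` API plus a trivial filter. A sketch; the here-doc is canned sample output with hypothetical index names so the filter can be shown offline, and the endpoint in the comment is a placeholder:

```shell
# With a live cluster, the here-doc below would be replaced by:
#   curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason'
# (for a single shard's full story, _cluster/allocation/explain is more detailed)
NON_STARTED=$(awk '$4 != "STARTED"' <<'EOF'
somewiki_file      0 p STARTED
somewiki_file      1 r INITIALIZING ALLOCATION_FAILED
otherwiki_general  3 r UNASSIGNED   ALLOCATION_FAILED
otherwiki_general  2 p STARTED
EOF
)
# Print only the shards the cookbook's yellow/green wait would block on.
echo "$NON_STARTED"
```

On the sample data this prints the two `ALLOCATION_FAILED` rows, the same kind of listing used above to decide which unassigned shards made a host safe to reimage.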