[00:09:23] I'll be out tomorrow AM my time, and off completely on Monday. Have a great weekend if I don't see ya!
[08:44:24] hm, cloudelastic recovery stuck with "Format version is not supported"
[08:45:45] and the other usual constraints: row awareness and the disk threshold
[08:49:09] dcausse: just to make sure: the public RDF stream is based on an HTTP stream, not one request per update?
[08:50:22] gehel: it's one JSON line per update on a long-running HTTP connection (https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams_HTTP_Service)
[08:50:35] great! thanks!
[08:51:34] for cloudelastic perhaps we can relax the disk threshold a bit...
[08:52:26] but yes, it seems like shards that have been updated on a new node cannot move back to elastic
[08:53:38] that seems unsurprising!
[08:54:12] relax the disk threshold? allow the disks to fill up more?
[08:54:16] yes
[08:54:45] or should we just migrate everything to OpenSearch as fast as we can?
[08:56:09] we should definitely do the migration quickly, but we wanted to stay green on cloudelastic during the re-image given we're running without cloudelastic1008
[08:57:21] cloudelastic1009 still has plenty of running shards...
[08:57:48] In the previous elasticsearch upgrades, we've regularly run into a yellow cluster state, due to the same thing about shards not being able to move from a newer version to an older version. So I'm not sure it is realistic to expect the cluster to stay green through the whole operation.
[08:59:26] for cloudelastic, removing a node while being yellow is a bit risky since we run with lower replication (1 replica)
[08:59:52] there's a good chance of getting stuck red since we wipe the disks
[09:02:25] Another use case for getting weighted tags per article: T379119
[09:02:26] T379119: [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119
[09:03:14] so for cloudelastic, we might need to add some way to check that we are safe to remove a node when we're in yellow state
[09:05:42] so tempting to use the cirrusdoc api... :)
[09:06:01] it works and there is nothing else, so I understand the temptation!
[09:16:10] we should perhaps do some optimization anyways (avoid the sequential gets & propagate cdincludes to the source_filter) and perhaps be explicit in the doc that the content returned is not a stable format
[09:23:41] update the watermarks to 85%, 90%, 95% (low, high, flood)
[09:27:08] things are moving out of cloudelastic1009 again...
[09:33:05] making it clear that the doc is not stable seems like a good idea. I'll open a phab task
[09:35:15] I have the vague impression that the recovery pool (8 parallel recoveries) is being used up by impossible recoveries from cloudelastic1007 (opensearch) to other nodes
[09:37:52] impossible recoveries?
[09:42:48] yes...
[09:43:32] not sure, but it seems like it tries to recover and then fails because of the version mismatch, so it's not something it takes into account when planning the recovery
[09:45:31] sigh... it's because it recovers from the primary shard...
[09:47:49] :/
[09:56:44] so if the banned node only has replica shards left I guess that's good enough?
[09:59:45] status update on https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2025-03-07
[10:00:45] tempted to stop one elastic instance on cloudelastic1009 just to see
[10:01:22] dcausse: worst case, you restart it!
[10:03:55] did not go well...
[10:04:53] we need to check master eligible nodes...
[10:07:29] strange... cloudelastic1010 should have been picked up...
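For the watermark change at [09:23:41], a minimal sketch of how those three thresholds can be applied as transient cluster settings (the port and the use of transient rather than persistent settings are assumptions; each cloudelastic instance listens on its own port):

    curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
      }
    }'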
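And for the master-election puzzle above, a couple of read-only checks that show which nodes are master-eligible and which one is currently elected (again assuming localhost:9200; an "m" in node.role marks a master-eligible node, "*" in the master column marks the elected master):

    curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,master'
    # the committed voting configuration is part of the cluster state metadata
    curl -s 'localhost:9200/_cluster/state/metadata?filter_path=metadata.cluster_coordination'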
[10:09:52] we run discovery.zen.minimum_master_nodes: 1 and cloudelastic1010 would have been the last remaining master eligible...
[10:14:01] dcausse: scream if you need help!
[10:14:02] hm... discovery.zen.minimum_master_nodes: 1 is only in opensearch.yml
[10:14:34] :(
[10:15:24] "master not discovered or elected yet, an election requires at least 2 nodes with ids from"
[10:16:44] Do we need to exclude those master eligible nodes after the reimage?
[10:17:23] https://www.elastic.co/guide/en/elasticsearch/reference/current/add-elasticsearch-nodes.html#modules-discovery-removing-nodes
[10:17:59] the new automatic management of master nodes is nice, but slightly too much magic!
[10:18:58] I think we need to relax discovery.zen.minimum_master_nodes to one; I believe it's set to two automatically since we haven't forced a value in elasticsearch.yml (only in opensearch.yml)
[10:19:12] hopefully it's settable as a transient setting
[10:20:14] is that setting still valid?
[10:21:39] "Elasticsearch (versions 7 and later) manages the size of the voting set of master nodes itself, so you don't have to worry about this any more" :(
[10:23:25] cloudelastic1008 was master eligible so we're running with only 2 now
[10:23:38] which might not be enough...
[10:24:09] unless we can force some rules or move that flag from cloudelastic1009 to another node
[10:24:48] need to go out for lunch, invited to a retirement party for an old friend. Might be back later than usual this afternoon
[10:47:46] dcausse: enjoy!
[11:37:06] looks like the mw train was a success this week, and our articlecountry change is now available
[12:37:45] lunch+errand
[14:18:38] gmodena: nice!
[14:36:53] hm, so for cloudelastic, not sure what to do...
[14:37:01] accept downtime and go without a master during the re-image
[14:37:49] or quickly restart an instance to mark it as master eligible (will most likely go red during that restart)
[14:40:27] I have a WIP MR to increase data retention of query_clicks_daily https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1105
[14:41:17] still running some (data) validation
[14:44:03] moved the master manually with /_cluster/voting_config_exclusions, stopping psi on cloudelastic1009 to see
[14:44:49] worked, still yellow
[14:45:17] will try for omega and chi
[14:57:45] ok cloudelastic1009 is out
[14:58:35] still yellow
[15:01:37] I can keep it that way while waiting for Brian to re-image
[15:01:57] or restart the service, cloudelastic should have no shards
[15:02:11] s/cloudelastic/cloudelastic1009
[15:02:21] added a couple of notes at https://etherpad.wikimedia.org/p/cloudelastic_opensearch
[15:02:34] \o
[15:02:36] o/
[15:02:42] looks like a fun day :)
[15:02:50] :)
[15:03:58] it's been a while since I played with all this shard assignment stuff and I'm definitely rusty :)
[15:04:15] it's changed a bit in 7 too, with the new consensus algo and a few APIs
[15:04:34] yes, just realized today that discovery.zen.minimum_master_nodes is no longer a thing :)
[15:05:35] also that working with only 2 master eligible nodes is a bit messy
[15:06:09] hmm, we should be able to go to 3 in this cluster?
[15:06:31] i suppose in theory the new master handling should allow much more variety, but i haven't looked into how much
[15:10:03] cloudelastic1008 was master eligible but it's gone
[15:10:42] haven't found a way to promote another node as master eligible without a config change and a restart
[15:11:03] which I wanted to avoid so as not to go red
[15:12:54] ahh, yea that makes sense. I suppose we might make a ticket to review after the migration whether we can simply have more masters now, maybe all of cloudelastic even
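For reference, the manual master move described at [14:44:03] boils down to two calls; a rough sketch assuming the ES 7.8+ form of the API and an illustrative node name:

    # push the node out of the voting configuration so another master-eligible node takes over
    curl -s -XPOST 'localhost:9200/_cluster/voting_config_exclusions?node_names=cloudelastic1009-cloudelastic-chi-eqiad'
    # clear the exclusion list once the node has been reimaged (or is gone for good)
    curl -s -XDELETE 'localhost:9200/_cluster/voting_config_exclusions?wait_for_removal=false'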
[15:13:12] with more opensearch nodes entering the cluster we should have more flexibility
[15:13:57] but for now, once a primary shard is promoted on cloudelastic1007 it cannot be recovered to another elastic node
[15:14:40] and elastic keeps trying to recover and failing because of the lucene index version mismatch
[15:14:46] it also can't be replicated to an elastic node, it sounds like?
[15:15:25] yes, apparently the low-level index format changed so it's one way only, elastic -> opensearch
[15:16:02] sounds like we need to do the big cluster restarts paying attention to spreading across rows as well then
[15:16:31] yes, if row awareness is messing with recoveries that won't help :/
[15:17:18] or maybe relaxing row awareness during the cluster migration makes more sense?
[15:17:25] sure
[15:17:57] or shift traffic to the spare so that we're ok running yellow and missing a few hot shards like enwiki
[15:18:36] I mean running yellow for longer than usual
[15:19:19] hmm, yea shuffling traffic i suppose would keep with things we usually do and know what to expect from
[15:19:19] but yes dropping row awareness might ease things quite a lot
[15:21:08] well... I'm not even sure... if on the first node restarted one enwiki shard is promoted as primary on opensearch then it'll have a hard time recovering elsewhere
[15:22:43] but that should be a problem we've had in the past, no?
[15:22:43] i was thinking we often do 3 at a time, i guess we must be doing those in row based blocks already, but indeed it's a bit iffy at the beginning when things can't move
[15:23:10] well, not row based blocks, but keeping rows in mind and not doing all 3 in one row
[15:23:39] i suppose we must have, lucene index versioning is certainly not new
[16:47:03] * ebernhardson still can't figure out why intellij is not recognizing kotlin...considering kotlin was developed by jetbrains
[17:32:33] * ebernhardson apparently now has an opensearch-1.3.20-analysis-sudachi-3.3.1-SNAPSHOT.zip ... how likely is it that it works?
[17:39:43] ebernhardson: IIRC it needs a dictionary somewhere on the host so it might not work out of the box
[17:40:17] might need some tweak to the deb packaging thing to pull this out
[17:41:55] oh ok, i hadn't looked into that at all yet
[17:44:42] certainly finding it taking a minute to remember what to do...i guess a few months off does that
[18:23:45] heading out, I've left cloudelastic in a weird state, cloudelastic1009 is excluded from the master election and I stopped all 3 elastic services there
[18:24:13] inflatador, ryankemper: some notes at https://etherpad.wikimedia.org/p/cloudelastic_opensearch if you plan to continue the rollout today
[18:55:39] dcausse :eyes
[19:18:54] also, looks like WMCS is getting alerts for hosts that have the new opensearch role, not sure why that is yet (ref T388270 )
[19:18:55] T388270: Update alerting to correspond with the new cloudsearch cluster - https://phabricator.wikimedia.org/T388270
[19:20:35] reimaging 1009 now
[19:21:03] sadly no alerts there currently :P
[19:21:17] i don't remember how to get historical ones, iirc it's not here but in logstash or grafana
[19:22:46] It seems like an alert scope problem to me, but I'm not sure yet. I asked in IRC if they actually wanted to get alerts for cloudelastic
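Going back to the idea of relaxing row awareness during the migration ([15:17:18]): a sketch of what that could look like as a transient setting, assuming the awareness attribute is named "row" as the discussion suggests and that an empty value disables awareness; the puppet-managed value would apply again once the override is removed:

    # drop allocation awareness so shards can be placed regardless of row
    curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' \
      -d '{"transient": {"cluster.routing.allocation.awareness.attributes": ""}}'
    # remove the transient override when done, falling back to the configured value
    curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' \
      -d '{"transient": {"cluster.routing.allocation.awareness.attributes": null}}'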
[19:23:49] I know that if I was in WMCS, I wouldn't want to ;P
[19:24:20] heh, yea i don't imagine they can do anything about those, it's just noise there
[19:32:16] well, I'm not seeing anything on cloudelastic1009's serial console. I'll try the web UI, but I hope it's not the backplane again ;(
[19:49:23] no errors logged, but the host is stuck at the BIOS screen. Rebooting...
[19:52:43] so many fun random variations in hosts
[19:56:39] Works 50% of the time, every time ;P
[19:58:34] Trey314159: any suggestions on how to test sudachi? In theory i have a .deb built, installed into our dev image, and running in my local dev env (copy of cindy's env). But other than checking that the settings configured sudachi, i'm not sure what to look at
[20:00:32] i suppose it seems to work, copying a Special:Random page from jawiki locally and then searching for some string...but that already worked before :P
[20:05:20] hanging on boot again. I think I'm gonna try using EFI...Broadcom has discouraged us from using Ye Olde BIOS
[20:40:22] having any luck?
[20:50:31] about to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125520 and will give it another shot after that
[21:04:10] * ebernhardson is wondering how much effort to put into the sudachi for 1.3...tempted to leave it fairly hackish and forked into gitlab instead of upstreaming
[21:08:07] looks like we're still stuck on the PXE boot screen
[21:10:25] :(
[21:14:41] one more trick up my sleeve...gonna try PXE using HTTP instead of TFTP. No, I'm not optimistic ;)
[21:18:49] lol, yea sounds iffy. But can't hurt to try, probably
[21:19:11] on the one hand i always felt tftp was pretty esoteric, but on the other hand we've been using it for a long time. probably works
[21:19:29] Oddly enough, HTTP did the trick!
[21:19:44] oh nifty
[21:20:05] which is pretty funny, considering we had to fall back to TFTP to get it to work previously
[21:56:24] Puppet's running, let's see if we can rejoin the cluster
[22:12:49] It rejoined OK, but for some reason only psi has shards (I've already unbanned it on all three)
[22:12:56] curious
[22:13:46] best guess is that it gave up trying to route, since we were running with a single opensearch host for so long
[22:14:22] hmm, yea that seems possible, although i still would have thought shards from other nodes would shift in to balance things out
[22:15:23] ryankemper just ran `curl -XPOST 'localhost:9400/_cluster/reroute?retry_failed=true'` and that seems to have started some shards moving
[22:49:22] Thanks everyone for your help! I'll be back Tuesday. Current state of migration in https://etherpad.wikimedia.org/p/cloudelastic_opensearch#L1
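On the unban + reroute at the end ([22:12:49] / [22:15:23]): assuming the ban is the standard allocation-filtering cluster setting rather than something custom, and that it keys on host name, clearing it and retrying failed allocations would look roughly like this (exclude._host is an assumed key, it may be _name or _ip in practice; port 9400 matches the command quoted above):

    # clear the allocation exclusion for the reimaged host
    curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9400/_cluster/settings' \
      -d '{"transient": {"cluster.routing.allocation.exclude._host": null}}'
    # then retry allocations that previously hit the max-retries limit
    curl -s -XPOST 'localhost:9400/_cluster/reroute?retry_failed=true'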
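And on the sudachi question at [19:58:34]: beyond checking that the settings took, one low-level check is to run the _analyze API against an index built with the plugin and compare the tokens with those from a node without it; a sketch, assuming the dev cluster answers on localhost:9200 and has a jawiki-style index named jawiki_content with a CirrusSearch "text" field (both names are illustrative):

    # with sudachi active the output should be dictionary-based word segments rather than plain CJK bigrams
    curl -s -H 'Content-Type: application/json' -XPOST 'localhost:9200/jawiki_content/_analyze' -d '{
      "field": "text",
      "text": "東京都へ行きました"
    }'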