[09:31:03] a bit hesitant to re-process an old ores_predictions_hourly dag run... risk of erasing newer predictions by pushing older ones, or doing nothing and having to wait for new edits to fix them
[11:42:03] lunch
[14:13:23] o/
[14:29:40] inflatador, ryankemper: I'd go ahead and migrate the main elastic role to Puppet 7, unless it's currently a bad time?
[14:30:18] moritzm Now's a good time! Thanks for the heads up
[14:30:50] excellent, I'll ping the channel when I'm done. given the size of the cluster the cookbook will run 20-30m
[14:32:54] ACK, sounds good
[15:18:54] the cookbook ran into a not-yet-seen-before puppetserver traceback, which disrupted it; I'm replaying the steps it missed, but it's going to take a bit longer
[15:19:38] but elastic1053 e.g. is the first node fully migrated (tested the missing steps with it)
[15:36:17] inflatador: hey, we have 2 wdqs servers in rack A2 which we're working on today
[15:36:37] wdqs2013 and wdqs2023. are we ok to proceed, do you know?
[15:38:50] topranks Y, go ahead... sorry I missed those
[15:39:06] ok that's great, thanks!
[15:42:27] all elastic::cirrus are on Puppet 7 now. that was a lot more exciting than I had wished for, but should all be fine now
[15:42:37] let me know if you run into any issues
[15:51:55] moritzm excellent, thanks for your help
[15:58:05] moritzm: I missed the excitement :(. But thanks for moving this forward!
[16:02:35] specifically caused by https://phabricator.wikimedia.org/T349619#9521584 - a decom of a server made puppetserver emit an internal server error, which disrupted the cookbook
[16:03:06] obviously with 1524 hosts migrated this would only show up when moving our biggest cluster :-)
[16:04:03] Something to remember for the Puppet 8 migrations? ;P
[16:04:37] I hope not! the whole process to move from Puppet 5 to 7 is a bit like "The Wages of Fear"; 7 to 8 should be "just" a normal update
[16:13:35] inflatador: everything moved ok, however we are having a little issue with the SFP module for elastic2038, but working on it
[16:14:18] topranks ACK, I saw that in the dc ops channel. As you said, no rush on that one
[16:14:38] ok cool thanks
[16:15:24] yeah just mentioned that, in my experience getting flustered with something like that doesn't help, need to stay zen :)
[16:22:57] inflatador: can we reboot elastic2038? not 100% sure what is up, the cabling is fixed but the OS is not registering the 10G NIC all of a sudden
[16:23:17] (the idrac system mgmt does show the NIC as present however, and the switch has a working link physically)
[16:23:33] I was going to suggest a power down, reseat/inspect the NIC, then power back on?
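A minimal sketch of the "ban the node" precaution referenced later in this log (a node not being banned before a reimage comes up below): before a data node like elastic2038 is powered down, it can be excluded from shard allocation via Elasticsearch's stock allocation-filtering API, and the exclusion cleared once the node is healthy again. The endpoint and node-name wildcard here are assumptions for illustration, not taken from this log; the actual runbook may wrap this differently.

  # exclude elastic2038 from shard allocation before the power-down
  # (endpoint and node-name pattern are assumed, not from this log)
  curl -s -XPUT https://search.svc.codfw.wmnet:9243/_cluster/settings \
    -H 'Content-Type: application/json' \
    -d '{"transient": {"cluster.routing.allocation.exclude._name": "elastic2038*"}}'

  # clear the exclusion once the node is back up
  curl -s -XPUT https://search.svc.codfw.wmnet:9243/_cluster/settings \
    -H 'Content-Type: application/json' \
    -d '{"transient": {"cluster.routing.allocation.exclude._name": null}}'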
[16:25:58] topranks sure. It's all yours, we don't need it for the short term ;)
[16:26:06] cool
[16:26:11] honestly, we could decom it if it's too stubborn
[16:27:17] it's about ~6mo past refresh date ;(
[16:33:01] ha ok
[16:33:24] well it's back up now, the reseat did the job; likely the card was a little loose and got shifted when the cable was replaced
[16:34:50] inflatador: I noticed while watching the boot messages for elastic2038 that the prometheus-wmf-elasticsearch-exporter-9[2|4]00 services didn't start
[16:35:01] not sure if that matters, I see related timers, but figured I'd mention it
[16:35:25] topranks no worries, I usually run puppet a few times when that happens, eventually it gets sorted
[16:35:38] cool, good to know
[16:35:59] anyway, all back up now so you can repool things
[16:42:31] excellent, thanks
[16:55:06] workout, back in ~40
[17:49:21] back
[18:00:35] brb
[18:31:43] sorry, been back
[18:48:36] CR for the cloudelastic1008 migration if anyone has a chance to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/998494
[18:48:46] taking lunch, back in ~40
[19:30:37] back
[19:30:50] nm on the CR above, b-tullis is reviewing
[21:33:45] cloudelastic is red again... checking it out now
[21:36:01] wikidatawiki_content_1692131793 is unassigned/primary failed. Probably from me not banning 08 before the reimage ;(
[21:36:24] hmm, there should have been at least two copies, it would have had to be yellow when taking the node out
[21:36:30] but maybe
[21:36:54] it's shard 19, both P and R are shown as unassigned
[21:37:11] hmm, yea if there are no nodes to bring into the cluster that will be a problem
[21:37:28] well, the recovery i suppose is snapshot from eqiad to swift, restore into cloudelastic?
[21:38:03] Y... I was hoping it wouldn't come to that, but I guess that's on me
[21:38:25] I don't remember it being yellow, but maybe I missed that
[21:39:37] anyway, starting the snapshot procedure
[21:42:48] poking at the logs, but not sure yet. There are too many :P
[21:43:57] snapshot is going: https://etherpad.wikimedia.org/p/cloudelastic-restore
[21:44:31] probably shoulda used eqiad
[21:44:48] hmm, so at 19:39 shard 19 failed with cloudelastic1003 disconnected. at 19:43 cloudelastic1008 left the cluster and took the other copy
[21:44:59] suggests 1003 should have it?
[21:46:09] i dunno... either way, yea, we have to do the snapshot recovery
[21:46:10] Y, it looks like it's still in the cluster... would a force reallocate work?
[21:47:03] it looks like 1003 started a recovery with 1008 but it failed to finish
[21:47:46] seems like an awkward edge case, but it seems 1003 failed to recover from the only available replica, so it didn't promote the shard it had to primary
[21:48:12] i suppose in terms of correctness, it knew the data it had was out of date
[21:49:43] yeah, it'd be faster to recover using the replica data, but then I yanked the cord on the primary
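For reference, the shard diagnosis above comes down to two read-only calls; a minimal sketch against the cloudelastic endpoint that appears later in this log, with the index and shard values being the ones under discussion (the allocation-explain request body is standard Elasticsearch):

  # list shard copies that currently have no assigned node
  curl -s https://cloudelastic.wikimedia.org:9243/_cat/shards | grep UNASSIGNED

  # ask the cluster why a specific shard copy is unassigned
  curl -s -XGET https://cloudelastic.wikimedia.org:9243/_cluster/allocation/explain \
    -H 'Content-Type: application/json' \
    -d '{"index": "wikidatawiki_content_1692131793", "shard": 19, "primary": true}'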
[21:53:30] As far as the writes we've missed/will miss, will the Saneitizer fix that eventually? Or will that have to wait until we come up with the new process?
[21:54:11] so fun thing, wikidatawiki is currently receiving writes from SUP and not cirrus, so no :P But in this case i think we can flip that around and turn cirrus writes back on for wikidatawiki
[21:54:52] i mentioned it in the wed meeting earlier, but i think recognizing that we don't have the necessary procedures for SUP is a good reason to pause rolling it out further (or in this case even back it up a little) and get that process into place
[21:56:20] point taken
[22:08:30] turning writes back on for wikidatawiki on cloudelastic now (it might complain a little about writes failing until the snapshot is restored, i suppose)
[22:12:50] this is for commons though, right? Not wikidata?
[22:14:11] says wikidatawiki_content: curl -s https://cloudelastic.wikimedia.org:9243/_cat/shards | grep UNASSIGNED
[22:17:07] i suppose this will also give us some info about what the SUP does when writing to a red index :)
[22:17:53] i'm assuming they get bucketed into the _FAILURE numbers, but not seeing any yet
[22:19:17] oh, the taskmanager dies apparently. Only seeing a jobmanager right now for consumer-cloudelastic
[22:21:40] looks like it died with the message 'FlinkRuntimeException: Complete bulk has failed.' in the sink writer's BulkListener afterBulk handler
[22:22:10] * ebernhardson writes a ticket :P
[22:23:51] accidentally snapshotted commonswiki, just sent a new snapshot request to eqiad for the correct index: https://etherpad.wikimedia.org/p/cloudelastic-restore
[22:30:41] curiously, it looks like the actual problem with SUP falling over is misaligned timeouts. SUP didn't get an error response from elasticsearch, rather it timed out the request after 30s
[22:30:53] or part of the problem at least
[22:34:17] that's not great
[22:36:07] i'm not sure why a red status would cause the writes to suddenly take a long time though. I'm not seeing anything that says elastic will delay writes under a red condition, everything says they are rejected. I guess i expected the bulk response to simply have ShardNotFound or some similar error in it
[22:38:25] also i didn't know this, but apparently elastic has an `allocate_stale_primary` option that will try and use an old copy of the data from disk
[22:39:31] i guess it couldn't hurt to see what it does, although it might be a bit late
[22:40:59] agreed... can that be changed without restarting the cluster?
[22:42:14] you did something ;)
[22:42:25] inflatador: hmm, yea it seems to have worked. Sec, i'll phab paste it
[22:43:26] inflatador: https://phabricator.wikimedia.org/P56486
[22:43:37] amusingly, you have to add that "accept_data_loss: true" flag or it won't work :)
[22:44:34] ACK, will add to our Search page shortly
[22:44:44] so that shard is now back, the cluster is yellow, but it has old data. I guess there is a lesson here that elasticsearch really does try not to delete your data :)
[22:45:15] with cirrus writes turned back on we can use the normal outage recovery from wt:Search to replay the writes
[22:46:15] Is that change active already? If so, happy to run the PHP script
[22:46:25] yea, i deployed that about half an hour ago :)
[22:48:01] Cool, will hit the script shortly and update
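The exact command used is in the P56486 paste linked above; roughly, an allocate_stale_primary reroute looks like the sketch below, with index and shard taken from the discussion. The node value is an assumption, cloudelastic1003 being the host that still held the stale copy; the node name registered in the cluster may carry a suffix.

  # force-allocate the stale on-disk copy as primary; accept_data_loss acknowledges
  # that writes made after that copy are lost until replayed
  curl -s -XPOST https://cloudelastic.wikimedia.org:9243/_cluster/reroute \
    -H 'Content-Type: application/json' \
    -d '{
          "commands": [{
            "allocate_stale_primary": {
              "index": "wikidatawiki_content_1692131793",
              "shard": 19,
              "node": "cloudelastic1003",
              "accept_data_loss": true
            }
          }]
        }'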
[22:53:01] for the $wiki var in the script, is it basically just the output of _cat/indices?
[22:53:47] inflatador: it's the wiki id (aka db name). So in this case it's wikidatawiki
[22:54:28] i think over the years people have been trying to migrate from wiki ids to partial domain names (ex: it.wikipedia), but wiki ids are prevalent
[22:54:58] besides, dbnames followed the rules, so we get silly names like mediawikiwiki and wikidatawiki
[22:56:37] ebernhardson I was thinking that if the cluster was red, no writes would get through? Or is it only that specific index?
[22:56:53] inflatador: should be only that specific shard, writes that land on other shards should have gone through
[22:57:23] queries would have failed because they query all shards and not all were available, but writes should have only failed if they couldn't route to a shard
[22:57:31] but i say should, i haven't tested this extensively :P
[22:58:03] good to know... also good to know that wiki id != index name
[22:58:51] it is the prefix though, we set index_base_name = wiki_id, and then the index_base_name is prefixed to everything. So you get `wikidatawiki_content_123456789` as the index name, then aliases from `wikidatawiki` and `wikidatawiki_content` point at that index
[23:04:28] oh cool, they changed the capitalization of ForceSearchIndex.php ;(
[23:09:42] there's also a wikidatawiki_general?
[23:10:30] yes, mediawiki has the concept of content pages vs everything else. We put articles (or for wikidata, Q items) in the _content index, and everything else in a _general index
[23:10:56] this makes searching for content more efficient (smaller indices) and more accurate (the statistical language model better reflects the content)
[23:23:30] oh yeah, **everything** has a general
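To see the alias layering described above on the cluster itself, the _cat APIs are enough; a quick sketch against the same cloudelastic endpoint (index names, counts, and sizes in the real output will naturally differ per wiki):

  # aliases such as wikidatawiki, wikidatawiki_content and wikidatawiki_general,
  # each pointing at a timestamped index like wikidatawiki_content_1692131793
  curl -s 'https://cloudelastic.wikimedia.org:9243/_cat/aliases/wikidatawiki*'

  # the underlying timestamped indices, with doc counts and sizes
  curl -s 'https://cloudelastic.wikimedia.org:9243/_cat/indices/wikidatawiki*'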