[07:53:14] Got a quick appt tmrw morning, will be ~30ish mins late to weds meeting
[10:04:59] lunch
[12:09:49] lunch
[12:34:28] greetings!
[12:59:44] o/
[13:21:20] and welcome back
[13:22:25] thanks!
[16:36:25] back
[16:50:48] will not make unmtg today, have to cook for kids
[17:47:09] lunch, back in ~45
[18:36:08] back
[19:24:51] ebernhardson and ryankemper, wanted to talk about https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/elasticsearch/rolling-operation.py#L257 today. Mainly I want to figure out whether invoking the reimage cookbook on multiple hosts at the same time will work, as the reimage cookbook itself has prompts
[19:25:15] Will pop open a Google Meet at our normal time (2 PM PDT/4 PM CDT) if that works for y'all
[19:25:40] errr...should be the reimage option in the rolling-operation cookbook
[20:06:45] hmm, looking
[20:06:52] (forgot to mention lunch, back now)
[20:09:19] inflatador: maybe a bad idea, but could we avoid the issue by invoking it for different sets of hosts in multiple tmux panes? Like have 2 parallel invocations that each do half of the cluster?
[20:10:15] hmm, i guess this doesn't really support that
[20:10:39] Possibly? I just want to avoid having to pack a lot of state management into the cookbook, esp. considering the reimage cookbook errs a lot (not really its fault, doing a lot of calls to fiddly OOB stuff like DRACs)
[20:12:05] I filed a task today to allow targeting hosts for rolling-operation, that could make it more manageable without as much work https://phabricator.wikimedia.org/T312991
[20:12:06] i'm thinking probably not because it uses get_next_clusters_nodes() which in turn sources from the elasticsearch api's
[20:12:32] (re multiple invocations)
[20:13:36] yeah...the host targeting isn't foolproof either, but I'm guessing it could piggyback on the existing logic we have around anti-affinity, cluster status etc
[20:20:19] i'm not really seeing any nice ways, the `nodes = self.elasticsearch_clusters.get_next_clusters_nodes(self.start_datetime, self.nodes_per_run)` line ties us pretty tightly to accepting whatever is decided there. Hacky methods include monkey-patching that method but people aren't usually a fan :)
[20:21:09] * ryankemper is almost done re-setting up his desk
[20:21:44] * ryankemper catches up on backlog
[20:22:38] ebernhardson could we do it if we added some more logic to spicerack, or are we forced to monkey-patch Elastic's python libraries?
[20:22:44] i'm not sure about the part about running multiple reimages in parallel from the same cookbook, but i don't think my idea of invoking the rolling cookbook from separate shells is doable
[20:23:05] inflatador: certainly more logic to spicerack would do it, it's not an issue of elastic's python library
[20:23:18] i need to think for a sec but I don't fully understand the concern with the multiple hosts currently
[20:23:29] currently the cookbook is calling the reimage individually for each host, right?
[20:23:33] inflatador: it would just be the spicerack part, something like making spicerack.elasticsearch_cluster.ElasticsearchClusters._get_nodes_group operate on a limited set of nodes instead of everything the cluster returns
[20:24:17] possibly bad idea: allow passing a regex against hostnames that is applied in _get_nodes_group
[20:26:40] ryankemper reimage cookbook is interactive (prompts for DRAC pw), wondering if we handle that OK with multiple hosts at once. Also worried about what happens if one image fails and another succeeds, does it bail out or keep operating? What state are the failed hosts in?
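A minimal sketch of the hostname-targeting idea floated at 20:24:17 above, assuming the node objects coming back from the cluster API expose a `.name` attribute. `filter_nodes_by_hostname` is a hypothetical helper, not existing spicerack API; the real change would live in or around `ElasticsearchClusters._get_nodes_group`.

```python
import re
from typing import Iterable, List


def filter_nodes_by_hostname(nodes: Iterable, pattern: str) -> List:
    """Keep only nodes whose hostname matches the given regex.

    Hypothetical helper: `nodes` is assumed to be any iterable of objects
    exposing a `.name` attribute (a stand-in for whatever the elasticsearch
    cluster API returns). In spicerack the filtering would happen inside
    _get_nodes_group rather than in a free function like this.
    """
    regex = re.compile(pattern)
    return [node for node in nodes if regex.search(node.name)]


# Hypothetical usage: limit a rolling operation to a subset of eqiad hosts.
# selected = filter_nodes_by_hostname(all_nodes, r"^elastic105\d\.eqiad\.wmnet$")
```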
[20:27:00] it's entirely possible these questions are answered already and I just don't remember
[20:27:02] cookbook says it keeps going, which sounds dangerous :)
[20:27:27] but i suppose it will wait for green, so if it gets bad enough it can't get out of yellow it will just stall
[20:27:30] (1) IMO any concerns about reimages failing are already handled by the general rolling operation logic
[20:27:40] (erik beat me to it)
[20:28:17] now that doesn't cover the annoyance of how the cookbook will know to re-reimage on subsequent runs, but we at least have a good guarantee that the cluster won't go past yellow
[20:28:58] we will have to add some logic to use the "--new" flag on failed reimages, or do them manually
[20:29:04] (2) The interactive stuff is a bit thorny. I think we might want to pass the mgmt pw into stdin at the beginning of the cookbook run and have it store that in memory to supply for subsequent ones
[20:29:40] spicerack already caches the mgmt pwd, so if you call spicerack.run_cookbook() it should not ask for it again (AFAI remember)
[20:30:01] volans: ah thanks. I was going to say I thought I remembered it not being a problem and that would explain why
[20:30:55] if you call spicerack.management_password (even just as a throwaway statement) you can control when it asks for the password, in case you need the prompt earlier than when it's first needed
[20:31:08] https://doc.wikimedia.org/spicerack/master/api/index.html#spicerack.Spicerack.management_password
[20:31:37] inflatador: retrying the failed ones is definitely something to figure out though. I think the two options are either we manually run the sre.hosts.reimage cookbook with --new on failed nodes, or alternatively we can add some recovery logic to the rolling reimage. The only tricky thing there is not all failure conditions actually need the --new, but maybe we could use the presence/absence in puppetdb to decide that? unsure
[20:32:24] I doubt we'll get a solution that's totally hands off but ideally we'd have a solution where we only need to manually intervene a few times per cluster
[20:35:48] I'll take a look at the return values of the reimage cookbook and see how granular they are. like are we basically just getting zero vs non-zero exit code or is there enough granularity to be able to inform the cookbook about whether it should use the --new flag or not
[20:36:41] Looks like reimages take around 30m, and we have about 70 hosts in all
[20:37:10] give or take a few that we've manually reimaged already
[20:37:26] https://github.com/wikimedia/operations-cookbooks/blob/54cd9f94c3ecb66dd7c18b1ba536bc764f6a2789/cookbooks/sre/hosts/reimage.py#L574-L577 yeah we just get 0 or 1, which makes sense
[20:37:28] ryankemper cool yeh, let me know
[20:39:35] We should be pretty safe to just assume we want --new when retrying failures, since the reimage cookbook checks puppetdb for us: https://github.com/wikimedia/operations-cookbooks/blob/54cd9f94c3ecb66dd7c18b1ba536bc764f6a2789/cookbooks/sre/hosts/reimage.py#L126-L138
[20:39:45] but if we incorrectly supply --new it will pause waiting on input from us
[20:40:17] Also it's a bit hairy because generally we don't keep much in-memory state in the reimage cookbook, like if we fully aborted the rolling reimage and then kicked it off again the cookbook wouldn't have a way to remember that
[20:42:12] we have --start-datetime, would that be enough though?
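A rough sketch of the retry-with-`--new` idea discussed above, assuming a `spicerack.run_cookbook(cookbook, args)` call that returns the 0/1 exit code noted at 20:37:26; the helper name and the exact argument list for sre.hosts.reimage are made up for illustration. Because spicerack caches the management password (per volans), only the first invocation should prompt.

```python
def reimage_with_retry(spicerack, host: str, reimage_args: list) -> bool:
    """Reimage one host via the sre.hosts.reimage cookbook, retrying once with --new.

    Sketch only: assumes spicerack.run_cookbook(cookbook, args) returns the
    cookbook's exit code (0 = success, non-zero = failure). Whether --new is
    really appropriate on retry depends on how far the failed attempt got; the
    reimage cookbook's own puppetdb check is what ultimately validates that.
    """
    # First attempt: spicerack caches the mgmt password, so only this call
    # should ever prompt interactively.
    if spicerack.run_cookbook('sre.hosts.reimage', reimage_args + [host]) == 0:
        return True

    # Second attempt: assume the failed host should be treated as new.
    # (Hedged: not every failure mode actually wants --new.)
    return spicerack.run_cookbook('sre.hosts.reimage', ['--new'] + reimage_args + [host]) == 0
```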
[20:45:05] hmm, make that 45m for a reimage
[20:45:57] my remedial math skills say that it will take ~53h to reimage our hosts at that rate
[20:46:17] yea, thats painful :S
[20:46:48] and with failures, add a couple days
[20:48:02] Yeah...maybe cut it in half if we do run 1 from eqiad and 1 from codfw at the same time
[20:48:47] We'll want to stick with one cluster at a time IMO
[20:48:55] We need to always have a green cluster to cut over to when necessary
[20:50:11] The --start-datetime mostly will work, the only failure condition I can think of at first glance would be the reimage failing early enough that the host doesn't get reimaged at all and just comes back up with the old OS
[20:51:06] Is a green cluster a hard requirement? I feel like we can have both clusters in yellow at once, so long as we're only doing 1 host/dc at a time
[20:51:12] for any failure condition that results in the host not coming back, the --start-datetime would work (but the cookbook won't know to supply the `--new` flag unless it has that in memory, or if we add logic to our rolling reimage operation to check puppetdb [which is not super optimal because the reimage cookbook itself does that check too, so we'd be repeating ourselves])
[20:52:00] inflatador: So we can safely cut over to a yellow cluster, but it just increases the risk. In particular if a cluster is already yellow and we switch over traffic then theoretically the traffic could result in a host getting toppled, although I'm not sure how likely it is
[20:52:09] I don't have hard opposition to doing both clusters at once but my gut tells me it's not the best idea
[20:52:47] s/safely/semi-safely
[20:54:06] yeah, I'd prefer not to, but cutting the amount of work in half is maybe worth the risk? Probably want to ask Erik, David and MrG (and maybe other team members) their opinion
[20:54:31] Yeah, we can discuss more in pairing today
[20:54:55] on the other hand, it's mostly a babysitting job, so it's not like that 53 hours requires complete attention 100% of the time
[20:55:07] MrG's back on monday IIRC so we can def consult with him
[20:55:35] yeah, it's more about how quickly we want to push everything fwd, and that's more of a manager decision anyway
[20:56:27] If we're ready to start doing reimages tomorrow then let's start with one cluster and we'll discuss whether we want to parallelize to 2 on Monday
[20:56:38] But depending on how much logic we want to add before proceeding, it might take till monday anyway
[20:58:11] inflatador: gonna eat some food real quick, I'll be ready in 15 mins
[20:58:40] ACK, will wait until then to put up the Meet
[21:16:53] OK, up at https://meet.google.com/euw-yyev-fsn whenever y'all want to join
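Spelling out the back-of-the-envelope estimate from 20:45-20:48 above (roughly 70 hosts at ~45 minutes per reimage, one host at a time, and the halving that running one host per DC in parallel would buy, ignoring failures and retries):

```python
# Rough wall-clock estimate from the numbers quoted in the log above.
hosts = 70              # approximate total, give or take already-reimaged ones
minutes_per_host = 45   # revised per-reimage estimate

serial_hours = hosts * minutes_per_host / 60
print(f"one host at a time:          ~{serial_hours:.1f}h")      # ~52.5h, i.e. the ~53h figure
print(f"one host per DC in parallel: ~{serial_hours / 2:.1f}h")  # ~26.2h
```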