[13:05:24] wow TIL the eqiad cluster is now ~130 nodes
[13:07:03] elukey: re-imaging during the upgrade is going to be very fun
[13:07:09] for some definition of fun
[13:07:38] yep I was reading Janis' summary on phab
[13:10:06] akosiaris: tbh I think we should backport containerd 1.6
[13:10:15] oh?
[13:10:17] that bad?
[13:10:25] I mean, easy, it's a golang thing
[13:10:41] in fact, it might help me avoid something I've been trying to figure out
[13:10:45] I think it's the easiest way out of the reimage misery
[13:13:52] <_joe_> so we can just reimage at our own pace and not all at once?
[13:13:57] if we have containerd 1.6 in bullseye, we could reimage node-by-node to bookworm while the cluster is pooled
[13:14:02] yes
[13:14:09] <_joe_> yeah seems sensible
[13:14:21] same would be possible in the mixed version setup - but requires more work
[13:14:28] <_joe_> there's one downside though
[13:14:31] and mixed versions (k8s)
[13:14:48] <_joe_> you'd need any puppet changes to clean up the old stuff
[13:15:26] yeah..but I think thats manageable
[13:15:34] compared to reimaging 200 nodes in a week
[13:16:49] the 200 nodes in a week would be with a cluster depooled at the time?
[13:17:15] yeah
[13:17:21] like we did it last time basically
[13:18:51] last time, it was 10 smaller and it required still a day
[13:19:09] but we were mostly I think bottlenecked on reimages not being as parallel as we'd like
[13:23:09] yeah, even reimaging 4-6 nodes at a time, there are some steps (puppet run on alert hosts) that are single concurrency
[13:23:25] so you have a variable wait for the downtime step
[13:48:11] we thought about how we could probably exclude those puppet runs on alert hosts for this monster reimage, but it still feels way more flexible (to me) to not have to do it in one go
[13:48:56] Agreed, being under pressure to get all reimages done in a set amount of time would really not be fun
[13:52:07] it would not indeed. That being said, the switchover week is a pretty good week for doing some such chores
[14:53:29] <_joe_> incredible how before we had all the automation yuvi and I could easily reimage 200 servers in 2 days. I wonder if things really need to be serialized that much
[14:54:11] <_joe_> jayme: IMHO we should have a finite state machine for large batches of reimages which should only run puppet on the alert host once a batch is reimages
[14:54:13] <_joe_> *d
[14:54:25] <_joe_> but that's not what we're discussing here
[14:54:45] <_joe_> porting containerd seems *by far* the best option
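
Annotation: the finite state machine idea from [14:54:11] could look roughly like the sketch below. Nothing here is existing tooling; `BatchReimage`, `reimage_all`, and the `reimage` / `sync_alert_host` callbacks are hypothetical names, and the callbacks stand in for whatever actually reimages a host and runs puppet on the alert host. The only point it illustrates is that the single-concurrency alert-host step happens once per batch rather than once per node.

```python
from enum import Enum, auto
from typing import Callable, List


class State(Enum):
    PENDING = auto()
    REIMAGING = auto()
    ALERT_SYNC = auto()
    DONE = auto()


class BatchReimage:
    """Hypothetical state machine for one batch of hosts.

    Every host in the batch is reimaged first; the alert-host sync
    (e.g. a puppet run) then happens exactly once for the whole batch.
    """

    def __init__(self, hosts: List[str],
                 reimage: Callable[[str], None],
                 sync_alert_host: Callable[[List[str]], None]) -> None:
        self.hosts = list(hosts)
        self.reimage = reimage                  # assumed: reimages one host
        self.sync_alert_host = sync_alert_host  # assumed: one alert-host run
        self.state = State.PENDING

    def step(self) -> State:
        """Perform one transition and return the new state."""
        if self.state is State.PENDING:
            self.state = State.REIMAGING
        elif self.state is State.REIMAGING:
            # Sequential here for simplicity; a real tool would
            # presumably reimage the hosts of a batch in parallel.
            for host in self.hosts:
                self.reimage(host)
            self.state = State.ALERT_SYNC
        elif self.state is State.ALERT_SYNC:
            self.sync_alert_host(self.hosts)
            self.state = State.DONE
        return self.state

    def run(self) -> None:
        while self.state is not State.DONE:
            self.step()


def batches(hosts: List[str], size: int) -> List[List[str]]:
    """Split hosts into consecutive batches of at most `size` hosts."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def reimage_all(hosts: List[str], size: int,
                reimage: Callable[[str], None],
                sync_alert_host: Callable[[List[str]], None]) -> int:
    """Run every batch to completion; return the number of batches."""
    count = 0
    for batch in batches(hosts, size):
        BatchReimage(batch, reimage, sync_alert_host).run()
        count += 1
    return count
```

With a batch size in the 4-6 range mentioned at [13:23:09], the number of alert-host runs in this sketch scales with the number of batches instead of the number of nodes.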