[13:10:22] https://www.reddit.com/r/RedditEng/comments/11xx5o0/you_broke_reddit_the_piday_outage/
[13:30:13] <_joe_> interestingly, upgrading one cluster killed them?
[13:38:11] they actually "killed" off their calico route reflectors in the process
[14:23:13] For a deployment failure, immediately reverting is always “Plan A”, and we definitely considered this right off. But, dear Redditor… Kubernetes has no supported downgrade procedure. Because a number of schema and data migrations are performed automatically by Kubernetes during an upgrade, there’s no reverse path defined. Downgrades thus
[14:23:14] require a restore from a backup and state reload!
[14:23:24] I've been thinking about this lately (again...)
[14:24:16] especially during the eqiad wikikube upgrade. Which, ok, we could have just waited until we found whatever caused us issues, but I don't think our puppetization right now has an easy rollback path
[14:24:45] on the plus side, we don't have to restore anything from backups as everything is in git
[14:33:58] The nodeSelector and peerSelector for the route reflectors target the label `node-role.kubernetes.io/master`. In the 1.20 series, Kubernetes changed its terminology from “master” to “control-plane.” And in 1.24, they removed references to “master,” even from running clusters. This is the cause of our outage. Kubernetes node labels.
[14:34:02] ouch
[14:34:41] I guess we might want to change some terminology in the next upgrade
[14:35:16] wow
[14:36:38] akosiaris: I think they did add the control-plane term a few versions before dropping master. at least with kubeadm there were several versions with both labels present
[14:37:53] A really good write-up, I think the note on inconsistency being the root of the problem is insightful. We should try to minimize inconsistency as more teams build out clusters.
[14:38:13] taavi: 1.20 vs 1.24, I guess? Per the writeup at least (we can double check, but I have no reason to doubt them)
[14:38:40] yeah, something like that
[14:38:49] anyway, kubemaster1001.eqiad.wmnet Ready master 15d v1.23.14
[14:38:49] kubemaster1002.eqiad.wmnet Ready master 15d v1.23.14
[14:39:02] we need to adapt it seems
[14:39:03] note that we currently don't have any calico working as RR, and with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886329 + https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/887945 (the ASLoop part) the routers will work as route reflectors
[14:39:43] yeah, we are about 1/5 of the size we need for RRs to become a necessity
[14:39:54] they're already partially working as "RR"; the changes above will allow them to do this work fully (cf. https://phabricator.wikimedia.org/T328523 )
[14:39:55] but I think we are going to be there soon next FY
[14:40:25] what do you mean?
[14:40:53] we are at 20 nodes IIRC, but the plan is to order quite a few nodes next FY
[14:40:58] early next FY if possible
[14:41:10] but for the part about needing RR?
[14:41:26] the RR threshold, per calico, is ~100 nodes IIRC
[14:41:44] that's for the full mesh, we never did full mesh
[14:42:19] <_joe_> XioNoX: we're gonna move a lot of physical nodes to k8s as we move mediawiki traffic
[14:42:35] so this doesn't apply, but it is conceivable that around that size we'll hit other issues (too many nodes peering with the core routers?)
[14:43:31] akosiaris: we're far from the limits
[14:43:57] <_joe_> XioNoX: say wikikube is ~150 physical nodes in December
[14:44:00] XioNoX: not necessarily technical Juniper limits. It's also about being able to make some sense out of our infra
[14:44:07] <_joe_> would that get to the actual limits?
[14:44:42] are you super excited about 150 wikikube nodes BGP peering with cr* in, say, 9 months from now?
[14:45:13] it's not inconceivable we will hit something we were not expecting, that's what I am alluding to
[14:46:02] yeah, that's always possible
[14:47:50] for BGP sessions we're still far from the limits, I think it's in the 1000s
[14:48:16] and that will also go down with L3 to the ToR
[14:50:13] looking forward to seeing how that scales though!
[14:51:18] <_joe_> us too!
[14:53:05] good thing also that we're not the first company to scale it this way :)
[14:53:40] we don't really use the node-role labels for anything currently and they need to be managed manually as well, so for us this is a non-issue rn
[14:54:35] akosiaris: what do you mean went wrong during the eqiad wikikube upgrade?
[14:55:26] "rollback" in our case would have actually meant reverting the puppet changes and reimaging all masters and workers again. But there is no reason that should not have worked
[15:11:30] jayme: never tested :P
[15:11:38] and especially never done under pressure
[15:12:59] you mean the reimage with the old version was not tested?
[15:13:09] "there is no reason" -> "there is no *known* reason" ?
[15:15:36] <_joe_> the trick is to never actually be in a true hurry when you're upgrading
[15:16:29] <_joe_> that's one of the core principles we've operated under since the start
[15:16:55] <_joe_> because upgrading a k8s cluster "live" always felt to me like trying your luck
[15:48:07] jayme: I mean that the "oh crap, we need to abort and roll back" path hasn't even been fleshed out as a thing, never mind tested
[15:48:43] in theory, we can figure it out. Revert a few puppet patches, a few admin_ng patches, reimage everything, redeploy everything, etc.
[15:48:49] have we ever done it, even once?
[16:01:01] No. We did not
[17:44:31] Speaking of rolling back k8s... I guess everyone's probably read this, but posting anyway: https://www.reddit.com/r/RedditEng/comments/11xx5o0/you_broke_reddit_the_piday_outage/
[17:47:14] that's where the conversation started from :-P
[17:48:43] interesting. I guess my bouncer missed that ;)
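To make the failure mode quoted above concrete, here is a minimal sketch of a Calico BGPPeer whose selectors key off the pre-1.24 "master" node-role label, in the shape the write-up describes. The resource name and selector values are assumptions for illustration only, not Reddit's or wikikube's actual manifests.

```yaml
# Hypothetical sketch of the pattern described in the Reddit write-up;
# the resource name is made up and the selectors are only illustrative.
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-with-route-reflectors
spec:
  # Every node should hold a BGP session with the route reflectors...
  nodeSelector: all()
  # ...but the reflectors are located via a label that Kubernetes 1.24 no
  # longer applies, so after the upgrade this matches no nodes and the
  # route-reflector sessions disappear.
  peerSelector: has(node-role.kubernetes.io/master)
```

Since kubeadm clusters carried both labels from 1.20 until the 1.24 removal (as noted at 14:36:38), a selector on `node-role.kubernetes.io/control-plane`, or on a label the operators apply and manage themselves, would have survived the rename.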