[13:10:22] https://www.reddit.com/r/RedditEng/comments/11xx5o0/you_broke_reddit_the_piday_outage/
[13:30:13] <_joe_> interestingly, upgrading one cluster killed them?
[13:38:11] they actually "killed" off their calico route reflectors in the process
[14:23:13] For a deployment failure, immediately reverting is always “Plan A”, and we definitely considered this right off. But, dear Redditor… Kubernetes has no supported downgrade procedure. Because a number of schema and data migrations are performed automatically by Kubernetes during an upgrade, there’s no reverse path defined. Downgrades thus
[14:23:14] require a restore from a backup and state reload!
[14:23:24] I've been thinking about this lately (again...)
[14:24:16] especially during the eqiad wikikube upgrade. Which, ok, we could have just waited until we found whatever caused us issues, but I don't think our puppetization right now has an easy rollback path
[14:24:45] on the plus side, we don't have to restore anything from backups as everything is in git
[14:33:58] The nodeSelector and peerSelector for the route reflectors target the label `node-role.kubernetes.io/master`. In the 1.20 series, Kubernetes changed its terminology from “master” to “control-plane.” And in 1.24, they removed references to “master,” even from running clusters. This is the cause of our outage. Kubernetes node labels.
[14:34:02] ouch
[14:34:41] I guess we might want to change some terminology in the next upgrade
[14:35:16] wow
[14:36:38] akosiaris: I think they did add the control-plane term a few versions before dropping master. at least with kubeadm there were several versions with both labels present
[14:37:53] A really good write-up, I think the note on inconsistency being the root of the problem is insightful. We should try to minimize inconsistency as more teams build out clusters.
[14:38:13] taavi: 1.20 vs 1.24, I guess? Per the writeup at least (we can double check, but I have no reason to doubt them)
[14:38:40] yeah, something like that
[14:38:49] anyway, kubemaster1001.eqiad.wmnet Ready master 15d v1.23.14
[14:38:49] kubemaster1002.eqiad.wmnet Ready master 15d v1.23.14
[14:39:02] we need to adapt it seems
[14:39:03] note that we currently don't have any calico working as RR, and with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886329 + https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/887945 (the ASLoop part) the routers will work as route reflectors
[14:39:43] yeah, we are about 1/5 of the size we need for RRs to become a necessity
[14:39:54] they're already partially working as "RR"; the changes above will allow them to do this work fully (cf. https://phabricator.wikimedia.org/T328523 )
[14:39:55] but I think we are going to be there soon next FY
[14:40:25] what do you mean?
[14:40:53] we are at 20 nodes IIRC, but the plan is to order quite a few nodes next FY
[14:40:58] early next FY if possible
[14:41:10] but for the part about needing RR?
[14:41:26] the RR threshold, per calico, is ~100 nodes IIRC
[14:41:44] that's for the full mesh, we never did full mesh
[14:42:19] <_joe_> XioNoX: we're gonna move a lot of physical nodes to k8s as we move mediawiki traffic
[14:42:35] so this doesn't apply, but it is conceivable that around that size we'll hit other issues (too many nodes peering with the core routers?)
[14:43:31] akosiaris: we're far from the limits
[14:43:57] <_joe_> XioNoX: say wikikube is ~150 physical nodes in December
[14:44:00] XioNoX: not necessarily technical Juniper limits. It's also about being able to make some sense out of our infra
[14:44:07] <_joe_> would that get to the actual limits?
[14:44:42] are you super excited about 150 wikikube nodes BGP peering with cr* in, say, 9 months from now?
[14:45:13] it's not inconceivable we will hit something we were not expecting, that's what I am alluding to
[14:46:02] yeah, that's always possible
[14:47:50] for BGP sessions we're still far from the limits, I think it's in the 1000s
[14:48:16] and that will also go down with L3 to the ToR
[14:50:13] looking forward to seeing how that scales though!
[14:51:18] <_joe_> us too!
[14:53:05] good thing also that we're not the first company to scale it this way :)
[14:53:40] we don't really use the node-role labels for anything currently and they need to be managed manually as well, so for us this is a non-issue rn
[14:54:35] akosiaris: what do you mean went wrong during the eqiad wikikube upgrade?
[14:55:26] "rollback" in our case would have actually meant reverting the puppet changes and reimaging all masters and workers again. But there is no reason that should not have worked
[15:11:30] jayme: never tested :P
[15:11:38] and especially never done under pressure
[15:12:59] you mean the reimage with the old version was not tested?
[15:13:09] "there is no reason" -> "there is no *known* reason" ?
[15:15:36] <_joe_> the trick is to never actually be in a true hurry when you're upgrading
[15:16:29] <_joe_> that's one of the core principles we've operated under since the start
[15:16:55] <_joe_> because upgrading a k8s cluster "live" always felt to me like trying your luck
[15:48:07] jayme: I mean that the "oh crap, we need to abort and roll back" path hasn't even been fleshed out as a thing, never mind tested
[15:48:43] in theory, we can figure it out. Revert a few puppet patches, a few admin_ng patches, reimage everything, redeploy everything, etc.
[15:48:49] have we ever done it, even once?
[16:01:01] No. We did not
[17:44:31] Speaking of rolling back k8s... I guess everyone's probably read this, but posting anyway: https://www.reddit.com/r/RedditEng/comments/11xx5o0/you_broke_reddit_the_piday_outage/
[17:47:14] that's where the conversation started from :-P
[17:48:43] interesting. I guess my bouncer missed that ;)
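To make the failure mode quoted above concrete, here is a minimal sketch of a Calico BGPPeer whose selectors key off the pre-1.24 "master" node-role label, in the shape the write-up describes. The resource name and selector values are assumptions for illustration only, not Reddit's or wikikube's actual manifests.

```yaml
# Hypothetical sketch of the pattern described in the Reddit write-up;
# the resource name is made up and the selectors are only illustrative.
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-with-route-reflectors
spec:
  # Every node should hold a BGP session with the route reflectors...
  nodeSelector: all()
  # ...but the reflectors are located via a label that Kubernetes 1.24 no
  # longer applies, so after the upgrade this matches no nodes and the
  # route-reflector sessions disappear.
  peerSelector: has(node-role.kubernetes.io/master)
```

Since kubeadm clusters carried both labels from 1.20 until the 1.24 removal (as noted at 14:36:38), a selector on `node-role.kubernetes.io/control-plane`, or on a label the operators apply and manage themselves, would have survived the rename.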