[09:14:22] elukey: we will need to verify how to refresh all service account tokens when the signing key changes (also for moving "back" to using a pki cert for that again). I remember akosiaris and me having to deal with that long ago... but I don't recall exactly what we did. It might have been "just delete all type=kubernetes.io/service-account-token secrets and recreate the pods that use them". akosiaris, do you remember by chance?
[09:43:07] o/
[09:44:13] with "when the signing key changes" you mean when we'll be using PKI and puppet rotates an expiring cert?
[09:46:57] no, I mean when we move from the cergen key to the pki key
[09:54:34] ah okok
[09:54:48] but what happens when the pki key rotates though?
[09:55:14] we'll have the machinery in place in etcd etc..
[09:55:22] right, nevermind
[09:55:47] all right, if you want to proceed with the merge I am here to help and test on ml-staging-codfw
[09:55:55] still need to roll out coredns and the rest
[09:58:45] I'd like to get some confidence on the istio webhook issue before, tbh
[09:59:27] I'm a bit afraid there is something hiding in the dark
[09:59:40] (because of multiple control planes)
[10:03:05] but there seems to be no issue atm, right?
[10:03:15] or do we have a way to repro?
[10:03:55] it magically started working indeed. But I did not investigate further
[10:04:27] but it seemed as if ctrl 1002 was unable to connect to the webhook while 1001 was able to
[10:04:59] or istiod had not started the webhook properly before yesterday ~17:20Z
[10:04:59] in theory when we'll be ready to switch to PKI we'll need to re-test everything anyway in the staging clusters, I am pretty sure that istio is not the only one that may raise problems with the multiple-control-plane thing
[10:05:41] oh, yes yes. But I do think this is something completely unrelated to service account tokens
[10:06:00] as the error was a timeout
[10:08:40] ok, we can try to force an istiod pod kill and see if we can repro, maybe it comes up
[10:09:24] but my point is - we are going to use a single TLS cert for account tokens, so whatever we test after it on 1.23 will not show problems that could emerge from multiple control planes
[10:10:28] well... not from the token problems. But there might be others
[10:10:36] but maybe you're right
[10:10:56] we should probably "redo" aux and ml-staging
[10:11:14] like in: wipe etcd and re-deploy
[10:11:48] after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/889808/
[10:12:18] just to be on the safe side
[10:12:33] we can do it, I need to tweak the cookbook a little since the downtimes don't last for long
[10:12:36] gimme 10 mins
[10:13:01] can the cookbook do the wipe without any reimages?
[10:13:07] of course
[10:13:12] niiiice
[10:13:23] it will reimage ctrl and workers though
[10:13:31] ah
[10:13:36] that's what I meant :)
[10:13:48] how do you wipe them? :D
[10:14:00] we don't
[10:14:12] they don't store data - so who cares
[10:14:29] sure, then it is a single etcdctl command, no?
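A rough sketch of what that "single etcdctl command" could look like once the kube daemons are stopped; the endpoint and TLS paths below are placeholders, not the actual cluster layout:

    # With the apiservers/controller-managers/schedulers stopped, drop every key
    # in etcd. Endpoint and certificate paths are placeholders.
    ETCDCTL_API=3 etcdctl \
      --endpoints=https://etcd1001.example.org:2379 \
      --cacert=/etc/etcd/ssl/ca.pem \
      --cert=/etc/etcd/ssl/client.pem \
      --key=/etc/etcd/ssl/client.key \
      del "" --from-key=true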
[10:14:30] I'd stop kube*, wipe etcd, start kube*
[10:14:38] yep
[10:15:05] ah no, then we need to do it manually
[10:15:07] and probably restart all kubelets and kube-proxies to be sure
[10:15:10] ack
[10:15:20] it seems a good use case for another cookbook though
[10:16:13] mmm wait, maybe the upgrade cookbook supports it
[10:16:39] we have options to skip control plane and workers
[10:16:46] and IIUC in theory it should stop the kube daemons
[10:16:55] past Luca may have thought about it
[10:17:32] :)
[10:18:31] ah no, sorry, it doesn't work: if we skip something it doesn't touch it
[10:18:35] that makes sense
[10:21:47] if you have patience I can write a new cookbook
[10:22:01] shouldn't be too difficult keeping the current upgrade skeleton
[10:22:38] you think we will ever need that again?
[10:26:11] it could be handy for staging etc..
[10:28:51] elukey: yeah... you're probably right
[10:29:09] I was trying to be lazy and not re-do staging completely 😇
[10:29:21] but it's probably the safest bet
[10:29:23] gimme 10 mins :)
[10:51:29] first draft in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/889956
[10:56:30] (refactoring one thing that will surely end up in a comment)
[10:57:17] (and prospector hates me as always)
[10:57:53] ready for a review
[10:59:20] ah right, it is missing one bit - since there are no reimages we need to force a puppet run
[11:02:18] done
[11:11:51] looking... I still wasn't able to figure out how to map the JWT key-id to the actual key used for signing :-/
[11:18:41] elukey: looks great!
[11:30:21] fixed comments :)
[11:32:06] great. I now figured the key ID out as well...
[11:32:29] super :)
[11:32:42] I am going out for lunch, if you need the cookbook, merge + test it anytime :)
[12:00:21] I've disabled puppet on the wikikube and ml-staging masters, will merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/889808/ now
[12:24:19] elukey: some small changes came out of the dry-run https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/889972
[12:55:59] elukey: all good after re-deploying everything in aux
[12:56:19] the istio webhook came up after a couple of errors (like in wikikube staging)...
[12:57:00] maybe I was wrong and it had something to do with the tokens after all, and istiod was unable to launch the webhook server. I will try to find some logs to back this up
[14:23:27] jayme: \o/
[14:26:28] I've reviewed the change, looks good!
[14:27:19] I can wipe ml-staging-etcd after it is merged and deploy on it
[14:29:02] elukey: cool. Uploaded the changes just about now
[14:31:59] jayme: +1ed, I left a nit for the last ask_confirmation
[14:32:21] I think that we could be more explicit so people will definitely wait before hitting "go"
[14:34:37] elukey: took a shot. :) Feel free to merge if you're happy with it
[14:35:19] looks perfect :)
[14:35:31] going to wait for the merge and then I'll use it for ml-staging-codfw
[14:35:45] cool
[14:36:13] so the procedure that you used is: run puppet on ctrl + wipe
[14:37:07] the procedure I used is: run the cookbook and then helmfile apply all the things
[14:37:18] oh, yeah... run puppet, lol
[14:37:20] sorry
[14:37:27] forgot I disabled it on ml
[14:37:32] okok :)
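An aside on the key-ID question earlier: as far as I can tell from Kubernetes' pkg/serviceaccount code, the "kid" in the token header is the unpadded base64url SHA-256 of the DER-encoded public key, so it can be recomputed from the signing cert and compared with a token. The paths and secret name below are placeholders, and the snippet assumes GNU coreutils' basenc is available:

    # Recompute the key ID Kubernetes would advertise for a given signing cert
    # (extract the public key, hash the DER/PKIX form, base64url without padding).
    openssl x509 -in /etc/kubernetes/pki/sa-signing-cert.pem -pubkey -noout \
      | openssl pkey -pubin -outform DER \
      | openssl dgst -sha256 -binary \
      | basenc --base64url | tr -d '='

    # Compare with the "kid" in an existing token's JOSE header
    # (first dot-separated segment, base64url -> base64 with re-padding).
    TOKEN="$(kubectl -n kube-system get secret some-sa-token-secret -o jsonpath='{.data.token}' | base64 -d)"
    cut -d. -f1 <<<"$TOKEN" \
      | tr '_-' '/+' \
      | awk '{ n = length($0) % 4; if (n) $0 = $0 substr("==", 1, 4 - n); print }' \
      | base64 -d | jq -r .kid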
[15:04:39] ftr: I can't find any clue in the aux istiod logs from yesterday on why calling the webhook failed on 1002. All I see is the connection timeout, but the webhook itself seems to have started properly (pod name was istiod-6c6575ffdd-v2bdn)
[15:36:46] I have a problem with coredns
[15:36:50] 0/4 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
[15:42:37] oh
[15:42:47] I am a little puzzled though
[15:42:58] how many replicas do you have?
[15:43:43] 4 afaics
[15:43:59] 2 are scheduled on the worker nodes, the other two show this behavior
[15:45:14] yeah, the anti-affinity probably tries not to schedule 2 pods onto the same node
[15:46:11] otoh it does everywhere else ...
[15:46:17] but on aux it works..
[15:47:16] ah interesting, on aux if I do kubectl get pod -o yaml etc.. I don't see any affinity rule
[15:47:30] meanwhile I see one for pods in ml-staging-codfw
[15:47:31] yeah... I was wondering :)
[15:47:47] you got the right chart version?
[15:48:36] # TODO: Ideally we would spread CoreDNS pods across availability zones, but we
[15:48:39] # will skip that step for now while on Ganeti hosts.
[15:48:42] affinity: null
[15:48:44] :D
[15:48:54] this is why it works on aux :D
[15:49:45] I can lower the replica count to 2
[15:50:21] yeah, that's what we do on wikikube as well
[15:51:33] I also need to merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/889760 for the istio/knative/kserve settings
[15:51:36] I'll add it :)
[15:54:52] ah, nice. Let me quickly -1 you :-p
[16:01:45] updated :)
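One way to eyeball the affinity difference between the two clusters described above; the namespace and label selector are assumptions about how the chart is deployed, so adjust to the real values:

    # Print each coredns pod together with the affinity stanza that was rendered
    # for it; on aux this comes back null, on ml-staging-codfw a podAntiAffinity shows up.
    kubectl -n kube-system get pods -l app.kubernetes.io/name=coredns -o json \
      | jq '.items[] | {name: .metadata.name, affinity: .spec.affinity}'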
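Circling back to the opening question about refreshing service account tokens after a signing-key change, the "delete the secrets and recreate the pods" approach could look roughly like this; it assumes the pre-1.24 secret-based tokens, and the namespace in the restart step is a placeholder:

    # Delete every legacy service-account token secret; the token controller
    # recreates them, signed with the new key.
    kubectl get secrets --all-namespaces \
      --field-selector type=kubernetes.io/service-account-token \
      -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' \
      | while read -r ns name; do kubectl -n "$ns" delete secret "$name"; done

    # Then restart the workloads that mount those tokens so they pick up the new
    # ones, e.g. per namespace:
    kubectl -n some-namespace rollout restart deployment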