[08:54:57] cdanis: ack. Please share how it went :-) [09:04:58] jayme: I am progressing with the cookbook, almost done (two worker nodes were left, now it is finishing the last one) [09:05:14] weird thing - kubernetes-client (and hence kubectl) was not installed on aux ctrl nodes [09:05:15] elukey: nice! [09:05:23] (so the verification step failed) [09:05:24] not so nice [09:05:30] the rest looks fine though [09:06:34] you say on aux ctrl nodes... on ml-staging ctrl nodes it got installed? [09:06:43] yes [09:06:50] just to add more weirdness [09:06:59] hmpf [09:08:32] I'll take a look [09:09:26] ml-staging-ctrl* still got 1.20.5+really1.20.2-1 though...:/ [09:10:04] ah, you've not reimaged them after I merged the packaging patch [09:10:41] akosiaris: o/ IIRC you did some etcd reimages with confXXXX nodes, I have a question about etcd. I naively thought that if a node is reimaged then it auto recovers when it starts again (so it asks the other nodes for the raft log and rejoins the cluster). What I found from my tests in staging is that the reimaged node refuses to restart, since the raft log is "behind" the rest of the cluster [09:10:47] (since it is empty). The k8s upgrade cookbook uses the "one-node-at-the-time" etcd reimage, but to make it work I have to (manually) remove the reimaged node via etcdctl and re-add it (to allow it to appear as new and fetch the raft log from the other nodes). Is there anything that I am missing? [09:11:07] jayme: I am doing it this morning, if the etcd cluster allows me :D [09:11:28] elukey: ack [09:12:54] dcausse: o/ will you be around on 2023-02-21 to assist with properly starting and stopping rdf-streaming-updater in codfw? [09:13:25] properly as in "the tested way" :) [09:13:32] jayme: o/ yes I should be here! :) [09:13:35] https://phabricator.wikimedia.org/T329664 [09:13:39] dcausse: cool [09:18:23] made that explicit in the action item list [09:27:19] elukey: role::aux_k8s::master does not include profile::kubernetes::client - so all good [09:32:47] jayme: ah nice! Didn't see it.. Should it have it? [09:32:58] anyway, aux's cookbook run completed! [09:33:51] yay [09:34:38] it's not really required on masters tbh but I think it's nice to have it around just in case (like deployment server failing - whatever) [09:34:42] it should not hurt [09:35:26] we had a comment in some of the roles that it is not required, though. I think that's why it was left out when creating aux [09:35:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/889486 [09:39:20] +1ed [09:41:19] ahah, pcc Hosts: auto choose to not do aux ;p [09:45:08] ...where it failed for different reasons [10:04:44] jayme: just to confirm, https://wikitech.wikimedia.org/w/index.php?title=Kubernetes/Clusters/New#Label_Kubernetes_Masters is still the same on 1.23 right? [10:05:08] yeah, no change there [10:06:11] and what about the worker nodes? I was trying to figure out what to do for the aux ones (ganeti vms) [10:11:31] ah lovely they are already added [10:11:34] perfect [10:13:04] yeah, the topology labels are auto-generated from netbox data now [10:13:42] ok so we should add a note about it in the docs [10:13:46] but the kubelet is not allowed to set the role [10:13:50] yeah, indeed [10:13:57] (I am going through the list for aux right now) [10:14:02] I've not updated the docs at all tbh [10:18:32] ah lovely aux uses calico-cni?
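A minimal sketch of the manual etcd member cycling elukey describes earlier in this excerpt (remove the reimaged node, re-add it so it joins as a fresh member and syncs the raft log from its peers). The endpoint and hostnames below are placeholders, not the real cluster names; the commands are standard etcdctl v3 usage rather than anything taken from the cookbook itself:

  # Find the member ID of the reimaged node; its local raft log is gone, so it
  # cannot rejoin under its old identity.
  etcdctl --endpoints=https://etcd1001.example.org:2379 member list

  # Remove the stale member from the cluster...
  etcdctl --endpoints=https://etcd1001.example.org:2379 member remove <MEMBER_ID>

  # ...and re-add it as a new member, so that on startup it pulls the raft log
  # from its peers instead of refusing to start with an out-of-date log.
  etcdctl --endpoints=https://etcd1001.example.org:2379 member add etcd1003 \
    --peer-urls=https://etcd1003.example.org:2380

  # On the reimaged node itself, etcd then needs to start with
  # initial-cluster-state=existing so it joins the existing cluster rather than
  # trying to bootstrap a new one.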
[10:19:15] User "calico-cni" cannot list resource "ipreservations" in API group "crd.projectcalico.org" at the cluster scope [10:19:55] every cluster uses calico-cni [10:20:25] elukey: is the proper calico chart version installed? [10:21:59] jayme: of course not, my bad, mixed it up with the ml things [10:22:04] lemme merge the change [10:23:10] jayme: is it ok if I roll out the new calico-crds and calico charts on top of the old ones, or should I remove it first? (forgot that step, I thought the change was merged already) [10:23:50] I can wait for it to fail in theory [10:24:43] you should wait for the rollback I'd say or you might end up in an undefined state situation [10:24:56] yes yes my bad [10:25:08] (earlier on I confused the calico-cni with the istio-cni) [10:25:44] ETOOMANYTHINGSTOREMEMBER :) [10:27:17] jayme: one last thing - calico-crds went fine of course, should I do some clean up for them too first? [10:28:00] elukey: should not be required. Helm will update them with the new version [10:28:24] super [10:29:40] (apologies to infra foundations for the mess :) [10:40:21] jayme: calico pods up :) [10:40:40] nice [10:48:32] jayme: one weird thing about coredns - I see [10:48:38] [INFO] plugin/ready: Still waiting on: "kubernetes" [10:48:48] one pod is running, the rest still logging the above [10:48:54] have you seen it before? [10:50:54] hmm...no [10:51:02] seems as if it is still waiting for the kubernetes api? [10:51:15] I would assume the same, yes [10:52:41] but that does not make a whole lot of sense... [10:53:02] I tried to kill a calico pod and now I get an error, so maybe this is the issue [10:54:22] 2023-02-15 10:53:29.292 [WARNING][9] startup/startup.go 442: Connection to the datastore is unauthorized error=connection is unauthorized: Unauthorized [10:54:35] from the calico-node-24qs2 log [10:55:40] yes and on the kube api I see [10:55:41] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]" [10:55:59] not sure if I forgot anything [10:56:51] now it works.. did you do something? [11:00:01] nope [11:00:11] just looking around :-) [11:01:28] tried to kill another one, and it shows the same error [11:01:43] that's strange... [11:01:51] you could try staging-codfw for reference [11:02:08] after a bit it works [11:02:10] lemme try [11:04:19] can't repro there [11:10:23] elukey: did you do https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878946/ ? Although I don't see why that should be a problem [11:11:37] mmm nope [11:11:46] I tried to kill calico-typha though [11:11:47] and I get [11:11:48] Failed to determine migration requirements error=unable to query ClusterInformation to determine Calico version: connection is unauthorized: Unauthorized [11:12:20] that feels as if service account keys are not signed properly...or not updated in time [11:12:32] (relocating brb) [11:21:28] now they are all up..mmm [11:23:03] jhathaway, cdanis: o/ - status update for aux-k8s on 1.23 - I tried to help and completed the cookbook started by Chris, plus I followed https://wikitech.wikimedia.org/w/index.php?title=Kubernetes/Clusters/New up to the Calico steps. Due to me being stupid I forgot to merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/889194 before starting, so the calico pods were not coming [11:23:09] up at first try.. [11:23:31] helm rolled back, I updated all the previous steps (including calico-crds) with the new resources, and retried calico..
that worked [11:24:05] so far the pods are up, but something strange happens if we try to kill them - they initially error out, and then bootstrap correctly after a while [11:24:31] something similar happened for coredns, one out of 4 pods came up fine, the rest failed (and helm rolled back after a bit) [11:25:36] Summary - no idea why this is happening, I hope I haven't messed up calico's state with the error highlighted above. In case I'll offer to re-run the upgrade cookbook :) [11:25:53] I am going afk for lunch + some errands, will check later to see if we can find what's wrong [14:21:43] updated the rdf-streaming-updater chart again if anyone has time to look https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/889172 [14:32:51] jayme: interesting, I see similar issues on ml-staging-codfw with calico [14:33:37] one calico pod is crashlooping, one calico-typha errors out [14:33:52] I followed the steps on the wiki for syncs [14:34:11] rbac, pod-security-policies, namespaces, calico-crds, calico [14:37:06] same weird log on calico pods: [14:37:06] startup/startup.go 442: Connection to the datastore is unauthorized error=connection is unauthorized: Unauthorized [14:40:33] seems similar to https://github.com/projectcalico/calico/issues/5712 [14:41:25] but fixed in 3.23.2 afaics so we should be good [14:41:32] I bet it is an operator issue [14:41:33] :D [14:45:21] \o [14:50:47] elukey: sorry, got pulled into other things. I can take a closer look now [14:51:03] elukey: aux is still in that state as well, right? [14:51:21] so I could use that for digging so we don't step on each other's toes? [14:53:25] jayme: yes yes correct! [14:53:32] ack [14:58:32] elukey: fwiw staging-eqiad does not show that behaviour [14:59:05] hey! elukey thanks for finishing, I had noted the issue with kubectl on the ctrl nodes in aux, but it was too late in the day to start debugging [15:00:38] cdanis: o/ not sure if I did something good or bad, my cluster and yours are in a bad state now :D [15:00:39] hi chris o/ [15:00:46] meanwhile the ones touched by jayme are good [15:00:49] elukey: it just means we will learn something new [15:00:52] this explains a lot of things I know [15:01:19] did you get to the root of the kubectl issue btw elukey or did you just skip the step? [15:01:33] * elukey sees cdanis taking a note "never ever let Luca use any infra that I work on" [15:01:55] cdanis: it was fixed by Janis, there was a missing profile, all good now [15:01:58] jayme any chance you could take a look at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/889172 ? dcausse and I are meeting about it now [15:02:04] ok I figured it must be just a missing puppet profile [15:02:33] elukey: please, who am i to judge, i also enjoy being a professional chaos monkey [15:03:23] anyway let me know if I can help with the troubleshooting [15:08:03] inflatador: what's the reason for setting the limitranger to 4.5Gi memory while the actual limit is 4Gi? [15:09:34] jayme: to have more room if 4Gi ends up being not enough [15:10:30] we can increase directly to 4.5Gi in the service helmfile values if this makes more sense [15:11:58] dcausse: I was actually thinking that the extra 500M is not a lot of room compared to the 1500M you bump the container limit by :) [15:12:49] but given that the pod limit is 5Gi it probably makes kinda sense [15:13:40] inflatador: you need help deploying the admin_ng change?
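For the LimitRange discussion just above, a quick way to compare what the namespace's LimitRange enforces against what a deployed pod actually requests. The namespace name and pod placeholder here are assumptions for illustration, not taken from the log:

  # Show the max/default container and pod limits enforced in the namespace.
  kubectl -n rdf-streaming-updater describe limitrange

  # Show the limits actually set on a running pod's containers for comparison.
  kubectl -n rdf-streaming-updater get pod <pod-name> \
    -o jsonpath='{.spec.containers[*].resources.limits}'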
[15:15:31] jayme yes if you have time [15:15:42] we're in https://meet.google.com/qve-fycn-vpw if you wanna join [15:16:36] elukey: I'm a bit afraid realising (again) that we run the wikikube staging clusters with single control plane instances...that might be the difference we're looking for [15:18:26] also: did you see the ""GuaranteedUpdate etcd3" type:*coordination.Lease" messages from kube-apiserver? It seems as if it is unhappy with the fact that they take >500ms [15:21:24] ah no didn't see it [15:21:27] I keep staring at [15:21:27] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]" [15:21:50] in the kube-api logs, but not sure what it can relate to [15:22:43] on calico's pod logs I can clearly see problems trying to fetch a k8s node's data, so I am wondering if something like PKI certs for the control plane could cause this [15:22:56] totally speaking out loud, it may not really be anything useful [15:27:54] jayme: just to double check, was there anything to change on the puppet private side? [15:27:59] maybe a new setting etc.. [15:29:51] yeah, I was assuming something along those lines yes - maybe a race with multiple masters [15:30:10] will be back to it in a couple of minutes if inflatador does not break wikikube :-p [15:30:23] {◕ ◡ ◕} [15:39:40] dcausse | inflatador: for context on the jre stuff https://phabricator.wikimedia.org/T327799 [15:39:47] thanks! [15:41:09] elukey: to your actual question: no. I did not do any private puppet changes (apart from the pki related ones) [15:42:26] and adding the prometheus infrastructure user to the system:monitoring group [15:42:41] ack perfect, I am still following up on my theory that it may be an operator error :) [15:42:47] * elukey bbiab [15:46:50] elukey: hmm...I've deleted a calico-node pod on aux and it came back immediately :D [15:47:50] or did I totally misunderstand and the "Connection to the datastore is unauthorized" messages were only from typha? [15:48:17] ah, no... calico-node as well [16:23:35] jayme: it usually bootstraps correctly but initially I see crashloop/Error etc.. [16:23:53] elukey: yeah...I think I'm on the right track [16:24:09] which is a very dark one, leading nowhere nice [16:25:01] if you want to brainbounce lemme know [16:27:20] as this does not seem to happen with only one ctrl node, I //think// it might be that service-account-key-file and service-account-signing-key-file need to be the same on all ctrl nodes [16:28:01] ah so depending on what master the pod talks to, it can fail or not [16:28:07] is that the line of thinking? [16:28:13] yep [16:28:20] (so in wikikube staging the issue doesn't appear) [16:28:24] okok makes sense [16:28:25] yep [16:30:13] we could depool one of the master nodes from lvs to validate the theory [16:30:31] or just stop all k8s components [16:30:37] also yes [16:30:47] i already disabled puppet on aux 1002 [16:31:54] and stopped apiserver, controller-manager and scheduler there [16:32:14] we can also try to deploy coredns too [16:32:20] when i tried it failed [16:34:34] it already led to at least one calico-node starting to fail [16:34:44] so I'd say it's confirmed [16:34:45] shit [16:36:01] :( [16:37:32] so, action item definitely is: add a second master to wikikube staging clusters :-| [16:37:32] is there any indication in the docs that the key-file(s) need to be the same on all master nodes?
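One way to test the theory discussed above (a token verifying against one master but not the other) without stopping components: mint a service-account token and present it to each control-plane apiserver directly, bypassing the LVS VIP. The hostnames, port 6443 and the kube-system/default service account are assumptions for illustration; on 1.23 a legacy token can still be read from the auto-created SA secret:

  # Grab a legacy service-account token, signed by the active kube-controller-manager.
  SECRET=$(kubectl -n kube-system get sa default -o jsonpath='{.secrets[0].name}')
  TOKEN=$(kubectl -n kube-system get secret "$SECRET" -o jsonpath='{.data.token}' | base64 -d)

  # Present the same token to each control-plane node directly. If one returns 200
  # and the other 401, that matches the "depends on which master the pod talks to"
  # behaviour seen in the logs.
  for ctrl in ctrl1001.example.org ctrl1002.example.org; do
    curl -sk -o /dev/null -w "${ctrl}: %{http_code}\n" \
      -H "Authorization: Bearer ${TOKEN}" "https://${ctrl}:6443/api"
  done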
[16:37:53] not really...but docs are sparse [16:37:59] lovely [16:38:10] https://v1-23.docs.kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/ ctrl+f for --service-account-key-file [16:38:35] I think that we need all (public) keys of all masters on all masters [16:38:47] multiple private keys //should// be fine [16:40:28] kube-controller-manager will use that private key (--service-account-private-key-file) to sign service account tokens [16:40:54] apiserver will use that private key (--service-account-signing-key-file) to issue ID tokens [16:43:31] there is also some documentation on https://kubernetes.io/docs/setup/best-practices/certificates/ [16:46:18] to understand - the goal of having all public keys on all masters is that if, for example, kube-controller-manager on 1001 signs a token with its private key then the other master can verify it when a request arrives with that token [16:46:39] (this part of k8s is still a bit foggy to me) [16:47:26] yes [16:47:36] okok yes it makes sense [16:47:54] obviously there is no way to get "this other public key" from PKI [16:49:02] future problem...let's try to verify [16:59:11] elukey: all up with both public certs [17:03:52] jayme: nice finding :) [17:04:11] so you just appended the public cert to the file? [17:04:23] pretty heavy miss in the first place, though :| [17:04:42] --service-account-key-file can be specified multiple times, that's what I did [17:05:01] ah okok like it was described in the docs, now I get it [17:05:12] don't worry it is just a staging problem, we need multiple masters in there [17:05:28] I wouldn't have guessed that the problem was this one to be honest :) [17:05:36] so good job anyway in debugging :) [17:05:41] y3 [17:05:45] *<3 [17:05:49] jayme: did you do it manually or do you have a puppet patch? [17:05:59] now the main issue IIUC is how to create a bundle from PKI [17:06:01] cdanis: manually currently [17:06:24] (need to step afk, will read later!) [17:06:27] I'm not sure how to transport the public key(s) between servers [17:06:48] this is a keypair generated by cfssl? [17:06:55] and now I start to wonder why this issue did not pop up during key rotations... [17:06:58] cdanis: yes [17:07:19] all of the certs related to k8s components are coming from PKI now [17:08:29] I can imagine doing some things with puppet exported resources perhaps but jbond probably has better ideas [17:08:56] (and that would still potentially have bootstrapping problems) [17:09:13] (the usual issue of "ok you ran puppet everywhere? ok run it again") [17:10:38] yeah :/ [17:47:50] If nobody objects I'll leave the aux cluster in the manually patched state with puppet disabled (on ctrl nodes) for my version of today [17:48:03] will get back to this tomorrow [17:48:10] (my version of tomorrow :)) [18:16:10] sounds good jayme! I posted a quick summary at https://phabricator.wikimedia.org/T329633#8619685 for posterity [18:16:15] hopefully I got most of it right [18:18:07] right enough :) - thanks! [18:18:35] I'm out for now. ttyl
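For reference, a sketch of the shape of the fix jayme describes above: each kube-apiserver keeps signing with its own private key, but is given the token-verification public keys of all control-plane nodes, since --service-account-key-file may be repeated. The key paths and host numbering are placeholders, not the actual puppet-managed layout, and only the service-account-related flags are shown:

  # kube-apiserver on ctrl1001 (all other flags omitted; paths hypothetical):
  kube-apiserver \
    --service-account-issuer=https://kubernetes.default.svc \
    --service-account-signing-key-file=/etc/kubernetes/pki/sa-ctrl1001.key \
    --service-account-key-file=/etc/kubernetes/pki/sa-ctrl1001.pub \
    --service-account-key-file=/etc/kubernetes/pki/sa-ctrl1002.pub

  # kube-controller-manager on the same node signs legacy service-account tokens
  # with the matching private key, so tokens it issues verify on either apiserver:
  kube-controller-manager \
    --service-account-private-key-file=/etc/kubernetes/pki/sa-ctrl1001.key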