[08:54:57] cdanis: ack. Please share how it went :-) [09:04:58] jayme: I am progressing with the cookbook, almost done (two worker nodes were left, now it is finishing the last one) [09:05:14] weird thing - kubernetes-client (and hence kubectl) was not installed on aux ctrl nodes [09:05:15] elukey: nice! [09:05:23] (so the verification step failed) [09:05:24] not so nice [09:05:30] the rest looks fine though [09:06:34] you say on aux ctrl nodes... on ml-staging ctrl nodes it got installed? [09:06:43] yes [09:06:50] just to add more weirdness [09:06:59] hmpf [09:08:32] I'll take a look [09:09:26] ml-staging-ctrl* still got 1.20.5+really1.20.2-1 though...:/ [09:10:04] ah, you've not reimaged them after I merged the packaging patch [09:10:41] akosiaris: o/ IIRC you did some etcd reimages with confXXXX nodes, I have a question about etcd. I naively thought that if a node is reimaged then it auto recovers when it starts again (so it asks the other nodes for the raft log and rejoins the cluster). What I found from my tests in staging is that the reimaged node refuses to restart, since the raft log is "behind" the rest of the cluster [09:10:47] (since it is empty). The k8s upgrade cookbook uses the "one-node-at-the-time" etcd reimage, but to make it work I have to (manually) remove the reimaged node via etcdctl and re-add it (to allow it to appear as new and fetch the raft log from the other nodes). Is there anything that I am missing? [09:11:07] jayme: I am doing it this morning, if the etcd cluster allows me :D [09:11:28] elukey: ack [09:12:54] dcausse: o/ will you be around on 2023-02-21 to assist with properly starting and stopping rdf-streaming-updater in codfw? [09:13:25] properly as in "the tested way" :) [09:13:32] jayme: o/ yes I should be here! :) [09:13:35] https://phabricator.wikimedia.org/T329664 [09:13:39] dcausse: cool [09:18:23] made that explicit in the action item list [09:27:19] elukey: role::aux_k8s::master does not include profile::kubernetes::client - so all good [09:32:47] jayme: ah nice! Didn't see it.. Should it have it? [09:32:58] anyway, aux's cookbook run completed! [09:33:51] yay [09:34:38] it's not really required on masters tbh but I think it's nice to have it around just in case (like deployment server failing - whatever) [09:34:42] it should not hurt [09:35:26] we had a comment in some of the roles that it is not required, though. I think that's why it was left out when creating aux [09:35:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/889486 [09:39:20] +1ed [09:41:19] ahah, pcc Hosts: auto choose to not do aux ;p [09:45:08] ...where it failed for different reasons [10:04:44] jayme: just to confirm, https://wikitech.wikimedia.org/w/index.php?title=Kubernetes/Clusters/New#Label_Kubernetes_Masters is still the same on 1.23 right? [10:05:08] yeah, no change there [10:06:11] and what about the worker nodes? I was trying to figure out what to do for the aux ones (ganeti vms) [10:11:31] ah lovely they are already added [10:11:34] perfect [10:13:04] yeah, the topology labels are auto-generated from netbox data now [10:13:42] ok so we should add a note about it in the docs [10:13:46] but the kubelet is not allowed to set the role [10:13:50] yeah, indeed [10:13:57] (I am going through the list for aux right now) [10:14:02] I've not updated the docs at all tbh [10:18:32] ah lovely aux uses calico-cni?
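A minimal sketch of the manual etcd member cycling elukey describes earlier in this excerpt (remove the reimaged node, re-add it so it joins as a fresh member and syncs the raft log from its peers). The endpoint and hostnames below are placeholders, not the real cluster names; the commands are standard etcdctl v3 usage rather than anything taken from the cookbook itself:

  # Find the member ID of the reimaged node; its local raft log is gone, so it
  # cannot rejoin under its old identity.
  etcdctl --endpoints=https://etcd1001.example.org:2379 member list

  # Remove the stale member from the cluster...
  etcdctl --endpoints=https://etcd1001.example.org:2379 member remove <MEMBER_ID>

  # ...and re-add it as a new member, so that on startup it pulls the raft log
  # from its peers instead of refusing to start with an out-of-date log.
  etcdctl --endpoints=https://etcd1001.example.org:2379 member add etcd1003 \
    --peer-urls=https://etcd1003.example.org:2380

  # On the reimaged node itself, etcd then needs to start with
  # initial-cluster-state=existing so it joins the existing cluster rather than
  # trying to bootstrap a new one.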
[10:19:15] User "calico-cni" cannot list resource "ipreservations" in API group "crd.projectcalico.org" at the cluster scope [10:19:55] every cluster uses calico-cni [10:20:25] elukey: is the proper calico chart version installed? [10:21:59] jayme: of course not, my bad, mixed it up with the ml things [10:22:04] lemme merge the change [10:23:10] jayme: is it ok if I roll out the new calico-crds and calico charts on top of the old ones, or should I remove it first? (forgot that step, I thought the change was merged already) [10:23:50] I can wait for it to fail in theory [10:24:43] you should wait for the rollback I'd say or you might end up in an undefined state situation [10:24:56] yes yes my bad [10:25:08] (earlier on I confused the calico-cni with the istio-cni) [10:25:44] ETOOMANYTHINGSTOREMEMBER :) [10:27:17] jayme: one last thing - calico-crds went fine of course, should I do some clean up for them too first? [10:28:00] elukey: should not be required. Helm will update them with the new version [10:28:24] super [10:29:40] (apologies to infra foundations for the mess :) [10:40:21] jayme: calico pods up :) [10:40:40] nice [10:48:32] jayme: one weird thing about coredns - I see [10:48:38] [INFO] plugin/ready: Still waiting on: "kubernetes" [10:48:48] one pod is running, the rest still logging the above [10:48:54] have you seen it before? [10:50:54] hmm...no [10:51:02] seems as if it is still waiting for the kubernetes api? [10:51:15] I would assume the same, yes [10:52:41] but that does not make a whole lot of sense... [10:53:02] I tried to kill a calico pod and now I get an error, so maybe this is the issue [10:54:22] 2023-02-15 10:53:29.292 [WARNING][9] startup/startup.go 442: Connection to the datastore is unauthorized error=connection is unauthorized: Unauthorized [10:54:35] from the calico-node-24qs2 log [10:55:40] yes and on the kube api I see [10:55:41] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]" [10:55:59] not sure if I forgot anything [10:56:51] now it works.. did you do something? [11:00:01] nope [11:00:11] just looking around :-) [11:01:28] tried to kill another one, and it shows the same error [11:01:43] that's strange... [11:01:51] you could try staging-codfw for reference [11:02:08] after a bit it works [11:02:10] lemme try [11:04:19] can't repro there [11:10:23] elukey: did you do https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878946/ ? Although I don't see why that should be a problem [11:11:37] mmm nope [11:11:46] I tried to kill calico-typha though [11:11:47] and I get [11:11:48] Failed to determine migration requirements error=unable to query ClusterInformation to determine Calico version: connection is unauthorized: Unauthorized [11:12:20] that feels as if service account keys are not signed properly...or not updated in time [11:12:32] (relocating brb) [11:21:28] now they are all up..mmm [11:23:03] jhathaway, cdanis: o/ - status update for aux-k8s on 1.23 - I tried to help and completed the cookbook started by Chris, plus I followed https://wikitech.wikimedia.org/w/index.php?title=Kubernetes/Clusters/New up to the Calico steps. Due to me being stupid I forgot to merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/889194 before starting, so the calico pods were not coming [11:23:09] up at first try.. [11:23:31] helm rolled back, I updated all the previous steps (including calico-crds) with the new resources, and retried calico..
that worked [11:24:05] so far the pods are up, but something strange happens if we try to kill them - they initially error out, and then bootstrap correctly after a while [11:24:31] something similar happened for coredns, one out of 4 pods came up fine, the rest failed (and helm rolled back after a bit) [11:25:36] Summary - no idea why this is happening, I hope I haven't messed up calico's state with the error highlighted above. In case I'll offer to re-run the upgrade cookbook :) [11:25:53] I am going afk for lunch + some errands, will check later to see if we can find what's wrong [14:21:43] updated the rdf-streaming-updater chart again if anyone has time to look https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/889172 [14:32:51] jayme: interesting, I see similar issues on ml-staging-codfw with calico [14:33:37] one calico pod is crashlooping, one calico-typha errors out [14:33:52] I followed the steps on the wiki for syncs [14:34:11] rbac, pod-security-policies, namespaces, calico-crds, calico [14:37:06] same weird log on calico pods: [14:37:06] startup/startup.go 442: Connection to the datastore is unauthorized error=connection is unauthorized: Unauthorized [14:40:33] seems similar to https://github.com/projectcalico/calico/issues/5712 [14:41:25] but fixed in 3.23.2 afaics so we should be good [14:41:32] I bet it is an operator issue [14:41:33] :D [14:45:21] \o [14:50:47] elukey: sorry, got pulled into other things. I can take a closer look now [14:51:03] elukey: aux is still in that state as well, right? [14:51:21] so I could use that for digging so we don't step on each other's toes? [14:53:25] jayme: yes yes correct! [14:53:32] ack [14:58:32] elukey: fwiw staging-eqiad does not show that behaviour [14:59:05] hey! elukey thanks for finishing, I had noted the issue with kubectl on the ctrl nodes in aux, but it was too late in the day to start debugging [15:00:38] cdanis: o/ not sure if I did something good or bad, my cluster and yours are in a bad state now :D [15:00:39] hi chris o/ [15:00:46] meanwhile the ones touched by jayme are good [15:00:49] elukey: it just means we will learn something new [15:00:52] this explains a lot of things I know [15:01:19] did you get to the root of the kubectl issue btw elukey or did you just skip the step? [15:01:33] * elukey sees cdanis taking a note "never ever let Luca use any infra that I work on" [15:01:55] cdanis: it was fixed by Janis, there was a missing profile, all good now [15:01:58] jayme any chance you could take a look at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/889172 ? dcausse and I are meeting about it now [15:02:04] ok I figured it must be just a missing puppet profile [15:02:33] elukey: please, who am i to judge, i also enjoy being a professional chaos monkey [15:03:23] anyway let me know if I can help with the troubleshooting [15:08:03] inflatador: what's the reason for setting the limitranger to 4.5Gi memory while the actual limit is 4Gi? [15:09:34] jayme: to have more room if 4Gi ends up being not enough [15:10:30] we can increase directly to 4.5Gi in the service helmfile values if this makes more sense [15:11:58] dcausse: I was actually thinking that the extra 500M is not a lot of room compared to the 1500M you bump the container limit by :) [15:12:49] but given that the pod limit is 5Gi it probably makes kinda sense [15:13:40] inflatador: you need help deploying the admin_ng change?
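For the LimitRange discussion just above, a quick way to compare what the namespace's LimitRange enforces against what a deployed pod actually requests. The namespace name and pod placeholder here are assumptions for illustration, not taken from the log:

  # Show the max/default container and pod limits enforced in the namespace.
  kubectl -n rdf-streaming-updater describe limitrange

  # Show the limits actually set on a running pod's containers for comparison.
  kubectl -n rdf-streaming-updater get pod <pod-name> \
    -o jsonpath='{.spec.containers[*].resources.limits}'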
[15:15:31] jayme yes if you have time [15:15:42] we're in https://meet.google.com/qve-fycn-vpw if you wanna join [15:16:36] elukey: I'm a bit afraid realising (again) that we run the wikikube staging clusters with single control plane instances...that might be the difference we're looking for [15:18:26] also: did you see the ""GuaranteedUpdate etcd3" type:*coordination.Lease" messages from kube-apiserver? It seems as if it is unhappy with the fact that they take >500ms [15:21:24] ah no didn't see it [15:21:27] I keep staring at [15:21:27] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]" [15:21:50] in the kube-api logs, but not sure what it can relate to [15:22:43] on calico's pod logs I can clearly see problems trying to fetch a k8s node's data, so I am wondering if something like PKI certs for the control plane could cause this [15:22:56] totally speaking out loud, it may not really be anything useful [15:27:54] jayme: just to double check, was there anything to change on the puppet private side? [15:27:59] maybe a new setting etc.. [15:29:51] yeah, I was assuming something along those lines yes - maybe a race with multiple masters [15:30:10] will be back to it in a couple of minutes if inflatador does not break wikikube :-p [15:30:23] {◕ ◡ ◕} [15:39:40] dcausse | inflatador: for context on the jre stuff https://phabricator.wikimedia.org/T327799 [15:39:47] thanks! [15:41:09] elukey: to your actual question: no. I did not do any private puppet changes (apart from the pki related ones) [15:42:26] and adding the prometheus infrastructure user to the system:monitoring group [15:42:41] ack perfect, I am still following up on my theory that it may be an operator error :) [15:42:47] * elukey bbiab [15:46:50] elukey: hmm...I've deleted a calico-node pod on aux and it came back immediately :D [15:47:50] or did I totally misunderstand and the "Connection to the datastore is unauthorized" messages were only from typha? [15:48:17] ah, no... calico-node as well [16:23:35] jayme: it usually bootstraps correctly but initially I see crashloop/Error etc.. [16:23:53] elukey: yeah...I think I'm on the right track [16:24:09] which is a very dark one, leading nowhere nice [16:25:01] if you want to brainbounce lemme know [16:27:20] as this does not seem to happen with only one ctrl node, I //think// it might be that service-account-key-file and service-account-signing-key-file need to be the same on all ctrl nodes [16:28:01] ah so depending on what master the pod talks to, it can fail or not [16:28:07] is that the line of thinking? [16:28:13] yep [16:28:20] (so in wikikube staging the issue doesn't appear) [16:28:24] okok makes sense [16:28:25] yep [16:30:13] we could depool one of the master nodes from lvs to validate the theory [16:30:31] or just stop all k8s components [16:30:37] also yes [16:30:47] i already disabled puppet on aux 1002 [16:31:54] and stopped apiserver, controller-manager and scheduler there [16:32:14] we can also try to deploy coredns too [16:32:20] when i tried it failed [16:34:34] it already led to at least one calico-node starting to fail [16:34:44] so I'd say it's confirmed [16:34:45] shit [16:36:01] :( [16:37:32] so, action item definitely is: add a second master to wikikube staging clusters :-| [16:37:32] is there any indication in the docs that the key-file(s) need to be the same on all master nodes?
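One way to test the theory discussed above (a token verifying against one master but not the other) without stopping components: mint a service-account token and present it to each control-plane apiserver directly, bypassing the LVS VIP. The hostnames, port 6443 and the kube-system/default service account are assumptions for illustration; on 1.23 a legacy token can still be read from the auto-created SA secret:

  # Grab a legacy service-account token, signed by the active kube-controller-manager.
  SECRET=$(kubectl -n kube-system get sa default -o jsonpath='{.secrets[0].name}')
  TOKEN=$(kubectl -n kube-system get secret "$SECRET" -o jsonpath='{.data.token}' | base64 -d)

  # Present the same token to each control-plane node directly. If one returns 200
  # and the other 401, that matches the "depends on which master the pod talks to"
  # behaviour seen in the logs.
  for ctrl in ctrl1001.example.org ctrl1002.example.org; do
    curl -sk -o /dev/null -w "${ctrl}: %{http_code}\n" \
      -H "Authorization: Bearer ${TOKEN}" "https://${ctrl}:6443/api"
  done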
[16:37:53] not really...but docs are sparse [16:37:59] lovely [16:38:10] https://v1-23.docs.kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/ ctrl+f for --service-account-key-file [16:38:35] I think that we need all (public) keys of all masters on all masters [16:38:47] multiple private keys //should// be fine [16:40:28] kube-controller-manager will use that private key (--service-account-private-key-file) to sign service account tokens [16:40:54] apiserver will use that private key (--service-account-signing-key-file) to issue ID tokens [16:43:31] there is also some documentation on https://kubernetes.io/docs/setup/best-practices/certificates/ [16:46:18] to understand - the goal of having all public keys on all masters is that if, for example, kube-controller-manager on 1001 signs a token with its private key then the other master can verify it when a request arrives with that token [16:46:39] (this part of k8s is still a bit foggy to me) [16:47:26] yes [16:47:36] okok yes it makes sense [16:47:54] obviously there is no way to get "this other public key" from PKI [16:49:02] future problem...let's try to verify [16:59:11] elukey: all up with both public certs [17:03:52] jayme: nice finding :) [17:04:11] so you just appended the public cert to the file? [17:04:23] pretty heavy miss in the first place, though :| [17:04:42] --service-account-key-file can be specified multiple times, that's what I did [17:05:01] ah okok like it was described in the docs, now I get it [17:05:12] don't worry it is just a staging problem, we need multiple masters in there [17:05:28] I wouldn't have guessed that the problem was this one to be honest :) [17:05:36] so good job anyway in debugging :) [17:05:41] y3 [17:05:45] *<3 [17:05:49] jayme: did you do it manually or do you have a puppet patch? [17:05:59] now the main issue IIUC is how to create a bundle from PKI [17:06:01] cdanis: manually currently [17:06:24] (need to step afk, will read later!) [17:06:27] I'm not sure how to transport the public key(s) between servers [17:06:48] this is a keypair generated by cfssl? [17:06:55] and now I start to wonder why this issue did not pop up during key rotations... [17:06:58] cdanis: yes [17:07:19] all of the certs related to k8s components are coming from PKI now [17:08:29] I can imagine doing some things with puppet exported resources perhaps but jbond probably has better ideas [17:08:56] (and that would still potentially have bootstrapping problems) [17:09:13] (the usual issue of "ok you ran puppet everywhere? ok run it again") [17:10:38] yeah :/ [17:47:50] If nobody objects I'll leave the aux cluster in the manually patched state with puppet disabled (on ctrl nodes) for my version of today [17:48:03] will get back to this tomorrow [17:48:10] (my version of tomorrow :)) [18:16:10] sounds good jayme! I posted a quick summary at https://phabricator.wikimedia.org/T329633#8619685 for posterity [18:16:15] hopefully I got most of it right [18:18:07] right enough :) - thanks! [18:18:35] I'm out for now. ttyl
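For reference, a sketch of the shape of the fix jayme describes above: each kube-apiserver keeps signing with its own private key, but is given the token-verification public keys of all control-plane nodes, since --service-account-key-file may be repeated. The key paths and host numbering are placeholders, not the actual puppet-managed layout, and only the service-account-related flags are shown:

  # kube-apiserver on ctrl1001 (all other flags omitted; paths hypothetical):
  kube-apiserver \
    --service-account-issuer=https://kubernetes.default.svc \
    --service-account-signing-key-file=/etc/kubernetes/pki/sa-ctrl1001.key \
    --service-account-key-file=/etc/kubernetes/pki/sa-ctrl1001.pub \
    --service-account-key-file=/etc/kubernetes/pki/sa-ctrl1002.pub

  # kube-controller-manager on the same node signs legacy service-account tokens
  # with the matching private key, so tokens it issues verify on either apiserver:
  kube-controller-manager \
    --service-account-private-key-file=/etc/kubernetes/pki/sa-ctrl1001.key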