[09:01:35] blancadesal: I'm available for the k8s upgrade on toolsbeta, just ping me
[09:02:45] arturo: I'm fine with going ahead with the remaining hosts on my own, as long as you're around if anything unexpected happens
[09:04:31] I'll just finish up watching yesterday's staff meeting, I can ping you when I start if that's ok with you
[09:05:58] blancadesal: excellent :-)
[09:06:09] I'll stand by -- I'm preparing Kyverno demo for next week
[09:06:18] nice!
[09:37:07] arturo: I will start with toolsbeta-test-k8s-worker-11 shortly
[09:37:45] 👍
[09:46:39] arturo: going ahead with the remaining nfs worker nodes now
[09:46:44] ack
[09:53:58] arturo: when doing the ingress node yesterday, did we scale down ingress-nginx-gen2-controller from 3 to 2 replicas before?
[09:54:18] no
[09:54:30] we skipped that part
[09:54:43] (or, missed)
[09:55:35] prob my fault, forgot to copy-paste that part to the etherpad
[09:56:09] no problem, is not really that important, even less in toolsbeta
[09:56:11] the thing is
[09:56:22] the nginx-ingress pods take a long time to stop/start
[09:56:27] they are usually very busy
[09:56:56] the theory by taavi was that downscaling the pod beforehand would make the upgrade more graceful for endusers
[09:57:14] because otherwise the cookbook just kills without mercy the pod
[09:57:27] maybe we can skip that part on toolsbeta then?
[09:57:29] and there could be some requests in flight failing because the pod is no longer alive
[09:58:07] yeah, I think skipping for toolsbeta is fine
[09:58:20] 👍
[10:00:02] also, the behavior that I'm describing is likely changing when the system evolves (new version of k8s, nginx, kube-proxy, etc)
[10:00:21] so at some point we should check if the failure mode is still a thing, or not
[10:00:30] arturo: unrelated: how could we help this user? https://toolsadmin.wikimedia.org/tools/membership/status/1756
[10:01:24] mmm
[10:01:28] that's unexpected
[10:02:14] let's approve again, maybe the whatever approval workflow failed, and approving again will trigger it again
[10:02:58] ack
[10:03:18] thanks
[10:06:30] arturo: this request seems to have gotten stuck/forgotten, do you know if there is something blocking it? T364761
[10:06:32] T364761: Request for access for user dr0ptp4kt for 'admin' tool - https://phabricator.wikimedia.org/T364761
[10:10:44] blancadesal: I don't think there is any blocker
[10:15:57] arturo: so if I understand right, I need to add the user to here: https://toolsadmin.wikimedia.org/tools/id/admin and also to the "roots" sudoers group via Horizon as per bd80.8 comment?
[10:16:10] yeah
[10:17:18] blancadesal: I see all workers are now in 1.25 🎉
[10:17:37] ingress too!
[10:19:23] the dump & load test has been consistently failing during the worker upgrade, except on the first run pre-upgrade
[10:19:41] doing another run now that things seem to have stabilized
[10:21:46] maybe jobs-api pods are not reporting its internal state well to k8s, and they are being given requests before they are ready
[10:22:00] still failing:
[10:22:11] https://www.irccloud.com/pastebin/aVP5KD3G/
[10:22:54] mmmm
[10:22:57] could that be a missing cleanup?
[10:23:16] blancadesal: try again
[10:23:25] there was a dangling job defined
[10:23:27] https://www.irccloud.com/pastebin/wYzHSPoL/
[10:24:05] maybe bad timing when this job got evicted?
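
For reference, a minimal sketch of how a dangling pod stuck in Terminating (like the one discussed just above) can be inspected with kubectl. The namespace and pod name are placeholders, not taken from the pastebins, and in this case the eventual fix turned out to be different (see below):

    # NAMESPACE is a placeholder for the actual tool namespace on toolsbeta.
    NAMESPACE=tool-example

    # Pods stuck in Terminating keep their deletionTimestamp but never go away;
    # the STATUS column shows "Terminating".
    kubectl -n "$NAMESPACE" get pods

    # Events on the pod usually show the eviction and any admission webhook errors.
    kubectl -n "$NAMESPACE" describe pod <pod-name>

    # If the containers are already gone on the worker (cf. the crictl check below),
    # the object can be removed without waiting for the kubelet, as a last resort:
    kubectl -n "$NAMESPACE" delete pod <pod-name> --grace-period=0 --force
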
[10:26:30] toolforge jobs flush/delete isn't getting rid of it
[10:26:49] mmm
[10:26:52] that's unexpected
[10:27:36] well, it is in terminating state
[10:29:17] let me check the state of the worker node
[10:30:47] the pod is not running in the worker node
[10:30:51] aborrero@toolsbeta-test-k8s-worker-nfs-2:~$ sudo -i crictl ps
[10:32:36] https://www.irccloud.com/pastebin/yvOMG2T6/
[10:33:20] there seems to be some problem with the envvars admission?
[10:35:03] https://www.irccloud.com/pastebin/lVCL7zxe/
[10:35:53] what makes you think envvars admission is the issue?
[10:37:17] the error suggests there is an invalid patch being applied to the pod
[10:37:23] it seems there's an attempt to add envvars to the pod spec after creation, which is forbidden?
[10:37:30] and the patch is about adding envvars
[10:37:35] yeah
[10:38:34] could this be an artifact of one of the tests being interrupted during the upgrade, with the pod already existing the next time the test ran?
[10:42:58] I don't know
[10:45:50] I'm deleting the few defined envvars, to see if that makes any difference
[10:46:14] https://www.irccloud.com/pastebin/sfEP24it/
[10:46:35] ok
[10:48:58] the envvars admission webhook is triggered for pod updates:
[10:48:58] https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/blob/main/deployment/chart/templates/webhook.yaml.tpl?ref_type=heads#L22
[10:49:25] this may not be correct, because as reported by the k8s system, once defined, pods have most of their fields immutable
[10:50:05] so injecting envvars on a pod UPDATE operation may result in the violation we are seeing now
[10:51:11] interesting
[10:51:41] citation: https://kubernetes.io/docs/concepts/workloads/pods/#pod-update-and-replacement
[10:51:49] this is also supported by https://stackoverflow.com/questions/77323629/admission-webhook-pod-update
[10:53:25] that seems to be it. interesting that we have never run into this issue before
[10:54:36] what is updating pods?
[10:55:23] * dcaro fades away and unloads the ffx tab
[10:56:07] https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/merge_requests/8
[10:56:26] this may have been copy-pasted, so other admission controllers may have the same thing
[10:59:21] https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/13
[10:59:38] why is CI so slow?
[11:00:33] :-( I don't know
[11:05:02] blancadesal: BTW deleting the ennvars has resulted in the pod being cleaned up normally by the system
[11:05:15] nice
[11:05:50] could you please run the functional tests again?
[11:08:00] I'm creating a ticket
[11:08:01] sure
[11:08:12] T369890
[11:08:13] T369890: toolforge: kubernetes fails to handle some pods that are being mutated by our admission controllers - https://phabricator.wikimedia.org/T369890
[11:15:19] arturo: tests run fine now
[11:15:28] great
[11:15:42] except for this: T369891
[11:15:43] T369891: [toolforge deploy] direct-api tests fail intermittently on toolsbeta - https://phabricator.wikimedia.org/T369891
[11:16:06] happens to me in 1/10 runs, approx
[11:17:03] this seems like a failure being reported by the api-gateway
[11:17:07] {
[11:17:07] "detail": "Connection error with backend API while fetching url https://builds-api.builds-api.svc.toolsbeta.local:8443/openapi.json."
[11:17:07] }
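
Going back to the admission webhook issue discussed at 10:48-10:50: a rough sketch of restricting the envvars mutating webhook so env vars are only injected at pod CREATE time. The webhook configuration name and rule index are assumptions, and the live kubectl patch is only for illustration; the actual fix is in the Helm chart (webhook.yaml.tpl) via the linked merge requests:

    # Inspect which operations the envvars admission webhook currently matches on.
    kubectl get mutatingwebhookconfigurations
    kubectl get mutatingwebhookconfiguration envvars-admission -o yaml   # name is an assumption

    # Sketch of limiting the rule to CREATE only: most pod fields are immutable after
    # creation, so mutating env vars on UPDATE triggers the violation seen above.
    kubectl patch mutatingwebhookconfiguration envvars-admission --type=json \
      -p='[{"op": "replace", "path": "/webhooks/0/rules/0/operations", "value": ["CREATE"]}]'
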
[11:18:07] only happens on toolsbeta, for some reason
[11:18:25] I would check the health of the builds-api pods
[11:19:02] golang-ci lint seems stuck again btw
[11:19:21] i'll go for lunch
[11:21:01] re the builds-api pods, haven't seen anything blatantly wrong. might explore a bit more later
[11:21:21] at any rate, the toolsbeta k8s upgrade is done!
[11:21:33] yeah! 🎉
[11:21:53] I will send the email about the tools upgrade now
[11:23:20] ahh, I was going to do it, but if you want to do it instead I'm not going to complain xd
[11:23:38] ok, I'm about to click the 'send' button heh
[11:23:47] would you like to review it?
[11:28:01] arturo: sure
[11:28:43] https://www.irccloud.com/pastebin/9n03qOJu/
[11:28:47] blancadesal: ^^^
[11:30:01] looks good, just the date should be 2024-07-16
[11:30:16] right, fixed and sending
[11:31:21] oh, I used the wrong From address :-(
[11:31:25] I'll need to resend most likely
[11:31:59] please moderators, don't approve the email
[11:32:00] https://www.irccloud.com/pastebin/xD4iFlyr/
[11:33:11] arturo: i've rejected it
[11:33:17] thanks
[11:39:01] the new email also requires approval
[11:39:18] it went through
[11:39:26] oh ok
[11:39:52] I got the 'needs approval' email too though
[11:40:00] it doesn't show up in https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/
[11:40:04] at the same time
[11:40:04] maybe you got the cloud@ copy
[11:40:17] I usually CC cloud@
[11:40:18] ah, maybe
[11:40:24] let me approve it then
[11:40:44] why do my emails require approval?
[11:42:34] i don't know, 'moderation action' says 'none'
[11:42:51] https://usercontent.irccloud-cdn.com/file/yh2y2ltM/Screenshot%202024-07-12%20at%2013.42.04.png
[11:43:36] mine looks like this though:
[11:43:52] https://usercontent.irccloud-cdn.com/file/ZSuObovy/Screenshot%202024-07-12%20at%2013.43.23.png
[11:44:09] * arturo brb
[11:46:01] you are also a list owner, so no idea why that happens
[11:46:50] * blancadesal goes for lunch, this time for real
[12:06:24] I dont think I have the password
[12:12:37] It's in pwstore I think
[12:13:32] ok
[12:31:39] there is an alert about puppet performing changes on every run on cloudcuimin1001
[12:31:48] it was because this diff
[12:31:52] https://www.irccloud.com/pastebin/xHXo2HaH/
[14:40:28] sorry about the stray debug line!
[16:11:21] np
[16:11:26] * arturo offline
[18:06:01] I think I fixed your cloud-announce list moderation bit a.rturo
[18:21:17] bd808: (since you are here on your day off) any idea if deployment-prep actually uses etcd for anything? There's a node there but it looks idle to me
[18:21:47] (that or v2/migration/snapshot is lying to me)
[18:24:14] andrewbogott: I don't know specifically, but I would expect that if it is in use it would be for both MediaWiki config (db mappings) and possibly CDN edge traffic shedding rules
[18:24:47] Hm, I wonder what the deal is with snap being empty?
[18:24:58] I assume if I just flip everything over to a new etcd node the result won't be great
[18:25:24] time to make a cluster I guess
[18:26:15] I thought I would upgrade a few more deployment-prep nodes myself to build up some moral authority before complaining that the other nodes aren't getting upgraded
[18:26:26] It is totally possible that the node is unused. I really don't know how deployment-prep works anymore. Those are just non-k8s things I know we use etcd for in prod.
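
A sketch of how the "is this etcd node actually used?" question above could be probed from the node itself. The endpoint, TLS flags, and key prefixes are assumptions; deployment-prep data might also still live in the old v2 keyspace, which would explain an empty v3 snapshot:

    # On the deployment-prep etcd node (endpoint and TLS options are assumptions):
    ETCDCTL_API=3 etcdctl --endpoints=https://localhost:2379 endpoint status -w table

    # List whatever keys exist in the v3 keyspace.
    ETCDCTL_API=3 etcdctl --endpoints=https://localhost:2379 get / --prefix --keys-only | head

    # MediaWiki/conftool-style data may still sit in the v2 keyspace, which the v3
    # snapshot/migration view would not show.
    ETCDCTL_API=2 etcdctl --endpoints=https://localhost:2379 ls --recursive / | head
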
[18:27:29] https://meta.wikimedia.beta.wmflabs.org/wiki/Main_Page definitely still works if I stop the node
[18:27:30] I would like to help with migrations there too. I just haven't devoted time to it yet.
[18:27:37] but that doesn't prove a lot
[18:28:34] I can't decide if in-place upgrades are the right solution or the wrong one. It feels like embracing tech debt to say "no one knows how to builds these or cares to learn" but it does solve the immediate problem...
[18:29:54] * andrewbogott takes a break rather than just switching things off for good to see who complains
[18:35:37] andrewbogott: I'm pretty sure etcd has some level of cache
[18:38:00] If it's cache then I really can just start from scratch
[18:39:30] Etcd itself isn't cache
[18:39:49] But I'm pretty sure mediawiki caches etcd a bit so it won't die straight away if it fails
[19:19:44] ah I see
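
A hypothetical version of the "stop the node and see if anything breaks" check described above. The etcd hostname and service name are placeholders, and as noted in the conversation, a short window of green responses does not prove much, since MediaWiki may keep serving etcd-derived config from a local cache for a while after etcd goes away:

    # Placeholder host; substitute the real deployment-prep etcd instance.
    ssh etcd-node.deployment-prep.eqiad1.wikimedia.cloud 'sudo systemctl stop etcd'

    # Poll the beta cluster main page for half an hour and watch the status codes.
    for i in $(seq 1 30); do
        curl -s -o /dev/null -w '%{http_code}\n' \
            'https://meta.wikimedia.beta.wmflabs.org/wiki/Main_Page'
        sleep 60
    done
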