[10:23:48] I've got an interesting situation with an eventgate-main pod that has been in a terminating state for over 5 days on the staging cluster. It's blocking a deploy to production.
[10:24:19] Should I simply `kubectl delete` the pod, or does it warrant further investigation?
[10:24:30] https://www.irccloud.com/pastebin/tlYarfMr/
[10:55:32] 32m Normal TaintManagerEviction pod/eventgate-production-78b98d8fb5-2h9fz Cancelling deletion of Pod eventgate-main/eventgate-production-78b98d8fb5-2h9fz
[10:55:32] 27m Normal TaintManagerEviction pod/eventgate-production-78b98d8fb5-2h9fz Marking for deletion Pod eventgate-main/eventgate-production-78b98d8fb5-2h9fz
[10:55:32] 56m Warning FailedScheduling pod/eventgate-production-8d5bb48bc-9z9vd 0/3 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
[10:55:32] 46m Warning FailedScheduling pod/eventgate-production-8d5bb48bc-9z9vd 0/3 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
[10:55:32] 46m Warning FailedScheduling pod/eventgate-production-8d5bb48bc-9z9vd skip schedule deleting pod: eventgate-main/eventgate-production-8d5bb48bc-9z9vd
[10:55:37] btullis ^
[10:55:48] I am not sure you'll be able to just delete it
[10:56:03] it was scheduled for deletion and then cancelled
[10:56:15] and then re-marked for deletion again?
[10:56:58] Ah, OK, thanks. I only touched it today with a normal `helmfile -e staging -i apply` - it looks like something happened 5 days ago though.
[10:57:00] ah, kubestage1004 is marked as NotReady
[10:57:59] yeah
[10:58:01] MemoryPressure Unknown Wed, 07 Jun 2023 08:22:21 +0000 Wed, 07 Jun 2023 08:24:29 +0000 NodeStatusUnknown Kubelet stopped posting node status.
[10:58:03] and more
[10:58:08] interesting
[10:59:58] something something cadvisor conflicting with kubelet, apparently?
[11:00:42] sudo systemctl cat kubelet |grep '^# /'
[11:00:42] # /lib/systemd/system/kubelet.service
[11:00:42] # /etc/systemd/system/cadvisor.service.d/puppet-override.conf
[11:00:45] ehm, what?
[11:01:56] elukey: is this what you witnessed with filippo ^?
[11:03:38] so, /etc/systemd/system/cadvisor.service -> /lib/systemd/system/kubelet.service
[11:03:42] this is the reason for it
[11:03:50] and then the /etc/systemd/system/cadvisor.service.d/puppet-override.conf says
[11:03:55] ExecStart=/usr/bin/cadvisor --listen_ip=10.64.48.106 --port=4194 --enable_metrics=accelerator,app,cpu,disk,diskIO,memory,network,oom_event,perf_event --docker=/dev/null
[11:03:59] and the kubelet doesn't start
[11:04:40] thankfully it's the only node that this happens on
[11:04:49] the other 88 nodes apparently don't have that
[11:04:54] 87
[11:07:19] well, thankfully we did not enable the cadvisor change across all of the kubernetes clusters, otherwise we'd be having tons of fun (not)
[11:11:49] ok, I put 2+2 together
[11:12:32] so, all kubernetes clusters, with the exception of staging, are now exempt from having cadvisor enabled
[11:12:57] across the fleet, in enabled clusters, only 20% of machines have it enabled
[11:13:10] kubestage1004 apparently rolled the d20 and got a 1, kubestage1003 did not
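For reference, the checks walked through above can be reproduced with standard kubectl and systemd tooling. This is only a rough sketch; the namespace, pod and node names are the ones from this incident and would need adjusting elsewhere:

  # Inspect the stuck pod and recent events in its namespace
  kubectl -n eventgate-main describe pod eventgate-production-78b98d8fb5-2h9fz
  kubectl -n eventgate-main get events --sort-by=.lastTimestamp

  # Check node status and conditions (NotReady, MemoryPressure Unknown, etc.)
  kubectl get nodes
  kubectl describe node kubestage1004

  # On the node itself, see which unit file and drop-ins systemd is actually using for the kubelet
  sudo systemctl cat kubelet | grep '^# /'
  sudo systemctl status kubelet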
[11:17:33] Setting up kubernetes-node (1.23.14-1) ...
[11:17:33] ...
[11:17:34] Created symlink /etc/systemd/system/cadvisor.service → /lib/systemd/system/kubelet.service.
[11:17:34] ...
[11:17:48] ok, found why that symlink existed in the first place
[11:19:06] ok, the new package (.14-2) correctly doesn't create that symlink
[11:21:17] ok, kubestage1004 is now catching up
[11:21:46] btullis: I think you are now good to go
[11:22:49] I'll write up a summary under https://phabricator.wikimedia.org/T337836 and cleanup steps after I am done with some errands
[11:28:20] akosiaris: Thank you so much. That's super-helpful.
[12:30:54] akosiaris: sorry, just seen the msg, yes :(
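The actual write-up and cleanup steps were tracked in https://phabricator.wikimedia.org/T337836. As a rough sketch only, and assuming the stray cadvisor.service symlink is the sole problem on the node, recovery would typically look something like this:

  # Upgrade to the fixed kubernetes-node package (1.23.14-2), which no longer creates the symlink
  sudo apt-get install kubernetes-node

  # Remove the stray alias symlink so the cadvisor drop-in no longer overrides the kubelet unit
  sudo rm /etc/systemd/system/cadvisor.service
  sudo systemctl daemon-reload
  sudo systemctl restart kubelet

  # Confirm the node reports Ready again and the stuck pod finally terminates
  kubectl get node kubestage1004 -w
  kubectl -n eventgate-main get pods

  # Only as a last resort, if the pod stays wedged after the node recovers
  kubectl -n eventgate-main delete pod eventgate-production-78b98d8fb5-2h9fz --grace-period=0 --force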