[09:58:16] current status: https://prod-soju-upload.public.object.majava.org/taavi/XXojcjjCaADVwpPyfDqVeaLH-reimaging.png
[10:16:08] :D
[12:23:50] * dcaro around for the afternoon
[13:15:18] quarry's apparently down again
[13:17:16] quarry-127b-3lqizumia4xn-node-1 is somehow upset?
[13:17:22] 👀
[13:18:14] nothing in the console, so rebooting
[13:18:32] message: Kubelet stopped posting node status.
[13:20:15] that doesn't seem to have helped
[13:21:49] the node shows up
[13:21:51] (now)
[13:21:58] yeah, the node is fine
[13:22:22] │ [2025-05-27 13:22:18 +0000] [1] [ERROR] Worker (pid:10) was sent SIGKILL! Perhaps out of memory? │
[13:22:25] the web service still isn't responding
[13:22:30] from the logs of the web worker
[13:22:37] huh
[13:23:07] it seems it's stabilized?
[13:23:35] yes, and the UI loads now again
[13:24:52] :/, maybe the process uses a bump of memory on startup, and took a few rounds to start all the python workers?
[13:25:04] I don't think we have any clue yet why that failed in the first place
[13:25:36] is something filling up redis causing that to go OOM or something?
[13:25:41] would be cool to have any data on that
[13:26:20] yep, that was after you rebooted the node, the original cause is probably something else
[13:26:41] maybe redis, we are not getting any stats on the cluster are we?
[13:27:20] nope!
[13:27:40] I'd maybe set a memory limit on the redis pods, and if that turns nodes freezing into the pod OOM killing then we have a good suspect
[13:30:05] I mean, if the node froze, any pods in it would potentially misbehave right?
[13:30:27] but why did it freeze?
[13:32:41] i think it running out of memory would match what we're seeing
[13:32:53] but again, we have no real visibility inside the cluster
[13:33:16] ack
[13:42:49] Can k8s be configured to notice/kill unresponsive pods? Or are they unresponsive in a way that still passes the health check?
[13:44:47] * andrewbogott is off today but can't resist making unhelpful suggestions
[13:46:18] it can, but that requires the node to be responsive enough to do that
[13:47:00] as i said quarry could use resource limits on the pods
[13:47:57] So... the VM itself is oom?
[13:48:55] kubelet's not responding to any requests. magnum is a black box which doesn't give any data whether that's an OOM or something else, but that's a likely theory
[13:50:04] :/ ok
[13:50:41] So yeah, then limiting the pods seems like the next step. That and maybe just growing the whole cluster. Since this is a newish issue it seems likely to be due to increased use
[15:21:12] quick review for lima-kilo restartability https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/245
[15:21:27] (/me spent some time today restarting it to test stuff xd)
[15:28:25] dcaro: +1d
[15:28:35] thanks!
[16:37:08] prometheus cert update patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151264 (current alerts)
[16:37:55] dcaro: tag with https://phabricator.wikimedia.org/T395227?
[16:38:06] 🤦‍♂️
[16:39:55] done
[16:40:22] ship it
[16:41:08] thanks!
[16:45:04] thanks for dealing with that, I've been procrastinating
[16:47:34] me too during the afternoon... I have 3 tabs with the runbook open from seeing the alert, deciding to do it, clicking the runbook and getting distracted xd
[16:48:44] dhinus: thanks for asking in the k8s sig
[16:51:06] np, I had more questions but there was not much time... I might try the IRC channel
[17:02:24] * dhinus offline
[17:05:21] * dcaro off cya in a week and something!
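
For reference, the memory-limit idea from 13:27:40 / 13:47:00 would look roughly like the sketch below. The namespace, deployment, container and image names and the sizes are assumptions for illustration, not taken from the actual Quarry manifests; the point is only that a memory limit turns node-level memory starvation into a visible, per-container OOM kill recorded in the pod's events.

# Hypothetical sketch: memory request/limit on the redis deployment so that
# runaway memory use OOM-kills the container instead of wedging the node.
# All names, images and sizes below are assumptions, not Quarry's real config.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis          # assumed name
  namespace: quarry    # assumed namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7          # assumed image/tag
          resources:
            requests:
              memory: "256Mi"     # placeholder, tune to observed usage
              cpu: "100m"
            limits:
              memory: "512Mi"     # above this the container is OOM-killed, not the node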
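
On the 13:42:49 question: Kubernetes can restart unresponsive containers via a liveness probe, with the caveat from 13:46:18 that the kubelet on the node has to stay healthy enough to actually run the probe, so it does not help if the whole node is wedged. A minimal sketch, assuming a hypothetical Quarry web pod serving HTTP on port 8000 with a health endpoint at /; the real pod spec, port and path may differ.

# Hypothetical sketch: liveness probe on the web container so the kubelet
# restarts it when its HTTP endpoint stops answering. Names, port and path
# are assumptions about the Quarry web pod.
apiVersion: v1
kind: Pod
metadata:
  name: quarry-web-example
  namespace: quarry
spec:
  containers:
    - name: web
      image: example/quarry-web:latest   # placeholder image
      ports:
        - containerPort: 8000
      livenessProbe:
        httpGet:
          path: /                 # assumed health endpoint
          port: 8000
        initialDelaySeconds: 30   # allow for the slow worker startup seen at 13:24:52
        periodSeconds: 10
        failureThreshold: 3       # ~30s unresponsive before the kubelet restarts it
      resources:
        limits:
          memory: "1Gi"           # placeholder; also bounds the gunicorn workers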