[09:58:16] current status: https://prod-soju-upload.public.object.majava.org/taavi/XXojcjjCaADVwpPyfDqVeaLH-reimaging.png
[10:16:08] :D
[12:23:50] * dcaro around for the afternoon
[13:15:18] quarry's apparently down again
[13:17:16] quarry-127b-3lqizumia4xn-node-1 is somehow upset?
[13:17:22] 👀
[13:18:14] nothing in the console, so rebooting
[13:18:32] message: Kubelet stopped posting node status.
[13:20:15] that doesn't seem to have helped
[13:21:49] the node shows up
[13:21:51] (now)
[13:21:58] yeah, the node is fine
[13:22:22] │ [2025-05-27 13:22:18 +0000] [1] [ERROR] Worker (pid:10) was sent SIGKILL! Perhaps out of memory? │
[13:22:25] the web service still isn't responding
[13:22:30] from the logs of the web worker
[13:22:37] huh
[13:23:07] it seems it's stabilized?
[13:23:35] yes, and the UI loads now again
[13:24:52] :/, maybe the process uses a bump of memory on startup, and took a few rounds to start all the python workers?
[13:25:04] I don't think we have any clue yet why that failed in the first place
[13:25:36] is something filling up redis causing that to go OOM or something?
[13:25:41] would be cool to have any data on that
[13:26:20] yep, that was after you rebooted the node, the original cause is probably something else
[13:26:41] maybe redis, we are not getting any stats on the cluster are we?
[13:27:20] nope!
[13:27:40] I'd maybe set a memory limit on the redis pods, and if that turns nodes freezing into the pod OOM killing then we have a good suspect
[13:30:05] I mean, if the node froze, any pods in it would potentially misbehave right?
[13:30:27] but why did it freeze?
[13:32:41] i think it running out of memory would match what we're seeing
[13:32:53] but again, we have no real visibility inside the cluster
[13:33:16] ack
[13:42:49] Can k8s be configured to notice/kill unresponsive pods? Or are they unresponsive in a way that still passes the health check?
[13:44:47] * andrewbogott is off today but can't resist making unhelpful suggestions
[13:46:18] it can, but that requires the node to be responsive enough to do that
[13:47:00] as i said quarry could use resource limits on the pods
[13:47:57] So... the VM itself is oom?
[13:48:55] kubelet's not responding to any requests. magnum is a black box which doesn't give any data whether that's an OOM or something else, but that's a likely theory
[13:50:04] :/ ok
[13:50:41] So yeah, then limiting the pods seems like the next step. That and maybe just growing the whole cluster. Since this is a newish issue it seems likely to be due to increased use
[15:21:12] quick review for lima-kilo restartability https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/245
[15:21:27] (/me spent some time today restarting it to test stuff xd)
[15:28:25] dcaro: +1d
[15:28:35] thanks!
[16:37:08] prometheus cert update patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151264 (current alerts)
[16:37:55] dcaro: tag with https://phabricator.wikimedia.org/T395227?
[16:38:06] 🤦‍♂️
[16:39:55] done
[16:40:22] ship it
[16:41:08] thanks!
[16:45:04] thanks for dealing with that, I've been procrastinating
[16:47:34] me too during the afternoon... I have 3 tabs with the runbook open from seeing the alert, deciding to do it, clicking the runbook and getting distracted xd
[16:48:44] dhinus: thanks for asking in the k8s sig
[16:51:06] np, I had more questions but there was not much time... I might try the IRC channel
[17:02:24] * dhinus offline
[17:05:21] * dcaro off cya in a week and something!
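
For reference, the memory-limit idea from 13:27:40 / 13:47:00 would look roughly like the sketch below. The namespace, deployment, container and image names and the sizes are assumptions for illustration, not taken from the actual Quarry manifests; the point is only that a memory limit turns node-level memory starvation into a visible, per-container OOM kill recorded in the pod's events.

# Hypothetical sketch: memory request/limit on the redis deployment so that
# runaway memory use OOM-kills the container instead of wedging the node.
# All names, images and sizes below are assumptions, not Quarry's real config.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis          # assumed name
  namespace: quarry    # assumed namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7          # assumed image/tag
          resources:
            requests:
              memory: "256Mi"     # placeholder, tune to observed usage
              cpu: "100m"
            limits:
              memory: "512Mi"     # above this the container is OOM-killed, not the node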
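
On the 13:42:49 question: Kubernetes can restart unresponsive containers via a liveness probe, with the caveat from 13:46:18 that the kubelet on the node has to stay healthy enough to actually run the probe, so it does not help if the whole node is wedged. A minimal sketch, assuming a hypothetical Quarry web pod serving HTTP on port 8000 with a health endpoint at /; the real pod spec, port and path may differ.

# Hypothetical sketch: liveness probe on the web container so the kubelet
# restarts it when its HTTP endpoint stops answering. Names, port and path
# are assumptions about the Quarry web pod.
apiVersion: v1
kind: Pod
metadata:
  name: quarry-web-example
  namespace: quarry
spec:
  containers:
    - name: web
      image: example/quarry-web:latest   # placeholder image
      ports:
        - containerPort: 8000
      livenessProbe:
        httpGet:
          path: /                 # assumed health endpoint
          port: 8000
        initialDelaySeconds: 30   # allow for the slow worker startup seen at 13:24:52
        periodSeconds: 10
        failureThreshold: 3       # ~30s unresponsive before the kubelet restarts it
      resources:
        limits:
          memory: "1Gi"           # placeholder; also bounds the gunicorn workers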