[13:28:29] !log clouddb-services restarting haproxy on clouddb-wikireplicas-proxy to kill long-lived mariadb connections
[13:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[17:15:41] !log tools.lexeme-forms deployed 54a614fd41 (fix some spacing)
[17:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[21:44:31] Hi all, CBNG's instance seems to be stuck on 'Terminating', how can one kill it?
[22:00:06] RichSmith: can you still ssh to it or nah?
[22:01:53] mutante: Not sure, I don't have a clue when it comes to Kube lol
[22:03:11] Warning FailedKillPod 47s (x1026 over 12h) kubelet error killing pod: failed to "KillContainer" for "bot" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: cannot stop container: 0e3d4e9bd10f4be3ca14f26b3ddebbf93ffe64a51aedce75cba6fe4ce52ef84f: tried to kill container, but did not receive an exit event"
[22:03:19] I do see that in the describe tho
[22:04:33] RichSmith: Sorry, I misunderstood. I thought when you said "instance" it was about a VM, a Cloud VPS project
[22:04:46] seems like you are in Toolforge
[22:04:55] CBNG = cluebot?
[22:05:33] I think the people appearing in this log might be able to help: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng/SAL
[22:12:49] mutante: Ah, sorry, yea, it's Toolforge, and yes, it's ClueBot
[22:17:33] bd808: Can you help?
[22:18:04] * bd808 reads backscroll
[22:18:41] hmmm... that kind of pod "sticking" has meant the k8s node was in trouble, in the past
[22:19:00] RichSmith: what is the tool name? cluebot?
[22:19:16] bd808: cluebotng
[22:20:25] * bd808 is poking at things
[22:26:20] !log tools Cordon, drain, and restart tools-k8s-worker-81. Instance appears to have pods from tools.cluebotng that are unresponsive to kubectl commands.
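The cordon/drain cycle logged above can be sketched as shell commands. This is a hedged reconstruction, not the exact invocation used: the node name comes from the log, the drain flags are common defaults for a node with DaemonSet pods, and the `echo` prefix keeps everything a dry run (remove it to actually act on the cluster, which requires admin credentials).

```shell
# Dry-run sketch of draining a misbehaving Toolforge k8s worker.
# Node name taken from the log; flags are illustrative assumptions.
NODE=tools-k8s-worker-81

# 1. Cordon: mark the node unschedulable so no new pods land on it.
echo kubectl cordon "$NODE"

# 2. Drain: evict everything except DaemonSet-managed pods.
#    A badly hung pod may never evict, forcing a node reboot instead.
echo kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=300s

# 3. After the node is rebooted and healthy, allow scheduling again.
echo kubectl uncordon "$NODE"
```

In the log, the drain stalled on the hung cluebotng pod, which is why bd808 fell back to a soft and then a hard reboot of the instance.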
[22:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:28:32] fun times waiting for the k8s core to give up on evicting the unresponsive pods. :thumb-twiddle:
[22:31:34] everything other than the cluebotng pod and daemonset things is off the node. k8s is still trying to evict the cluebotng pod, which is quite obviously hung badly.
[22:32:06] Thanks for this bd808
[22:33:19] !log tools Soft reboot of tools-k8s-worker-81
[22:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:35:16] !log tools Hard reboot of tools-k8s-worker-81
[22:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:37:30] RichSmith: You can try to start things up again now
[22:38:37] bd808: Ok, will do, thanks
[22:46:39] bd808: It's getting hit by https://phabricator.wikimedia.org/T352055
[22:46:47] Warning FailedCreate replicaset/cbng-6985548c8b Error creating: pods "cbng-6985548c8b-hfxdv" is forbidden: exceeded quota: tool-cluebotng, requested: requests.memory=4452Mi, used: requests.memory=0, limited: requests.memory=4Gi
[22:48:25] RichSmith: ah. taavi has a plan to fix that I think, but you probably need help sooner rather than later. Let me see if I can find the reminder docs about how to bump cluebotng's specific quota.
[22:48:36] bd808: Cheers
[22:51:03] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Quota_management is apparently the magic.
[22:56:41] bd808: Damian got it going by bumping its memory down, he said it should be fine
[22:57:33] RichSmith: excellent. I was fumbling through trying to give you more quota, but if y'all can wait for t.aavi to make the better fix, that's perfect.
[22:58:14] Yea, that's fine. Damian did note though: 'I think the MySQL limits are actually what limit the bot more than quota, we just occasionally get OOMs'
[23:01:15] RichSmith: you can request higher RAM quota than the defaults if it seems like it might help -- .
There is a similar process to ask for more db connections too -- https://phabricator.wikimedia.org/project/view/4481/
[23:11:18] !log tools Drained and hard rebooted tools-k8s-worker-40. K8s was showing inconsistent status of the node (offline per k8s-status tool, online per kubectl)
[23:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
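The FailedCreate error in the log above is a per-namespace ResourceQuota rejection: the ReplicaSet's memory request (4452Mi) exceeds the tool's 4Gi (4096Mi) limit, so Kubernetes refuses to create the pod at all. A minimal sketch of checking that from the admin side, assuming the namespace name from the log and keeping the kubectl call a dry run via `echo` (the actual quota-bump procedure is the wiki page linked earlier):

```shell
# Namespace and figures taken from the quota error in the log.
NS=tool-cluebotng
REQ_MI=4452      # requests.memory the ReplicaSet asked for, in Mi
LIMIT_MI=4096    # the 4Gi quota limit, in Mi

# The request overshoots the quota, so the pod is forbidden:
echo "over quota by $((REQ_MI - LIMIT_MI))Mi"
# prints: over quota by 356Mi

# Inspect current usage vs. limits for the tool's namespace (dry run):
echo kubectl describe resourcequota -n "$NS"
```

This also matches Damian's workaround in the log: lowering the pod's memory request back under the 4Gi ceiling avoids the rejection without touching the quota itself.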