[13:28:29] !log clouddb-services restarting haproxy on clouddb-wikireplicas-proxy to kill long-lived mariadb connections
[13:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Clouddb-services/SAL
[17:15:41] !log tools.lexeme-forms deployed 54a614fd41 (fix some spacing)
[17:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[21:44:31] Hi all, CBNG's instance seems to be stuck on 'Terminating', how can one kill it?
[22:00:06] RichSmith: can you still ssh to it or nah?
[22:01:53] mutante: Not sure, I don't have a clue when it comes to Kube lol
[22:03:11] Warning FailedKillPod 47s (x1026 over 12h) kubelet error killing pod: failed to "KillContainer" for "bot" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: cannot stop container: 0e3d4e9bd10f4be3ca14f26b3ddebbf93ffe64a51aedce75cba6fe4ce52ef84f: tried to kill container, but did not receive an exit event"
[22:03:19] I do see that in the describe tho
[22:04:33] RichSmith: Sorry, I misunderstood. I thought when you said "instance" it was about a VM, a Cloud VPS project
[22:04:46] seems like you are in Toolforge
[22:04:55] CBNG = cluebot?
[22:05:33] I think the people appearing in this log might be able to help: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng/SAL
[22:12:49] mutante: Ah, sorry, yea, it's Toolforge, and yes, it's ClueBot
[22:17:33] bd808: Can you help?
[22:18:04] * bd808 reads backscroll
[22:18:41] hmmm... that kind of pod "sticking" has meant the k8s node was in trouble, in the past
[22:19:00] RichSmith: what is the tool name? cluebot?
[22:19:16] bd808: cluebotng
[22:20:25] * bd808 is poking at things
[22:26:20] !log tools Cordon, drain, and restart tools-k8s-worker-81. Instance appears to have pods from tools.cluebotng that are unresponsive to kubectl commands.
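The cordon/drain cycle logged above can be sketched as shell commands. This is a hedged reconstruction, not the exact invocation used: the node name comes from the log, the drain flags are common defaults for a node with DaemonSet pods, and the `echo` prefix keeps everything a dry run (remove it to actually act on the cluster, which requires admin credentials).

```shell
# Dry-run sketch of draining a misbehaving Toolforge k8s worker.
# Node name taken from the log; flags are illustrative assumptions.
NODE=tools-k8s-worker-81

# 1. Cordon: mark the node unschedulable so no new pods land on it.
echo kubectl cordon "$NODE"

# 2. Drain: evict everything except DaemonSet-managed pods.
#    A badly hung pod may never evict, forcing a node reboot instead.
echo kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=300s

# 3. After the node is rebooted and healthy, allow scheduling again.
echo kubectl uncordon "$NODE"
```

In the log, the drain stalled on the hung cluebotng pod, which is why bd808 fell back to a soft and then a hard reboot of the instance.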
[22:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:28:32] fun times waiting for the k8s core to give up on evicting the unresponsive pods. :thumb-twiddle:
[22:31:34] everything other than the cluebotng pod and daemonset things is off the node. k8s is still trying to evict the cluebotng pod, which is quite obviously hung badly.
[22:32:06] Thanks for this bd808
[22:33:19] !log tools Soft reboot of tools-k8s-worker-81
[22:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:35:16] !log tools Hard reboot of tools-k8s-worker-81
[22:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:37:30] RichSmith: You can try to start things up again now
[22:38:37] bd808: Ok, will do, thanks
[22:46:39] bd808: It's getting hit by https://phabricator.wikimedia.org/T352055
[22:46:47] Warning FailedCreate replicaset/cbng-6985548c8b Error creating: pods "cbng-6985548c8b-hfxdv" is forbidden: exceeded quota: tool-cluebotng, requested: requests.memory=4452Mi, used: requests.memory=0, limited: requests.memory=4Gi
[22:48:25] RichSmith: ah. taavi has a plan to fix that I think, but you probably need help sooner rather than later. Let me see if I can find the reminder docs about how to bump cluebotng's specific quota.
[22:48:36] bd808: Cheers
[22:51:03] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Quota_management is apparently the magic.
[22:56:41] bd808: Damian got it going by bumping its memory down, he said it should be fine
[22:57:33] RichSmith: excellent. I was fumbling through trying to give you more quota, but if y'all can wait for t.aavi to make the better fix, that's perfect.
[22:58:14] Yea, that's fine. Damian did note though: 'I think the MySQL limits are actually what limit the bot more than quota, we just occasionally get OOMs'
[23:01:15] RichSmith: you can request higher RAM quota than the defaults if it seems like it might help -- .
There is a similar process to ask for more db connections too -- https://phabricator.wikimedia.org/project/view/4481/
[23:11:18] !log tools Drained and hard rebooted tools-k8s-worker-40. K8s was showing inconsistent status of the node (offline per k8s-status tool, online per kubectl)
[23:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
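The FailedCreate error in the log above is a per-namespace ResourceQuota rejection: the ReplicaSet's memory request (4452Mi) exceeds the tool's 4Gi (4096Mi) limit, so Kubernetes refuses to create the pod at all. A minimal sketch of checking that from the admin side, assuming the namespace name from the log and keeping the kubectl call a dry run via `echo` (the actual quota-bump procedure is the wiki page linked earlier):

```shell
# Namespace and figures taken from the quota error in the log.
NS=tool-cluebotng
REQ_MI=4452      # requests.memory the ReplicaSet asked for, in Mi
LIMIT_MI=4096    # the 4Gi quota limit, in Mi

# The request overshoots the quota, so the pod is forbidden:
echo "over quota by $((REQ_MI - LIMIT_MI))Mi"
# prints: over quota by 356Mi

# Inspect current usage vs. limits for the tool's namespace (dry run):
echo kubectl describe resourcequota -n "$NS"
```

This also matches Damian's workaround in the log: lowering the pod's memory request back under the 4Gi ceiling avoids the rejection without touching the quota itself.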