[00:02:16] Looking at the age of the current worker nodes I guess I'm not surprised that we need more added. It looks like the majority of the fleet was built over 3 years ago. There are just a handful of nodes that we built in the last year.
[00:02:17] it seems like at least part of that load is cronjob launches piling up
[00:02:44] first new node joined the cluster and immediately filled up
[00:03:37] bd808: mind filing a task to add monitoring for cluster capacity?
[00:03:47] will do
[00:07:22] T352581
[00:07:23] T352581: Monitoring and alerting is needed for Kubernetes cluster capacity - https://phabricator.wikimedia.org/T352581
[00:10:50] ouch. looks like we peaked at 444 pending containers -- https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1&forceLogin&from=now-3h&to=now&viewPanel=19
[00:10:52] it seems like the backlog is going back down. I'll let the remaining cookbook runs complete, for a total of 6 new nodes (89-94) with 48 cores / 96 GiB
[00:13:38] wmopbot's pod is still losing the scheduler lottery
[00:13:39] in case relevant, one of my cronjobs abruptly ended (code 255) around 2200
[00:14:39] JJMC89: that sounds like the cluster-wide reboot that might have triggered the entire capacity issue
[00:15:53] danilo: it finally got a slot to start! Sorry about that, but thank you for asking, which led to this debugging session.
[00:16:51] taavi: just a thought to mull over, we might be well served to use taints to separate CronJob workloads from other things.
[00:20:46] hm. I don't immediately get the benefits of that?
[00:21:45] My thought was that it might keep cronjobs from starving out webservices and manual bots
[00:22:30] I continue to assume that most periodic crons can wait an hour to run if necessary, but webservices are less flexible
[00:23:06] fair
[00:23:24] on the grid engine we separated web into its own queues for similar reasons
[00:24:15] the newest worker (tools-k8s-worker-94) did not instantly get filled!
[00:26:10] pending-state pods are down to 6!
[00:31:19] thank you bd808 and taavi
[00:31:50] we are definitely having some thundering herd behavior from k8s cron. At the half hour there was a spike of 60+ pending again, but this time they found exec nodes rather quickly
[00:31:56] yw danilo
[00:33:34] * bd808 wanders away to shovel snow
[00:39:04] different error when creating cluster now
[00:39:07] `Failed to create trustee or trust for Cluster: 75bdd68e-0b27-41ed-b50c-48aeaf17ea9a`
[10:03:34] Please, can someone kill job 3808658 "job2" from tool wikihistory (still using jstart)
[10:12:59] Wurgl: done
[10:13:13] thx
[10:15:55] BTW: Is there somewhere a kind of recipe (for dummies) for how to build a container with mono & php?
[10:16:45] And Q2: Is there a replacement for jlocal? I use this for a watchdog process that restarts the webserver
[10:28:59] no, there's no direct replacement for `jlocal`. The concept doesn't really map to Kubernetes, and Kubernetes is much better at not losing track of running things, so in most cases replacing jlocal is not necessary
[11:42:10] I presume wikimedia cloud sysadmins/staff are away today? (since it's the weekend?)
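For context on the taint idea floated at [00:16:51] above: below is a minimal, hedged sketch of how dedicating some workers to CronJob pods could look, assuming the official `kubernetes` Python client and cluster-admin credentials. The node name, taint key, and values are illustrative placeholders, not the actual Toolforge configuration or a decision made in this conversation.

```python
# Sketch only: taint a worker so cron pods get their own capacity and cannot
# starve webservices. Assumes the official `kubernetes` Python client.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run in-cluster
core = client.CoreV1Api()

# Mark a worker (hypothetical name) so only pods tolerating the taint schedule there;
# roughly what `kubectl taint nodes <node> <key>=<value>:NoSchedule` does.
core.patch_node(
    "tools-k8s-worker-94",
    {"spec": {"taints": [
        {"key": "workload.example/cronjob", "value": "true", "effect": "NoSchedule"}
    ]}},
)

# CronJob pod templates would then carry a matching toleration so they are
# allowed onto the dedicated workers, while other pods stay excluded.
toleration = client.V1Toleration(
    key="workload.example/cronjob",
    operator="Equal",
    value="true",
    effect="NoSchedule",
)
```

Note that a toleration only permits cron pods onto the tainted workers; actually pinning them there (so they do not also compete for the general pool) would additionally need a node selector or affinity in the CronJob pod template.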
[12:10:53] proc: probably better to ask your question and they'll answer on Monday or if they have spare time
[12:11:22] taa.vi is staff and was around 2 hours ago
[12:11:42] ah I did ask it yesterday, but my IRC bouncer will disconnect so I might've missed the answer (or will miss it if it's answered much later)
[12:11:46] I can ask again Monday I suppose
[12:12:16] proc: cloud-l probably better then
[12:12:25] The mailing list will work
[12:27:35] !log admin powercycle cloudvirt1063 T352595
[12:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:27:41] T352595: NodeDown - https://phabricator.wikimedia.org/T352595
[13:42:56] !log tools.bridgebot Restarted bot. Was missing from IRC channels (possibly all channels). Likely fallout from Kubernetes cluster issues yesterday.
[13:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[13:45:21] Looks like bridgebot died yesterday. I'll check on it.
[13:46:29] heh. that's neat. The bot found and relayed a Telegram message from the time it was disconnected. Too bad the IRC bouncer didn't do that in the other direction.
[13:48:36] The Telegram side of this channel missed messages starting from "[21:22:20]" in https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud/20231201.txt and continuing until "[13:42:56]" in https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud/20231202.txt
[13:52:29] The TL;DR for that period is that a lot of Kubernetes workloads were delayed because of insufficient Kubernetes worker capacity. After adding 6 new worker nodes things seemed to catch up. T352581 is a followup task to add some monitoring for k8s cluster capacity issues.
[13:52:30] T352581: Monitoring and alerting is needed for Kubernetes cluster capacity - https://phabricator.wikimedia.org/T352581
[14:15:53] bd808: I added a few more nodes this morning. See cloud-admin@ for a summary.
[14:17:52] (we'd probably be fine without the ones from this morning, but I don't want it all to break when all of us are on a plane)
[14:19:53] taavi: seems like good thinking. I just tweaked the dashboard to put CPU and RAM allocations on separate graphs so it is easier to read what is happening there.
[20:29:48] https://t.me/wmtelegram_bot
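On the capacity monitoring follow-up summarized at [13:52:29] (T352581): the sketch below shows one way such a check could be expressed, assuming a Prometheus server that scrapes kube-state-metrics. The endpoint URL and threshold are placeholders, and the real task would more likely be implemented as a Prometheus alerting rule rather than a standalone script.

```python
# Sketch only: count cluster-wide Pending pods via the Prometheus HTTP API as a
# rough capacity signal, similar to what T352581 alerting could watch.
# Assumes kube-state-metrics is scraped; URL and threshold are made up.
import requests

PROMETHEUS = "https://prometheus.example.org"  # placeholder endpoint
PENDING_QUERY = 'sum(kube_pod_status_phase{phase="Pending"})'
THRESHOLD = 50  # the incident above saw spikes of 60+ and a peak of 444

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query", params={"query": PENDING_QUERY}, timeout=10
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
pending = float(result[0]["value"][1]) if result else 0.0

if pending > THRESHOLD:
    print(f"ALERT: {pending:.0f} pods pending -- cluster may be out of capacity")
else:
    print(f"OK: {pending:.0f} pods pending")
```

An alerting rule on the same expression could also watch requested vs. allocatable CPU and memory, which matches the dashboard split into separate CPU and RAM graphs mentioned at [14:19:53].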