[00:02:16] Looking at the age of the current worker nodes I guess I'm not surprised that we need more added. It looks like the majority of the fleet was built over 3 years ago. There are just a handful of nodes that we built in the last year.
[00:02:17] it seems like at least part of that load is cronjob launches piling up
[00:02:44] first new node joined the cluster and immediately filled up
[00:03:37] bd808: mind filing a task to add monitoring for cluster capacity?
[00:03:47] will do
[00:07:22] T352581
[00:07:23] T352581: Monitoring and alerting is needed for Kubernetes cluster capacity - https://phabricator.wikimedia.org/T352581
[00:10:50] ouch. looks like we peaked at 444 pending containers -- https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1&forceLogin&from=now-3h&to=now&viewPanel=19
[00:10:52] it seems like the backlog is going back down. I'll let the remaining cookbook runs complete, for a total of 6 new nodes (89-94) with 48 cores / 96 GiB
[00:13:38] wmopbot's pod is still losing the scheduler lottery
[00:13:39] in case relevant, one of my cronjobs abruptly ended (code 255) around 2200
[00:14:39] JJMC89: that sounds like the cluster-wide reboot that might have triggered the entire capacity issue
[00:15:53] danilo: it finally got a slot to start! Sorry about that, but thank you for asking, which led to this debugging session.
[00:16:51] taavi: just a thought to mull over, we might be well served to use taints to separate CronJob workloads from other things.
[00:20:46] hm. I don't immediately get the benefits of that?
[00:21:45] My thought was that it might keep cronjobs from starving out webservices and manual bots
[00:22:30] I continue to assume that most periodic crons can wait an hour to run if necessary, but webservices are less flexible
[00:23:06] fair
[00:23:24] on the grid engine we separated web into its own queues for similar reasons
[00:24:15] the newest worker (tools-k8s-worker-94) did not instantly get filled!
[00:26:10] pending-state pods are down to 6!
[00:31:19] thank you bd808 and taavi
[00:31:50] we are definitely having some thundering herd behavior from k8s cron. At the half hour there was a spike of 60+ pending again, but this time they found exec nodes rather quickly
[00:31:56] yw danilo
[00:33:34] * bd808 wanders away to shovel snow
[00:39:04] different error when creating cluster now
[00:39:07] `Failed to create trustee or trust for Cluster: 75bdd68e-0b27-41ed-b50c-48aeaf17ea9a`
[10:03:34] Please, can someone kill job 3808658 "job2" from tool wikihistory (still using jstart)
[10:12:59] Wurgl: done
[10:13:13] thx
[10:15:55] BTW: Is there somewhere a kind of recipe (for dummies) for how to build a container with mono & php?
[10:16:45] And Q2: Is there a replacement for jlocal? I use this for a watchdog process that restarts the webserver
[10:28:59] no, there's no direct replacement for `jlocal`. The concept doesn't really map to Kubernetes, and Kubernetes is much better at not losing track of running things, so in most cases replacing jlocal is not necessary
[11:42:10] I presume wikimedia cloud sysadmins/staff are away today? (since it's the weekend?)
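For context on the taint idea floated at [00:16:51] above: below is a minimal, hedged sketch of how dedicating some workers to CronJob pods could look, assuming the official `kubernetes` Python client and cluster-admin credentials. The node name, taint key, and values are illustrative placeholders, not the actual Toolforge configuration or a decision made in this conversation.

```python
# Sketch only: taint a worker so cron pods get their own capacity and cannot
# starve webservices. Assumes the official `kubernetes` Python client.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run in-cluster
core = client.CoreV1Api()

# Mark a worker (hypothetical name) so only pods tolerating the taint schedule there;
# roughly what `kubectl taint nodes <node> <key>=<value>:NoSchedule` does.
core.patch_node(
    "tools-k8s-worker-94",
    {"spec": {"taints": [
        {"key": "workload.example/cronjob", "value": "true", "effect": "NoSchedule"}
    ]}},
)

# CronJob pod templates would then carry a matching toleration so they are
# allowed onto the dedicated workers, while other pods stay excluded.
toleration = client.V1Toleration(
    key="workload.example/cronjob",
    operator="Equal",
    value="true",
    effect="NoSchedule",
)
```

Note that a toleration only permits cron pods onto the tainted workers; actually pinning them there (so they do not also compete for the general pool) would additionally need a node selector or affinity in the CronJob pod template.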
[12:10:53] proc: probably better to ask your question and they'll answer on Monday or if they have spare time
[12:11:22] taa.vi is staff and was around 2 hours ago
[12:11:42] ah I did ask it yesterday, but my IRC bouncer will disconnect so I might've missed the answer (or will miss it if it's answered much later)
[12:11:46] I can ask again Monday I suppose
[12:12:16] proc: cloud-l probably better then
[12:12:25] The mailing list will work
[12:27:35] !log admin powercycle cloudvirt1063 T352595
[12:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:27:41] T352595: NodeDown - https://phabricator.wikimedia.org/T352595
[13:42:56] !log tools.bridgebot Restarted bot. Was missing from IRC channels (possibly all channels). Likely fallout from Kubernetes cluster issues yesterday.
[13:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[13:45:21] Looks like bridgebot died yesterday. I'll check on it.
[13:46:29] heh. that's neat. The bot found and relayed a Telegram message from the time it was disconnected. Too bad the IRC bouncer didn't do that in the other direction.
[13:48:36] The Telegram side of this channel missed messages starting from "[21:22:20]" in https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud/20231201.txt and continuing until "[13:42:56]" in https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud/20231202.txt
[13:52:29] The TL;DR for that period is that a lot of Kubernetes workloads were delayed because of insufficient Kubernetes worker capacity. After adding 6 new worker nodes things seemed to catch up. T352581 is a followup task to add some monitoring for k8s cluster capacity issues.
[13:52:30] T352581: Monitoring and alerting is needed for Kubernetes cluster capacity - https://phabricator.wikimedia.org/T352581
[14:15:53] bd808: I added a few more nodes this morning. See cloud-admin@ for a summary.
[14:17:52] (we'd probably be fine without the ones from this morning, but I don't want it all to break when all of us are on a plane)
[14:19:53] taavi: seems like good thinking. I just tweaked the dashboard to put CPU and RAM allocations on separate graphs so it is easier to read what is happening there.
[20:29:48] https://t.me/wmtelegram_bot
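On the capacity monitoring follow-up summarized at [13:52:29] (T352581): the sketch below shows one way such a check could be expressed, assuming a Prometheus server that scrapes kube-state-metrics. The endpoint URL and threshold are placeholders, and the real task would more likely be implemented as a Prometheus alerting rule rather than a standalone script.

```python
# Sketch only: count cluster-wide Pending pods via the Prometheus HTTP API as a
# rough capacity signal, similar to what T352581 alerting could watch.
# Assumes kube-state-metrics is scraped; URL and threshold are made up.
import requests

PROMETHEUS = "https://prometheus.example.org"  # placeholder endpoint
PENDING_QUERY = 'sum(kube_pod_status_phase{phase="Pending"})'
THRESHOLD = 50  # the incident above saw spikes of 60+ and a peak of 444

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query", params={"query": PENDING_QUERY}, timeout=10
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
pending = float(result[0]["value"][1]) if result else 0.0

if pending > THRESHOLD:
    print(f"ALERT: {pending:.0f} pods pending -- cluster may be out of capacity")
else:
    print(f"OK: {pending:.0f} pods pending")
```

An alerting rule on the same expression could also watch requested vs. allocatable CPU and memory, which matches the dashboard split into separate CPU and RAM graphs mentioned at [14:19:53].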