[00:27:14] * bd808 off
[07:42:14] * Reedy waits for horizon to load
[08:04:45] that worker is the one that was stuck on NFS and I restarted earlier
[08:16:16] ?
[08:20:09] tools-k8s-worker-nfs-56,
[08:20:36] (from the comment before)
[08:21:34] I must have lost something
[08:22:00] https://usercontent.irccloud-cdn.com/file/A1PvhD9s/image.png
[08:22:00] ` Warning FailedMount 29s (x7 over 111s) kubelet MountVolume.SetUp failed for volume "kube-api-access-4hdj9" : [failed to fetch token: serviceaccounts "default" is forbidden: User "system:node:tools-k8s-worker-nfs-56" cannot create resource "serviceaccounts/token" in API group "" in the namespace "tool-wikibugs-testing": no relationship found between node 'tools-k8s-worker-nfs-56' and this object, failed to sync configmap
[08:22:11] earlier
[08:22:22] yesterday afternoon
[08:22:23] oh ok!
[08:22:28] yes, I see that
[09:54:46] * arturo reimaging laptop
[10:16:52] this thing with debian, with 10000 ISO image options to download, and 9999 of them missing firmware, it is ridiculous
[10:17:35] I think the one that the big download button on https://www.debian.org/ links to has the firmware these days?
[10:17:48] but that one is stable
[10:17:50] I want testing
[10:18:31] I've always just installed stable and then upgraded to testing
[10:18:46] on the other hand, the testing installer image I just tried is 1) missing firmware for the wifi card, 2) affected by debian bug #1067831
[10:19:31] I might need to do the upgrade this time
[10:46:26] not sure what you mean, the default testing netinst images available at https://cdimage.debian.org/cdimage/daily-builds/daily/arch-latest/amd64/iso-cd/ do include firmware?
[10:46:59] which is the default one being pointed to if one selects testing
[10:47:20] but yeah, with the t64 mess it's a poor time to install testing possibly :-)
[10:50:47] moritzm: the installer warned me about missing iwlwifi firmware :-( it is true I did not use the daily installer build, but a weekly one I think
[10:52:08] https://saimei.ftp.acc.umu.se/cdimage/daily-builds/daily/arch-latest/amd64/iso-cd/debian-testing-amd64-netinst.iso <-- no, it was daily
[11:01:22] ah, iwlwifi is still special I think
[11:01:45] IIRC it cannot be redistributed unless one accepts the EULA in debconf or similar crap
[11:02:39] I see
[11:02:54] but also not for all cards, my Lenovo X1 also has some wifi card managed by iwlwifi, but I could install it via regular d-i over wifi
[11:59:06] arturo: are you planning to release a new version of jobs-cli or should I go for it?
[11:59:19] (with the healthcheck fix)
[12:02:41] dcaro: mmm
[12:02:58] sent https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/25 (being proactive xd)
[12:03:18] thanks, please go ahead and do it
[12:03:24] I'm still fighting the laptop
[12:03:44] 👍
[12:04:13] this patch LGTM, thanks!
[12:43:22] alerts.wikimedia.org can now silence metricsinfra alerts
[12:47:06] taavi: 🎉 good work
[12:47:41] and silencing metricsinfra alerts from cookbooks is basically just pending a new spicerack release
[12:58:47] * arturo food time
[13:24:34] \o/
[14:12:39] hm, I'm seeing errors when deploying jobs-api on worker-nfs-52, it fails to mount the secrets
[14:13:10] is that the same worker that was rebooted yesterday?
[14:13:15] MountVolume.SetUp failed for volume "kube-api-access-7zjsd" : failed to sync configmap cache: timed out waiting for the condition
[14:13:20] no, it's a different one
[14:19:58] I have no idea what that means
[14:20:13] does it have good connectivity with etcd?
[14:20:20] or the api server?
[14:21:48] it's non-responsive, trying the console
[14:22:08] how are the D procs?
[14:22:23] it does not show in the graphs
[14:23:26] wait, it's responding
[14:25:42] hmm, one of the pods for jobs-api is actually running there
[14:26:09] correctly, it seems
[14:26:13] maybe it's just a warning?
[14:29:58] where did you see the message?
[14:30:03] in `describe pod`?
[15:02:12] get events -n jobs-api
[15:46:58] * arturo offline
[15:48:08] * dcaro off
[15:51:58] Notes at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Monthly_meeting/2024-04-09. Please review and correct anything I summarized badly
[16:02:38] thx bd808
[16:21:08] dcaro: is there any reason to scale up core/ram/cpu for the cloudcephmon refresh? Or are we good with what we have? (It looks like the default spec we would get has 2x the RAM)
[16:25:42] andrewbogott: well, we don't have a problem right now, though having more memory allows for things like having a node down for longer (because it has to keep track of the things to shift around when it comes back), so it would be nicer, but not necessary
[16:25:58] if it's not a big price jump, I think it's ok to use the new default
[16:26:22] ok. Momentum seems to be towards scaling up so we can just go with that.
[16:49:47] old busted: openstack on kubernetes; new hotness: kubernetes on kubernetes with PXE boot images for the bare metal hosts at the bottom of the stack, all managed via Helm
[16:49:53] https://github.com/aenix-io/kubefarm
[16:51:41] that doesn't seem terrible, although I'm always puzzled about what people are thinking they'll use for a UI when migrating off of openstack to pure k8s.
[16:56:41] Is it expected that https://prometheus-alerts.wmcloud.org/ is down?
[16:57:51] Kubefarm is from a hosting company, so I imagine they were already pretty deeply invested in their own management console. Horizon is sort of a PoC system already for exposing OpenStack management to tenants. Blog post on the project at https://kubernetes.io/blog/2021/12/22/kubernetes-in-kubernetes-and-pxe-bootable-server-farm/
[16:59:21] (found while web searching for the truly cursed idea of running k8s from UEFI)
[17:00:58] I don't necessarily mean web UI, I just mean UI at all. K8s has an API but surely no public clouds are exposing the APIs needed to create your own cluster, right?
[17:01:08] (I mean, except via custom per-cloud web ui)
[17:01:47] that new alert is me, likely I just need to rerun puppet on the checker host
[18:23:05] * bd808 lunch
[23:41:27] * bd808 off
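
[Editor's note] For reference, a minimal sketch of the kind of kubectl inspection discussed above (`describe pod`, `get events -n jobs-api`), assuming kubectl access to the Toolforge cluster. The namespace (jobs-api) and node name (tools-k8s-worker-nfs-52) come from the log; the pod name used below is a placeholder.

```
# List recent events in the jobs-api namespace, newest last
# (this is where the FailedMount warnings show up)
kubectl get events -n jobs-api --sort-by=.lastTimestamp

# Find the jobs-api pods scheduled on the suspect worker
kubectl get pods -n jobs-api -o wide \
    --field-selector spec.nodeName=tools-k8s-worker-nfs-52

# Inspect one pod's events and volume mounts (pod name is an example)
kubectl describe pod jobs-api-abc123 -n jobs-api
```

As the log suggests ("maybe it's just a warning?"), these FailedMount events are emitted as warnings while the kubelet retries the mount, and the pod in question was in fact running, so they may not indicate a persistent failure.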