[07:19:46] okok cloudcephosd1006 is up and running
[08:31:38] dcaro: found a little bug in the phabricator plugin for terminator https://www.irccloud.com/pastebin/Q8L2jryh/ it is missing a `T` character in the destination URL
[08:31:53] oh yes, solved it right after xd
[08:33:08] 👍
[08:34:42] I'm trying now to get it to highlight the link without mouse interaction, so with keyboard only
[08:34:51] but that's trickier xd
[08:36:26] maybe you can at least change the color of the string? I'm not even sure if that's possible
[08:36:38] like make it bold, red, underlined, or something similar
[08:37:44] yep you can, I'm having index issues (the link that's highlighted is different from the one that gets opened)
[08:40:12] for starters I used one of the existing plugins that already does something similar
[08:53:43] dcaro: could you please stamp this one? https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/49
[08:54:41] done
[08:54:42] thanks!
[08:55:11] heads up, deploying a change to delete the per-tool-account PSP settings
[08:56:03] in case you are wondering, there are at the moment 0 kyverno policy violations in the cluster
[09:04:06] I deleted PSP from toolsbeta
[09:04:12] but I cannot create jobs
[09:04:13] Warning FailedCreate 13s (x4 over 63s) job-controller Error creating: pods "once-with-retry-" is forbidden: PodSecurityPolicy: unable to admit pod: []
[09:04:25] this is something I could not detect in my tests in lima-kilo
[09:04:39] if the PSP controller is enabled, there must be a PSP to permit the pod creation
[09:04:49] so the PSP controller needs to be disabled before deleting the PSP objects
[09:05:31] I see
[09:05:47] apparently I had the controller disabled in lima-kilo
[09:18:07] taavi: I can manually edit the static pod manifests to un-load the PSP controller, but I have doubts about the kubeadm configmap. I want to make sure the next update doesn't add the controller again
[09:18:53] I guess the question is: shall I just edit the cm by hand?
[09:29:20] I'll do it
[09:33:58] sorry, yes
[09:35:14] everything looks good now, I'm running the functional tests in both toolsbeta & lima-kilo before proceeding with tools
[09:46:35] heads up, disabling PodSecurityPolicy admission on tools
[09:46:52] scary
[09:46:58] :-S
[09:49:10] done
[09:49:18] now deleting per-tool-account PSP entries
[09:52:38] jobs-api timed out for me in tools
[09:52:58] now back online
[09:53:26] taavi: could you please test random stuff in tools?
[10:31:01] arturo: are you looking at 'Error: /Stage[main]/Kubeadm::Init_yaml/File[/etc/kubernetes/psp/base-pod-security-policies.yaml]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/kubeadm/psp/base-pod-security-policies.yaml'?
[10:52:57] no
[10:53:09] will fix now
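For reference, the PSP removal sequence discussed above looks roughly like the sketch below on a stock kubeadm control plane. The paths, flag names and configmap layout are the kubeadm defaults and are assumptions here, not necessarily what these clusters actually use.

    # 1. check whether the apiserver still loads the PodSecurityPolicy admission plugin
    #    (kubeadm renders the apiserver as a static pod manifest on each control-plane node)
    grep -- '--enable-admission-plugins' /etc/kubernetes/manifests/kube-apiserver.yaml

    # 2. drop "PodSecurityPolicy" from that flag and save; the kubelet restarts the
    #    apiserver on its own when the manifest changes

    # 3. keep kubeadm's own record in sync so a later 'kubeadm upgrade' does not
    #    re-add the plugin (the "edit the cm by hand" step from the chat)
    kubectl -n kube-system edit configmap kubeadm-config

    # 4. only once the admission plugin is disabled is it safe to delete the PSP objects
    kubectl get podsecuritypolicies
    kubectl delete podsecuritypolicy <name>

Doing it in the other order leaves the controller enabled with no policy to satisfy, and every new pod gets rejected with the 'unable to admit pod' error pasted above.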
[12:40:30] We warned: today I'm going to upgrade the cloudvirtlocal hosts to ovs, which means juggling etcd nodes. May result in alerts about degraded states &c.
[12:40:51] ok
[12:41:48] *Be warned
[12:45:35] ack
[12:58:31] * arturo food
[14:00:40] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/10 please :)
[14:00:50] also https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1050368
[14:19:42] topranks: cloudcephosd1007 is having the same dhcp issues as 1006 had :/
[14:19:55] (when reimaging)
[14:24:16] yep, same, I manually ran link up and udhcpc and it worked, then continued with the install, and the network stopped working again
[14:25:27] ok…
[14:25:56] I’d love to say I’d had a shower thought of what might be going on since we last spoke.. but no
[14:26:07] I'm trying to rerun udhcpc, but it's failing now :/
[14:26:31] link still showing up?
[14:27:08] yep
[14:27:26] on the host at least
[14:27:49] it says no carrier though
[14:27:49] NO-CARRIER,BROADCAST,MULTICAST,UP
[14:28:30] yeah that means it's down, but set to be up
[14:28:33] brought it down and up again and it came up
[14:28:42] 2: enp175s0f0np0: mtu 1500 qdisc mq qlen 1000
[14:29:04] and now the dhcp worked xd
[14:29:49] and down again :/
[14:30:08] speed config issues?
[14:30:30] switch side shows up
[14:30:48] nah you won't get those with 10G modules, they can only do 10/full
[14:30:59] it's up now again xd, manually flipped it off and on
[14:31:07] only on the older switches with RJ45 on the front do you get that, the modular ones will stick to the speed the inserted module does
[14:31:09] or not work at all
[14:31:31] https://usercontent.irccloud-cdn.com/file/sCpDHRji/image.png
[14:31:34] some pings are very slow
[14:31:40] switch side does show it bouncing up and down a few times
[14:32:44] oh, it has no default gateway now :/
[14:33:15] I had to run dhcp again after flipping it off and on
[14:33:21] now it seems to work :l
[14:33:38] this is no way to live
[14:33:45] I think that something in the way the debian installer configures the network (even if it's already done) messes it up?
[14:34:02] if I manually do that step and then skip it, it seems to work
[14:34:13] but yep, that's not nice (got ~40 hosts to go...)
[14:34:28] and have to do it again relatively soon xd
[14:34:45] maybe the installer for bookworm does not have those issues?
[14:36:54] I'll stop and try to debug the next one if it happens again
[14:38:34] taavi: did you manage to find the phab ticket about caching helm charts and container images?
[14:42:45] dcaro: didn’t realise you were doing them all
[14:42:54] yeah we’ll need to work it out
[14:46:34] I do them little by little, but they all need upgrading (first to bullseye, then bookworm + ceph version upgrade)
[14:47:33] yeah. ultimately I guess it's between dc-ops and i/f to work it out
[14:47:45] I'll mention it Monday in our I/F meeting to make people aware and see if anyone has any ideas
[14:49:47] ack, thanks, we'll see if it happens with the next one, but it's two in a row so far
[14:53:28] yeah if it was a one-off we could maybe forget about it, but it looks like some new type of bug
[14:53:44] for the next one maybe let’s get the firmware on the 100% known good one first
[14:59:35] andrewbogott: sorry, was in meetings
[14:59:53] no worries! I self-merged, it seems to be working
[15:00:05] i guess the resize api will work fine for local flavors, it won't try to migrate things or anything
[15:00:13] migrate them between nodes, that is
[15:00:34] actually it does! they must've finally fixed cold-migration
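As an aside, the manual workaround for the cloudcephosd reimage dhcp problem described earlier amounts to roughly the following from the debian-installer shell. The interface name comes from the paste above; the exact commands and ordering are a best-effort reconstruction, not a verified runbook.

    # the link shows NO-CARRIER even though it is administratively UP, so bounce it
    ip link set enp175s0f0np0 down
    ip link set enp175s0f0np0 up
    ip link show enp175s0f0np0        # wait until the carrier / LOWER_UP flag is back

    # re-run the busybox dhcp client to get the address, default route and DNS again
    udhcpc -i enp175s0f0np0

    # if local pings work but nothing beyond the subnet does, the default route is
    # probably missing again, as happened above
    ip route show default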
[15:39:23] I missed your last meeting taavi :(( Thanks again for all the work and ideas and just being a cool person to interact with.
[15:43:09] bd808: <3
[15:45:49] taavi: I just created T368630, if you find an old one, please merge
[15:45:50] T368630: toolforge: make sure we cache in our repos/registries all helm charts and container images used in k8s - https://phabricator.wikimedia.org/T368630
[16:59:54] * dhinus off
[17:01:02] I just updated https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Ongoing_Efforts/Toolforge_Upgrade_Workgroup/Upgrades_Overview#Upgrade_tasks_and_major_changes and assigned tasks to people, the order was pseudo-random (first in the list paired with last, then second with second to last...), feel free to change, the next round I'll do a different order to mix pairs up
[17:09:02] andrewbogott: I'm leaving cloudcephosd1007 getting added to the cluster, it will take some time, I'll re-check in a bit but if you see anything weird feel free to telegram me (or page me)
[17:09:26] ok!
[21:24:57] still rebalancing... might take a bit
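While cloudcephosd1007's OSDs are being added, the rebalancing can be followed with the standard ceph CLI, along these lines (nothing cluster-specific is assumed here):

    # overall health plus a running count of misplaced/degraded objects and recovery speed
    ceph -s

    # per-OSD utilisation; the new OSDs should slowly fill while the existing ones drain a little
    ceph osd df tree

    # backfill / recovery activity broken down per pool
    ceph osd pool stats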