[07:01:10] morning
[07:02:07] is someone working on tools-harbor-2? it's sending puppet emails because the puppet cert dance wasn't done
[07:11:29] greetings
[07:46:47] morning, I think Raymond_Ndib.e might be doing something there
[07:53:12] cloudgw1003 had some kernel errors it seems, from the logs it looks like the network drivers had some issue
[07:53:22] `[Sun Aug 10 14:37:56 2025] NETDEV WATCHDOG: eno12399np0 (bnxt_en): transmit queue 7 timed out`
[07:55:49] T401549
[07:55:50] T401549: KernelErrors - https://phabricator.wikimedia.org/T401549
[07:56:38] 👍
[08:30:06] what's the recommended way to run a root command on tools workers? I'm looking at T400223
[08:30:07] T400223: Investigate daily disconnections of IRC bots hosted in Toolforge - https://phabricator.wikimedia.org/T400223
[08:31:09] godog: you have cloudcumin1001 that can run cumin on VMs
[08:31:19] dcaro: thank you! will use that
[08:32:10] something like `cumin 'O{project:tools name:tools-k8s-worker}' ...`
[08:32:22] or by puppet class and such
[08:32:47] nope, can't use puppetdb queries there
[08:35:40] https://www.irccloud.com/pastebin/btb19LB0/
[08:35:45] ahh, for the puppet class, yep
[08:35:50] sorry, no puppet-class filtering
[08:36:17] ok thank you, yes I can live with that
[08:43:49] hmm... tools-harbor-2 is complaining about self-signed certs
[08:44:27] as I said earlier, it's because the puppet cert dance wasn't done, it's trying to validate the toolforge puppetserver cert against the generic cloud vps puppet server ca
[08:44:38] so basically it just needs the refresh puppet certs cookbook to be run against it
[08:45:01] yep
[08:45:02] on it
[08:46:17] andrewbogott: are you aware the clinic duty rotation says it's your turn?
[08:47:35] I think it might have slipped his mind, it was an unusual team meeting
[08:48:24] the error now is failing to install `docker-ce`, looking
[08:53:33] I think that the docker-ce repo we are using was removed
[08:53:53] docker-ce is the package from docker, not from debian itself, right?
[08:55:19] I think so yes
[08:55:38] I think we should use https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146091 instead
[08:55:50] (the replacement for docker-upstream provided packages)
[08:56:00] hmm
[08:56:11] so tools-harbor-1 was bullseye and -2 is now bookworm
[08:56:42] bullseye and bookworm both have docker 20 which is kind of old, and I guess harbor is using some feature only in the newer versions?
[08:57:01] trixie (which is now released) has docker 26, so we could give that a go as well now
[08:58:01] to be fair, I'm not sure that harbor is using any special feature, so might as well work with the debian provided one
[08:58:48] for trixie, we have to provide a base image and such right?
[08:59:02] uh not sure what you mean by that?
[08:59:19] like the base VM image
[08:59:30] (in openstack)
[09:00:13] yeah, and looking at the puppet history it seems like a.nderew is already on it
[09:16:36] that'd be nice to have, I would not block on it though
[09:47:12] sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/1177330 , should be the same if/when we move to trixie
[09:49:16] +1
[11:01:33] quick review? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1177333
[11:02:57] taavi: +1d
[11:03:22] thank you
[12:55:51] I hoped to have Trixie ready already but there's at least one wrinkle I don't understand yet.
[12:56:06] need any help with that?
[12:56:09] And, I didn't know I was on clinic duty! So thanks for the nudge, is there anything specific that I should deal with immediately?
[12:57:16] there's a couple of tasks in the clinic duty workboard column, and a project request, that has been sitting there for a while. but nothing unusual
[12:57:22] ok!
[12:58:01] I don't think I need help with Trixie yet, the next thing I need to do is just download/set up a raw upstream image so I can figure out what the deal is with the hostname which seems to be the source of the problem.
[13:00:18] I have also learned that late at night my eyes are not great at noticing the difference between debian-13-genericcloud-amd64.tar.xz and debian-13-genericcloud-arm64.tar.xz but I have managed to overcome that particular issue I think :)
[13:21:13] well great, cloud-init on Trixie doesn't seem to install the private key for direct access either
[13:22:04] do you have a test vm i can have a look at?
[13:22:53] You can make your own in codfw1dev testlabs
[13:23:03] (I say make your own because you'll want to include your own keypair)
[13:23:47] btw, what's the status of disabling the legacy network? I would very much like to avoid having any new trixie vms in the old vlan network
[13:25:21] https://phabricator.wikimedia.org/T399127#10989805
[13:25:39] in short: I don't know how to disable it
[13:30:57] hmm
[13:32:56] yeah. We can always hack something into the API but I've been waiting for a better idea to present itself
[13:41:52] andrewbogott: I think I found a "fun" puppet versioning issue which explains the "Cert hostname does not match reported hostname" issue I'm seeing
[13:42:11] so you were able to get an ssh key to work?
[13:42:32] didn't bother trying, I just logged in via the serial console
[13:42:51] is there a default username/password? I thought it was ssh only these days
[13:43:14] btw, tracking task T401584
[13:43:14] T401584: Create debian 13.0 Trixie base images in cloud-vps - https://phabricator.wikimedia.org/T401584
[13:43:30] our cloud-init config logs the console in as root by default?
[13:44:10] oh yeah, it does if it finishes! I must've been trying on one that was stuck halfway through
[13:45:26] so, puppet versioning issue?
[13:45:48] yes, I filed T401586 to track that since I don't think the particular case I found is going to be the only instance of this
[13:45:49] T401586: Fix Puppet version/legacy fact issues with Cloud VPS Trixie image - https://phabricator.wikimedia.org/T401586
[13:46:08] (note that the 'does not match reported hostname' check is something we added so we could potentially just... not check that. But I assume something bad is happening to cause that)
[13:46:16] anyway, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1177399/
[13:46:41] oh nice
[13:47:16] but surely we can have cloud-init downgrade to 7 before we even get that far...
[13:47:19] also for some reason I needed to manually specify `--certname taavirixie-puppet.testlabs.codfw1dev.wikimedia.cloud` for puppet to run properly here, not sure what's going on with that
[13:47:31] it certainly can install custom apt repos although I'm not sure it does that currently
[13:48:02] we could, but I'm hopeful we can just fix our code instead, the base layer needed to set up a VM isn't that big and then puppet can take it from there
[13:48:07] do you need to specify that every time or just the first time? In theory puppet embeds the certname in puppet.conf after the first run... I think...
[13:48:48] not sure, I haven't had a completed run here yet
[13:48:54] are you confident that ${facts['networking']['hostname']} is present with puppet 7? I guess we don't really need it to be present in puppet 5...
[13:48:59] yes
[13:49:16] oh you're fast, I was still testing it
[13:49:22] that works, but brings a new similar issue
[13:50:42] it seemed harmless at worst!
[13:50:57] But, I'm going to grab a bite of breakfast so I won't prematurely merge anything for a few minutes
[13:50:58] that's fair
[13:51:40] meanwhile I'm going to see if I can get this working with a reasonable number of patches or whether we should look at other options
[13:55:36] I did some trixie work in cloud vps a while back, I'm looking for my branch
[13:55:46] part of that was T391083
[13:55:48] T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083
[13:57:25] taavi: want me to see about downgrading the puppet package during setup, or should I leave you to it and work on something else?
[13:58:34] there's a bunch of patches here https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/sandbox/filippo/pontoon-trixie basically to go over structured facts
[13:59:15] some I'm sure are obsolete now
[13:59:28] godog: so you want to fully support the puppet 8 client rather than just not use it?
[14:00:47] I honestly would go for puppet 8 by default if we can
[14:01:09] and the patches I just sent in T401586 seem to make everything compile with no relevant warnings (there's one about hardware but that's not relevant to us)
[14:01:10] T401586: Fix Puppet version/legacy fact issues with Cloud VPS Trixie image - https://phabricator.wikimedia.org/T401586
[14:01:55] andrewbogott: I aimed at supporting 8 as much as I could, with backwards compat as reasonable
[14:02:20] though yes IIRC prod ships puppet 7 with trixie now
[14:02:45] andrewbogott: want to give the puppet image builder script a try? I have those patches cherry-picked locally on the codfw1dev puppetmaster so it should(TM) work fine
[14:03:06] sure, I'll start a new build
[14:03:40] godog: sorry for duplicating your work, didn't know that you'd already looked into this
[14:04:18] taavi: no worries at all, they are all quite simple patches
[14:04:20] taavi: the build is running; in a minute the initial VM will start up in the 'admin' project if you want to watch the log.
[14:08:17] unrelated: I have not yet investigated the kernel errors on the cloudgw hosts; has anyone?
[14:10:37] taavi: did you see/understand these "'/usr/bin/apt-get update ' returned 100 instead of" lines?
[14:11:30] andrewbogott: I can see /something/ went wrong, but the history horizon is giving me doesn't have enough scrollback to see what exactly happened
[14:11:42] yeah, it's infuriating that it truncates the logs
[14:12:07] I think it might be the lack of openstack bpo for trixie
[14:13:06] hmmm, didn't we disable osbpo by default on VMs?
[14:13:41] yeah, we did, in puppet, let me make sure there's not a remnant in the cloud-init setup
[14:14:45] nope
[14:14:55] so it must just be some packages that aren't available in trixie yet
[14:16:39] a missing package wouldn't cause issues with apt update
[14:17:40] yeah, sorry, I switched topics there without telling you :)
[14:17:51] huh?
[14:18:03] * taavi is just even more confused
[14:18:12] Just, there were two different things in the logs, the thing about apt failing and also some issues about missing packages that puppet couldn't install
[14:18:19] I was thinking they were the same issue but I no longer think that
[14:19:35] I think we can ignore those apt things for the moment and still get a mostly-complete base image. But there's another issue with cloud-init output that I need to track down
[14:20:17] a vm without a successful apt update at all would definitely explain the "Unable to locate package" apt errors
[14:20:30] andrewbogott: re: kernelerrors in cloudgw1003, both taavi and myself looked at those, seems like a one-off failure with the NIC. I wouldn't worry if it doesn't happen again.
[14:22:33] ok, thanks for checking
[15:07:24] Raymond_Ndibe: I got an error when testing with the cli changes
[15:07:29] https://www.irccloud.com/pastebin/b18bbm12/
[15:07:32] andrewbogott: anything more I can do to get the trixie image working?
[15:08:08] not yet I don't think. I'm just working on getting wmcs-image-create to notice when cloud-init finishes.
[15:08:14] it's a long wait between tests
[15:29:39] taavi: the base image starts up now, although it's still slow and messy. Example of a new VM is trixietest-1.testlabs.codfw1dev.wikimedia.cloud
[17:11:40] andrewbogott: I replied to T401347 and asked how much quota they need
[17:11:40] T401347: Trove for cluebotng-review? - https://phabricator.wikimedia.org/T401347
[17:11:56] taavi (non urgent) the only remaining issue I see at the moment is with restarting sssd, it seems trixie needs yet a third version of the workaround in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007335
[17:12:06] dhinus: thanks! I was just about to get to that one
[17:12:24] andrewbogott: it doesn't seem urgent, you can leave it for me tomorrow
[17:12:43] sure, we need to wait for the user for now anyway
[17:12:47] I also need to add some docs to wikitech for this type of request, because I don't think we have them
[17:13:58] * dhinus off
[17:21:17] not really off, I got curious about the kernelerrors in cloudcephosd1014. looks like a small blip in the NIC card: "NIC Link is Down" followed by "NIC Link is Up"
[17:21:42] T401615
[17:21:42] T401615: KernelErrors Server cloudcephosd1014 logged kernel errors - https://phabricator.wikimedia.org/T401615
[17:22:11] ceph health is HEALTH_OK so I think nothing to worry about
[17:22:43] * dhinus off for real :)
[17:34:06] * dcaro off
[17:34:08] cya!
[18:07:08] !topic ☁️ https://etherpad.wikimedia.org/p/WMCS-2025-08-14 | Channel is logged at https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud-admin/ | ping cteam | clinic duty: komla