[01:42:12] * bd808 off
[07:47:18] good morning
[08:48:28] morning
[08:56:12] morning
[09:21:51] dcaro: is it expected that the rust buildpack cannot be used when building an image locally?
[09:26:16] taavi: I'm about to run the cookbook to introduce a new control node in toolforge
[09:26:48] ok, what exact command are you using?
[09:27:12] crafting that ATM
[09:28:25] taavi:
[09:28:26] aborrero@cloudcumin1001:~ $ sudo cookbook wmcs.toolforge.add_k8s_node --cluster-name tools --task-id T284656 --role control --image debian-12.0-bookworm
[09:28:26] T284656: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656
[09:29:01] arturo: LGTM. note that you do not need to explicitly specify an image as we're using the same image as the last one
[09:29:16] ok, I wasn't sure, but it makes sense
[09:29:32] running it
[09:35:31] today I learned that you can show the console log for VMs in the CLI, which we usually query via horizon:
[09:35:32] sudo wmcs-openstack console log show 1d0839d5-0a11-4ed2-83b8-b0c15c62a5ae
[09:36:00] oh nice
[09:42:48] so restarting kubelet does not kill the running static pods, because containerd runs as a separate process, not a child
[09:43:00] https://www.irccloud.com/pastebin/gzhQwWrM/
[09:53:23] taavi: I'm observing similar errors on the kube-system pods on the new control node
[09:53:37] I already restarted the static pods, but I don't remember what else we did yesterday?
[09:55:54] we also rebooted the nodes
[09:56:22] ok
[10:01:18] if I recall correctly from when I last looked into this, the issue had something to do with the controller-manager and scheduler starting too soon, when the apiserver had not booted yet. so resetting only those two static pods should fix it
[10:03:41] yeah, that was the fix!
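The race described above (controller-manager and scheduler coming up before the apiserver is reachable) is a generic ordering problem. A minimal sketch of a readiness poll in Python; the function name and the use of a plain TCP check are illustrative assumptions, not what kubelet or the static pods actually do:

```python
import socket
import time

def wait_for_api(host: str, port: int, timeout: float = 60.0) -> bool:
    """Poll a TCP port until it accepts connections, as a stand-in for
    'do not (re)start dependent components until the apiserver is up'."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return True  # port is accepting connections
        except OSError:
            time.sleep(0.5)  # not up yet, retry
    return False
```

On the real control node the equivalent manual fix was to recreate only the controller-manager and scheduler static pods once the apiserver was answering.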
[10:03:44] taavi: hey, sorry, in a meeting
[10:05:51] let me check, I think the original buildpack (heroku-based) does not support the latest buildpack API (0.10), so trying to run locally might not work, yes
[10:06:09] we can try to publish our version with 0.10 support so people can use it locally
[10:08:49] I was going based on just https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Testing_locally_(optional), which was giving an 'ERROR: failed to detect: no buildpacks participating'
[10:09:11] trying with `--buildpack emk/rust` gives a 'no such file or directory', which is probably what you mention
[10:09:23] but it'd be great if the detection logic was the same locally and in the buildservice
[10:10:39] yep, it's not though, we use the `inject_buildpacks` step that adds the extra buildpacks to the builder image, so just using the image does not work
[10:11:13] for that we would have to publish all the custom buildpacks (with the 0.10 API fixes) in a 'buildpack format' (image layers), and rebuild the builder image adding the buildpacks directly there
[10:11:22] (that is doable, but it's more work to maintain)
[10:11:41] we might go that route in the future, as it still has the benefit of being able to build locally
[10:40:06] dcaro: need help with T358194? I think I spotted at least one issue
[10:40:07] T358194: [jobs-api] Getting errors when listing jobs - https://phabricator.wikimedia.org/T358194
[10:40:36] taavi: sure, I'm working on the empty logline parsing (I think that's one issue, but not the one you get when listing jobs)
[10:41:03] ok, I found the reason for the job listing issue, https://phabricator.wikimedia.org/P57691
[10:41:05] so there's at least the other issue; in the task I correlated the api log with the cli list, but I think that's not correct, that api log is from the `logs` cli
[10:41:23] the extra fields?
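Two client-side failure modes come up in this exchange: log payloads containing empty lines, and API responses growing fields the client does not know about. A hedged sketch of both defenses; the names (`parse_log_lines`, `Job`, `job_from_api`) are invented for illustration, not the actual toolforge-weld or jobs-api code:

```python
from dataclasses import dataclass, fields
from datetime import datetime

def parse_log_lines(raw: str) -> list[tuple[datetime, str]]:
    """Split a raw log payload into (timestamp, message) pairs, skipping
    empty lines so trailing newlines do not crash the parser."""
    entries = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # empty logline: nothing to parse
        ts, _, msg = line.partition(" ")
        entries.append((datetime.fromisoformat(ts), msg))
    return entries

@dataclass
class Job:
    name: str
    image: str

def job_from_api(data: dict) -> Job:
    """Build a Job from an API response, dropping unknown keys instead of
    crashing when the server starts returning extra fields."""
    known = {f.name for f in fields(Job)}
    return Job(**{k: v for k, v in data.items() if k in known})
```

The first function tolerates malformed input by skipping it; the second tolerates a newer server by ignoring what it does not understand, which keeps old clients working across API additions.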
[10:41:31] one of those nodes is not NFS :-)
[10:41:52] * arturo back in a bit
[10:42:16] ooohh
[10:46:05] https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/62
[10:46:59] it might be the same issue, as in it might be getting the logs empty because it can't get them, otherwise it would get them as it expects (formatted with the date)
[10:47:41] I kind of doubt that, but we can see if it's still broken after deploying that
[10:51:22] this should help for the logs, though I'm still trying to find out the root cause: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/42
[10:53:26] fix is live.. is the logs endpoint still failing?
[10:54:00] nope
[10:54:19] >10 in a row without errors
[10:54:23] > 10
[10:54:56] interesting
[10:55:19] yep, no failures. if the issue was nfs, any action "as the user" would have failed, right? that would include getting the logs
[10:55:25] yep
[10:56:17] now that I think of it, it's interesting that it has never been affected by NFS issues
[10:56:35] (the access is quite sporadic and short though, so maybe just lucky)
[10:56:53] and it does not write
[10:57:52] i think we should eventually move jobs-api to use kubernetes impersonation, but that's something for another day
[10:58:33] we use a custom service account for the other apis
[10:58:38] wouldn't that be enough?
[10:59:46] something like that, we just need to make sure it has the same permissions as the normal tool credentials do
[11:00:37] ack
[11:01:59] * dcaro brb
[11:23:44] why is the k8s-api-server logging in kind of "trace" mode?
[11:42:33] question, with the runtime_description() property in the cookbooks, are we supposed to never call the sallogger directly anymore? I don't see entries reaching the -feed channel anyway?
[11:47:04] i.e: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1005729
[11:50:31] `runtime_description` is preferred whenever possible to reduce the number of SAL entries per cookbook, and when that's not possible, spicerack's native `sal_logger` should be used instead of the custom class in `wmcs_libs`
[11:54:06] ok, then I guess the patch is OK. Please approve.
[11:57:25] +1
[12:02:20] thanks!
[12:02:27] update on the ceph mystery from yesterday: https://phabricator.wikimedia.org/T358101#9567275
[12:03:49] TL;DR: harmless puppet race condition on the first run
[12:32:03] please approve: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005733
[12:32:54] can you run a pcc on cloudvirt?
[12:33:05] oh, good catch
[12:35:32] compiles just fine
[12:36:04] FYI I'm going to reimage cloudvirt1034 shortly
[12:36:48] sorry, one more thing, does it compile on a cloudvirtlocal node too?
[12:37:02] let me check
[12:42:52] taavi: done. ceph.conf is still deployed via profile::cloudceph::client::rbd_libvirt
[12:44:02] ok. +1'd
[12:46:57] thanks
[12:54:29] when draining a cloudvirt, what do we do with VMs in SUSPENDED state?
[12:54:33] the script can't handle them
[12:57:03] uh
[12:57:45] I will resume + shutoff so I can proceed with the maintenance, but we should maybe think of a proper procedure
[12:59:52] quick +1 here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005750
[13:00:22] done
[13:00:26] thanks
[14:05:26] arturo: I think this https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005733 has made CI sad :-(
[14:05:50] slyngs: don you have an example?
[14:05:54] do*
[14:06:23] Right, sorry, https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/79838/console
[14:07:15] looking
[14:08:14] Thanks
[14:08:23] https://www.irccloud.com/pastebin/fB9UlmpU/
[14:10:03] blancadesal: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1005766
[14:10:34] slyngs: I'm very confused. The error seems legit, but I would have expected this to show up when the original change was created in gerrit
[14:12:05] Doesn't it only fail once the changes have been applied by Puppet? so it would work fine on its own, then fail after the next update?
[14:12:37] work fine on the CI check on its own, then fail after the next update
[14:14:48] what's the point of the unit tests if they only surface problems after the fact?
[14:16:13] I'm writing a fix
[14:19:03] Thank you. I'm not sure if there's a good way of ensuring that this won't happen using tests. PCC would have found it, I believe, but only if targeted at the right hosts
[14:19:30] I can bring it up at the puppet office hours in 40 minutes
[14:21:07] this is a spec test failure, no?
[14:21:45] I'm not actually entirely sure
[14:23:12] I believe it's a missing dependency in the spec test suite
[14:30:34] slyngs: I believe this is the fix https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005769 please +1
[14:33:22] done
[14:35:01] thanks, and sorry
[14:35:24] No need to be :-)
[14:35:27] this should definitely have been detected by the CI earlier
[14:35:51] when the class was changed
[14:36:04] Yeah, it gives a false sense of security.
[14:36:31] * arturo food time
[16:34:50] arturo: hey, you reimaged cloudvirt1034 today, didn't you?
[16:35:01] topranks: yes I did
[16:35:05] it doesn't seem to have caused the same problem with the puppetdb -> netbox import script?
[16:35:10] no!
[16:35:16] it went smooth
[16:35:18] hmm ok
[16:36:15] yeah, some race condition with the interfaces getting changed I think, some odd state in puppetdb where maybe a 'parent' of a child interface is missing or something
[16:36:28] we'll see how it goes, glad today's was smooth anyway!
[17:09:09] * arturo offline
[17:12:15] following the discussion in today's team meeting, I created T358251
[17:12:15] T358251: Test using phabricator-maintenance-bot to sync wmcs-related boards - https://phabricator.wikimedia.org/T358251
[17:12:45] I have a few other ideas that I will try to track in that task or in a separate one
[17:52:21] * dcaro off
[18:48:05] * bd808 lunch
[22:09:42] * bd808 walk