[07:15:44] moritzm: cloud vps is now fully off of buster!
[07:24:40] excellent \o/
[07:27:05] nice!
[07:27:08] greetings
[07:29:39] can someone explain the CI failure here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188828
[07:29:54] I'd like to test changes to wmcs-cookbooks.git and specifically the nfs cookbooks, is there a way I can iterate on the code "live"? I know test-cookbook though that takes a gerrit change
[07:30:49] godog: you can run the cookbooks locally from your laptop, the README file has instructions iirc
[07:31:48] taavi: thank you! will take a look
[07:32:01] taavi: re: ci failure I bet that's "wmet" in 'typos' file
[07:32:25] can be ignored IMHO
[07:33:22] is there a magic comment or something to ignore those? or do you mean can be force merged?
[07:33:45] I was thinking force merged, I'm not aware of comments to ignore typos
[07:34:29] * taavi does
[07:36:07] there's another 3 hits for allowmethods in puppet.git FWIW
[07:59:37] morning
[08:31:44] mmhh I'm trying the README.md instructions on wmcs-cookbooks to test/run cookbooks locally and the 'cookbook' binary is not available or installed in the venv, has anyone run into the same?
[08:33:31] let me check, it's been a while since I followed those
[08:33:50] thank you, there's a missing 'pip install setuptools' for which I'll follow up with a patch
[08:34:13] in the meantime I'll reboot the stuck nfs workers ;_;
[08:34:53] huh. setuptools is not installed by venv creation?
[08:35:09] not on trixie / python3.13 at least
[08:35:13] btw. should I reboot workers when I see them?
[08:35:22] yes please, thank you
[08:35:43] there's nothing more to be done on my end other than upgrade the nfs server to trixie at this point
[08:37:24] I think it's https://docs.python.org/3/whatsnew/3.10.html#distutils
[08:37:39] ah yeah totally
[08:37:48] "The entire distutils package is deprecated, to be removed in Python 3.12."
[08:40:50] I think that 'python setup.py install' does not pull the right deps
[08:41:04] `pip install -e .` seems to do the trick though
[08:41:17] https://www.irccloud.com/pastebin/2tUsWjPa/
[08:41:36] neat, yes that's it
[08:41:41] I'll send a patch your way dcaro
[08:41:49] 👍
[08:43:29] https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1189438
[08:46:06] +1d
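For reference, the local setup worked out above boils down to roughly the following sketch. Only the setuptools gap and the `pip install -e .` fix come from the conversation; the clone URL, venv name, and the detail that spicerack provides the `cookbook` entry point are assumptions:

    # clone the cookbooks repo and create a fresh venv (assumed layout)
    git clone https://gerrit.wikimedia.org/r/cloud/wmcs-cookbooks
    cd wmcs-cookbooks
    python3 -m venv .venv
    . .venv/bin/activate

    # venvs on python >= 3.12 no longer bundle setuptools, so add it explicitly
    pip install setuptools

    # an editable install pulls in the dependencies that
    # 'python setup.py install' was missing
    pip install -e .

    # the 'cookbook' entry point should now be on $PATH inside the venv
    cookbook --help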
[09:04:20] hmm... it seems we are actually oversubscribing the cpu by a lot in the tools k8s cluster
[09:04:33] (as in, we are reserving a lot of cpu that never gets used)
[09:04:50] https://usercontent.irccloud-cdn.com/file/9pewFMaC/image.png
[09:05:15] just added those to https://grafana-rw.wmcloud.org/d/8GiwHDL4k/infra-kubernetes-cluster-overview
[09:06:09] the left is the allocations/reservations, the right the actual cpu/mem usage
[09:09:50] neat
[09:12:17] now added the limits/requests distinction to the first graph
[09:25:17] dcaro: lol we arrived at the same idea re: memorymax
[09:26:29] xd
[09:29:59] shoot, messed up the `Hosts` header and it's running pcc for all
[09:30:25] it seems to have found an error though `[ 2025-09-18T09:29:52 ] ERROR: Compilation failed for hostname alert2002.wikimedia.org in environment prod.`
[09:36:12] I'd like to test changes to OpenstackAPI since I'm adding a bunch of methods, is there a (recommended?) way to get a wmcs-cookbooks spicerack shell? similar or the same as spicerack-shell in production I guess
[09:37:40] I never used the spicerack shell, so no idea
[09:38:01] to test code, usually I develop locally
[09:38:12] or if I already have a patch, by using test-cookbook on cloudcumin1001
[09:38:34] what does the spicerack shell do?
[09:38:59] drops you into a python repl with a spicerack instance available, though I've never used it myself
[09:39:20] I'm at the patch stage now, I'd like to verify the methods I'm adding to OpenstackAPI actually do what I expect
[09:40:07] say for example this https://phabricator.wikimedia.org/P83423
[09:40:16] hmm... I usually add a `pdb.set_trace()` and run a silly cookbook locally
[09:40:50] ok thank you yeah I'll try that
[09:41:33] there's a few cookbooks for openstack iirc that don't really change anything, those might be good places to use for it (so if you hit 'c' without noticing it does not change stuff)
[09:42:27] wmcs.openstack.cloudnet.show might be one of those
[09:42:56] ack will start from there
[09:42:56] that might be the only one actually xd
[09:47:03] lima (the tool behind lima-kilo) might move to 'incubation' under cncf: https://github.com/cncf/toc/issues/1348
[09:54:25] this is ready for review I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/1189439, the changes to pcc seem unrelated (only to the probes, looks like ip resolving differences)
[09:56:01] no pcc for the toolforge prometheus server itself?
[09:56:35] I cherry-picked it :)
[09:56:43] does it work nowadays?
[09:57:56] (as in manually cherry-picked in puppetserver and ran puppet on toolsbeta-prometheus-2)
[10:00:26] dcaro: what's the link to pcc? is it a noop in production?
[10:01:43] godog: this is the old one (not including the tools/toolsbeta prometheus) https://puppet-compiler.wmflabs.org/output/1189439/7510/
[10:02:13] and the new one just passed now :/
[10:02:29] and it's cleaner https://puppet-compiler.wmflabs.org/output/1189439/7512/
[10:02:38] yeah LGTM
[10:02:42] (no code changes, so something happened on pcc side)
[10:03:16] it is fine as long as production is a noop so prometheus doesn't get restarted
[10:04:09] ack
[10:10:30] * dcaro lunch 🍝
[11:29:33] hmpf... tools-prometheus-9 is down again, I might have killed it before the patch applied, looking
[11:34:43] yep, puppet had not run yet
[13:33:47] btw. I use this script locally to run cookbooks, I find it very useful for matching them by name https://gitlab.wikimedia.org/-/snippets/254
[13:33:59] so I write `wmcs-cookbooks vm_console ...` and it works
[13:34:14] (never remember the full path)
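The snippet itself isn't quoted in the channel, but a wrapper in that spirit might look like the sketch below. The checkout location, the `cookbooks/` layout, and the name resolution are assumptions for illustration, not details taken from snippet 254:

    #!/usr/bin/env bash
    # wmcs-cookbooks: run a cookbook by a short name instead of the full
    # dotted path, e.g. 'wmcs-cookbooks vm_console ...'
    set -euo pipefail
    repo="${WMCS_COOKBOOKS_DIR:-$HOME/wmcs-cookbooks}"  # assumed checkout location
    name="$1"; shift
    # locate the first cookbook module whose filename matches the short name
    match="$(find "$repo/cookbooks" -name "*${name}*.py" | sort | head -n1)"
    if [ -z "$match" ]; then
        echo "no cookbook matching '${name}' found" >&2
        exit 1
    fi
    # convert the file path into the dotted name the 'cookbook' runner expects;
    # assumes your cookbook config already points at this checkout
    dotted="$(realpath --relative-to="$repo/cookbooks" "$match")"
    dotted="${dotted%.py}"
    exec cookbook "${dotted//\//.}" "$@"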
[13:47:15] anyone around who can review and +1 if ok with this https://phabricator.wikimedia.org/T404668 ?
[13:53:36] Hi. I'm running a decom and I have an unexpected DNS change about `cloudcephosd1021.private.eqiad` being removed. Is this OK to commit?
[13:55:22] hmm... that node is still in the cluster and working, maybe andrewbogott or taavi know better? in the meantime let me look around
[13:55:51] https://usercontent.irccloud-cdn.com/file/Wdpm3d88/image.png
[13:58:14] it was reimaged yesterday https://sal.toolforge.org/production?p=0&q=cloudcephosd1021.private.eqiad&d=
[13:59:29] I think it's ok though, iirc ceph nodes don't need that private ip, we were adding it for a bit, but newer nodes don't have it https://phabricator.wikimedia.org/source/netbox-exported-dns/browse/master/wikimedia.cloud-eqiad;4dba345d096b26a3ce90e7b6e35565efa5555a3e
[13:59:47] it's not configured in the node either
[14:00:08] btullis: +1 to go ahead
[14:00:43] Great, thanks.
[14:56:00] andrewbogott: ^ fyi. I hope I did not mess up xd
[14:56:48] yeah, removing that is "fine"
[14:57:19] dcaro: you were right. There are likely to be more of those from future reimages... leftovers from a semi-attempt to move ceph to a new network that was never properly started.
[15:34:01] guys you might see some alerts from codfw, cr1-codfw is rebooting
[15:47:51] thanks for the warning topranks
[15:48:10] it's back up now fyi
[15:49:56] andrewbogott: I see this in the codfw1dev horizon logs when I try to log in: "ModuleNotFoundError: No module named 'openstack_auth.plugin.wmtotp'"
[15:50:40] oh, that's a config file thing, I thought I fixed that
[15:54:49] I see the issue; it only happens on login so I didn't hit the bug on account of having a session already.
[15:54:57] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1189526
[16:00:08] taavi: try again?
[16:01:53] andrewbogott: seems to at least let me in now
[16:02:42] great, now you can find the next crash
[16:07:12] * taavi continues to click around
[16:09:50] andrewbogott: do you think you have time for T404862? I don't even see an octavia policy file in puppet so not sure where to start
[16:09:51] T404862: Allow novaobserver to read Octavia data - https://phabricator.wikimedia.org/T404862
[16:10:40] yes -- we don't have any overrides currently so there's no need for a policy file but it's easy to add.
[16:10:44] go ahead and assign to me
[16:24:00] I'm thinking of completely dropping the cpu limits for user pods, wdyt?
[16:24:33] in theory, under load, the workers will allocate cpus according to the relative request values, but will allow the jobs to use as much cpu as is available
[16:25:03] I would like to see that theory validated first
[16:25:13] hahaahah ack
[16:25:39] I can try to run an experiment yep
[16:27:54] note that this is for cpu only, memory behaves differently
[16:53:11] this does a first reduction of the requests for default values https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/215
[17:15:45] * dcaro off
[17:19:13] fyi. added some notes on the cpu/memory requests/limits thingie here T404726, will continue investigating next week
[17:19:13] T404726: [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726
[17:56:27] andrewbogott: T405017 unfortunately those were dist-upgraded in https://phabricator.wikimedia.org/T367546
[17:56:28] T405017: Buster VMs in cloud-vps PKI project - https://phabricator.wikimedia.org/T405017
[17:56:54] bah, well maybe we can still delete them :)
[23:47:44] have we had a change on the toolforge bastion? I'm getting this
[23:47:53] The fingerprint for the ED25519 key sent by the remote host is
[23:47:53] SHA256:0i1eqK9uOYmCjOe5a0oAWTmnEPUh0b7h2Flm1IDl0sg.
[23:48:22] I might have missed some announcements (my apologies)
[23:58:06] ja
[23:58:07] https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/I4M335NMS6CT23AT23P5PL4N3NUI2YMT/
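Since the key change was announced, recovering amounts to something like the sketch below; the bastion hostname login.toolforge.org is an assumption here, and the fingerprint ssh presents should be compared against the one in the cloud-announce mail before accepting:

    # drop the stale host key for the bastion from ~/.ssh/known_hosts
    ssh-keygen -R login.toolforge.org

    # reconnect; only type 'yes' if the fingerprint shown matches the
    # one published in the announcement
    ssh login.toolforge.org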