[00:46:11] * bd808 off
[09:01:28] morning
[09:08:07] o/
[09:10:14] morning!
[09:24:45] * arturo has laptop issue
[09:25:45] morning
[09:45:47] side note: cloudcumin servers could probably benefit from having an interface in the cloud-private subnet
[09:50:16] why?
[09:50:31] so all cloud traffic circulates on cloud-private
[09:53:27] moreover, if all hardware servers are headed into the cloud-private setup, then we could even evaluate having cloudcumins be normal VMs on CloudVPS instead of ganeti. Are the prod secrets on them?
[09:56:19] yes, ssh keys that can access prod servers and VMs
[09:56:31] plus they can talk to prod-network-only things like prod alertmanager
[09:56:50] ok
[09:57:18] plus chicken-and-egg problems apply if we need cloudcumins to fix some outage
[09:57:40] ok, then cloud-private addition only
[09:58:18] hmm
[09:58:59] we do ssh now, and ssh is bound on some servers (like cloudvirts, which don't have a host-level firewall) to the cloud-hosts interface only
[09:59:26] do they talk other protocols to hosts with a leg in cloud-private?
[10:00:38] I was thinking about this action, for example:
[10:00:40] sudo cumin 'O{project:toolsbeta name:toolsbeta-test-k8s-*}' 'apt-get update'
[10:00:41] also, I wonder if it's even possible atm to add a cloud-private leg on cloudcumins as they are ganeti VMs
[10:01:51] that should be pure cloud traffic, but we may need to add cloud-private to cloudgw servers as well (I don't remember if they have it already)
[10:02:12] anything, this sounds like a pedantic optimization at this point
[10:09:04] s/anything/anyway/g
[10:18:12] could you please approve this plan? https://phabricator.wikimedia.org/T356507#9516468
[10:19:51] your previous comment says "affected packages are `runc` and `containerd`", but your command is updating `containerd.io` instead of `containerd`, is that expected?
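The update plan under discussion boils down to "is the installed runc older than the first fixed version?". A minimal sketch of that check, assuming Debian-style version strings; both version strings below are illustrative, not taken from the archive:

```shell
# Hypothetical check behind the update plan: compare a host's installed
# runc version against the first fixed version. Values are made up.
installed='1.1.5+ds1-1'
fixed='1.1.12+ds1-1'
# sort -V orders version strings numerically per component; if the
# installed version sorts first (and differs), the host needs the update.
oldest=$(printf '%s\n%s\n' "$installed" "$fixed" | sort -V | head -n1)
if [ "$oldest" = "$installed" ] && [ "$installed" != "$fixed" ]; then
    echo "update needed"
fi
```

On a real Debian host, `dpkg --compare-versions` would be the more precise tool, since it implements the full Debian version-ordering rules rather than `sort -V`'s approximation.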
[10:26:12] yes, good question
[10:26:32] so, in bookworm, the CVE does not affect the `containerd` package in the official archive, only `runc`
[10:26:47] so I'm not updating `containerd` from the official archive
[10:26:57] but I'm updating `containerd.io` from our reprepro
[10:27:14] runc update --> bookworm
[10:27:21] containerd.io update --> buster
[10:27:45] more info: https://security-tracker.debian.org/tracker/CVE-2024-21626
[10:30:09] hmmm
[10:30:35] containerd in bookworm has a `Built-Using: runc (= 1.1.5+ds1-1)` which is the vulnerable version
[10:31:18] so the vulnerable code is not in the part of runc that containerd uses as a go library, but in the part that it runs as a subprocess?
[10:31:49] according to https://tracker.debian.org/media/packages/c/containerd/control-1.6.20ds1-1 this is just a runtime dependency?
[10:33:33] Source: containerd has Build-Depends: golang-github-opencontainers-runc-dev, which is a binary package provided by Source: runc
[10:34:34] dcaro: Raymond_Ndibe: can either of you review https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/28 to unbreak CI workflows?
[10:34:40] moritzm: could you please advise here for a moment?
[10:39:51] containerd has a runtime dependency on runc, it doesn't need to be rebuilt; the Built-Using is mostly for GPL compliance, to indicate what headers were used during the build
[10:40:06] so we only need to update runc on bullseye/bookworm
[10:40:26] ok, thanks
[10:40:26] not sure about the imported third-party stuff for buster, but I can have a closer look
[10:40:52] ok, thanks
[10:41:03] arturo: then your plan seems fine
[10:41:23] thanks for the double check
[10:41:27] I'll proceed now
[11:13:34] moving now into toolforge
[11:20:32] taavi: could you please remind me what was the gitlab script that replicated gerrit-style workflows?
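The Built-Using vs. Depends distinction above can be checked directly from package metadata. A sketch of that check, using an inline stand-in for `apt-cache show containerd` output (the field values are illustrative, modelled on the version quoted in the chat):

```shell
# Sketch: distinguish the build-time record from the runtime dependency.
# Built-Using says which runc source was present at build time (mostly for
# GPL compliance); Depends is the runtime dependency that matters for the
# CVE. The control text is a made-up stand-in for real apt-cache output.
control='Package: containerd
Version: 1.6.20~ds1-1
Depends: runc (>= 1.0.0~rc8~), libc6
Built-Using: runc (= 1.1.5+ds1-1)'

built_using=$(printf '%s\n' "$control" | sed -n 's/^Built-Using: //p')
runtime_dep=$(printf '%s\n' "$control" | sed -n 's/^Depends: //p')
echo "build-time: $built_using"
echo "runtime:    $runtime_dep"
```

Since containerd only invokes runc as a subprocess at runtime, updating the `runc` binary package fixes the vulnerability without rebuilding containerd, which is the conclusion reached in the chat.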
[11:20:49] arturo: https://github.com/yaoyuannnn/gerritlab
[11:28:04] thanks
[11:40:24] keep in mind https://phabricator.wikimedia.org/T353740: for every MR there will be a test run + image created + push to harbor, so if you don't plan on merging the MRs at different times (thus the commits/MRs are bound together), consider using multi-commit MRs
[11:40:40] you can still review single commits on the same MR
[12:04:52] FYI the reboot script is now rebooting NFS-enabled workers in tools
[12:05:22] all of them are NFS-enabled, it's just that the new ones have that in the name so we can eventually add non-NFS-enabled workers
[12:05:38] oh, ok :-)
[12:05:46] * dcaro lunch
[12:17:10] I need to step out and stretch my legs; the reboot script is working just fine, now rebooting worker -89 (descending order)
[12:17:27] I'll leave it in a screen session in cloudcumin1001 in case you need to take over
[12:17:31] * arturo be back in a bit
[13:09:26] ok, toolforge k8s fully rebooted
[13:23:15] 🎉
[13:27:42] I have expanded the documentation for the various metricsinfra services: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring
[13:37:23] thanks!
[13:38:33] please let me know if anything is missing, unclear, etc
[13:40:10] thanks!
[13:41:59] * arturo just discovered `pipx`
[13:42:02] taavi: "Ask Taavi if unsure." should probably be 'ask WMCS'
[13:42:51] dcaro: i mean, the documentation is targeted towards WMCS so that's not very helpful for them. but yes, not ideal.
[14:48:18] * arturo out for a bit
[17:22:30] dcaro, regarding the slow api alerts, do you know offhand what api check it's measuring? Is it averaging all calls together?
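The multi-commit MR workflow suggested above can be sketched as follows: stack the related commits on one branch so they travel in a single MR (one CI run, one image pushed to harbor) while still being reviewable one commit at a time. This is demonstrated in a throwaway local repo; all branch and file names are made up:

```shell
# Minimal sketch: several commits on one branch, destined for one MR.
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main repo      # -b needs git >= 2.28
cd repo
git config user.email 'you@example.com'
git config user.name 'You'
echo one > first.txt  && git add first.txt  && git commit -qm 'first change'
echo two > second.txt && git add second.txt && git commit -qm 'follow-up change'
# Both commits go into a single MR when the branch is pushed; reviewers
# can still step through them individually in the GitLab UI.
git log --oneline | wc -l
```

Pushing this branch to GitLab and opening one MR from it gives the single-pipeline behaviour described in the chat, as opposed to one pipeline (and one harbor push) per MR.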
[17:23:24] essentially this graph: https://grafana-rw.wikimedia.org/d/UUmLqqX4k/wmcs-openstack-api-latency?forceLogin&forceLogin&orgId=1&refresh=30s&var-cloudcontrol=cloudlb1001&var-cloudcontrol=cloudlb1002&from=now-30d&to=now&var-backend=nova-api_backend
[17:23:24] It's splitting per-api + per-lb + per-backend (as in, you get nova-api|cloudlb1001|cloudcontrol1005 as a single alert)
[17:23:36] ok
[17:23:41] So it does include rabbit round-trips then
[17:23:43] but for each api (nova-api) on a cloudcontrol it aggregates all the calls
[17:23:54] (no split by path)
[17:24:11] I was just thinking, if it was just a health check then rabbit wouldn't be counted
[17:24:53] ok, let's see if I can change a single number in the rabbitmq config without breaking everything...
[17:37:33] andrewbogott: gtg, let me know how it goes, I'll keep an eye tomorrow
[17:37:38] ok!
[18:44:57] * bd808 lunch
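A toy illustration (with made-up numbers) of why the aggregation described above matters: because the latency is averaged over every call to a backend with no split by request path, a single slow call that waits on a RabbitMQ round-trip drags the mean up even when cheap health-check-style requests stay fast.

```shell
# Hypothetical request log: path and latency in seconds. One slow POST
# (waiting on a rabbit round-trip) dominates the per-backend mean.
printf '%s\n' \
  'GET /servers 0.05' \
  'GET / 0.01' \
  'POST /servers 2.50' \
| awk '{sum += $3; n++} END {printf "mean latency: %.2fs\n", sum / n}'
# prints "mean latency: 0.85s"
```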