[06:39:17] greetings
[07:36:16] I'm reading up on toolforge services wrt the POC and was wondering: I see docker-registry and harbor mentioned, are both used atm? my understanding is that harbor is a docker registry
[07:55:31] docker-registry is older, and currently hosts the basic images to bootstrap k8s (maybe not all-all, it's been a while since we double checked), and the pre-built images because those were built before harbor was a thing. Then harbor hosts the custom components (jobs-api, builds-api, ...) and the user-built images
[07:55:56] (harbor was preferable for users to have multitenancy, which docker-registry does not provide)
[07:56:17] got it, thank you dcaro ! appreciate it
[07:56:28] btw. got paged for a second there about nfs, all good right?
[07:57:19] yes, sorry about that dcaro, my bad, I didn't realize a quick reboot/resize pages
[07:57:40] np, it will depend on whether tools-checker catches it on the fly or not :)
[07:58:02] heheh true
[07:58:40] hah toolschecker is on icinga, I just saw
[07:58:44] on -cloud-feed that is
[08:26:45] yep, there's a task somewhere to move it to prometheus somehow
[08:26:59] quick review, adding logs-api to prometheus targets https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197587
[08:28:43] +1
[08:31:52] thanks
[08:32:42] opened T407837 to track that metricsinfra-thanos-fe puppet failure, seemingly there's been a refactor of the puppet code that didn't take into account that we're also using that class
[08:32:43] T407837: metricsinfra-thanos-fe-2 Puppet failure - https://phabricator.wikimedia.org/T407837
[08:34:54] ack
[08:48:04] morning
[08:51:28] I'm confident enough to nuke tools-nfs-2 at this point, objections? I'll shut it down first and nuke it later in the week
[08:52:32] +1 for shutting down
[08:52:56] cheers
[08:54:40] only if you do it with --on-fire
[08:54:44] +1 for shutting it down too
[08:55:06] huh, do we have an actual shredding service in the dc?
[08:55:18] for disks, yes
[08:55:36] ^.^
[08:56:00] https://drive.google.com/drive/folders/1VQbm1IwDUn-zk22TGa3yf6tlETqdXjg5
[08:56:03] ;)
[08:56:09] an example from the past
[08:56:56] oddly satisfying
[08:57:41] iirc it is like a van that shows up
[08:57:54] yes I think so
[09:01:44] * dcaro starting to imagine them in suits with sunglasses and a single earpiece, "flashing your hard drives 24/7"
[09:02:29] lol
[09:03:09] haha!
[09:21:44] * godog errand and lunch
[10:04:49] hmm, loki is not being scraped correctly, it's failing to fetch stats
[10:04:52] looking
[10:21:51] hmpf. it does not seem to be so simple, as we have the network policies blocking things
[10:26:36] godog: forgot, harbor also hosts the charts we generate, not only images
[10:44:01] maybe we can fix it with the same thing I'm doing for loki-tracing? (still WIP but making progress)
[10:50:37] maybe, looking at whether it's just a missing network policy or something
[10:53:00] hmm, it works on lima-kilo
[10:53:05] dcaro@toolslocal$ curl --cert k8s.cert --key k8s.key --insecure https://127.0.0.1:6443/api/v1/namespaces/loki/pods/loki-tools-0:3100/proxy/metrics
[10:53:35] maybe because the api service is running on the same node?
[10:55:47] but I can reproduce in toolsbeta :), that gives me a safe playground
[10:55:53] :D
[10:56:16] anyhow, after lunch stuff 👍
[10:56:18] * dcaro lunch
[11:28:11] while playing with loki-tracing I encountered the following error: "admission webhook "registry-admission.tools.wmcloud.org" denied the request: The following container images ....", for which I *think* I need this MR: https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/31
[11:28:35] if anyone could have a look and tell me if it makes any sense or I'm doing something else wrong
[11:28:40] * volans lunch
[11:32:02] volans: uh, what is it trying to pull from an external registry?
[11:34:58] Image=docker.io/nginxinc/nginx-unprivileged:1.28-alpine when I enabled the gateway via config (ony for the -tracing one, not for the -tools one) to expose it outside k8s
[11:35:06] *only
[11:36:18] config in https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/b59c626362599a1815b53a0f8dfc92f81bc752f4/components/logging/values/loki-tracing/common.yaml.gotmpl#L3
[11:37:05] * volans lunch for real now :D
[11:44:45] iirc we already have a nginx image in our registry, we should use that instead of relying on docker.io
[12:13:14] that would be great, but how to do that while using the upstream chart?
[12:13:31] I just enabled the gateway in loki's config
[12:13:44] let me check if that's configurable
[12:14:07] https://artifacthub.io/packages/helm/grafana/loki?modal=values&path=gateway.image.registry says by using gateway.image.* values
[12:15:56] * volans looking at:
[12:15:57] https://github.com/grafana/loki/blob/main/production/helm/loki/templates/gateway/deployment-gateway-nginx.yaml#L59
[12:17:45] taavi: do you have handy the address I should use for our image by any chance?
[12:18:29] https://docker-registry.toolforge.org/#!/taglist/nginx
[12:19:31] thx
[12:51:54] https://www.irccloud.com/pastebin/CqeJVs92/ [12:52:50] and it's natting from one network to the other [12:53:44] https://www.irccloud.com/pastebin/gWOKdtVx/ [13:00:34] opened T407852 to keep track [13:00:35] T407852: [infra,logging] prometheus failing to fetch the metrics endpoint - https://phabricator.wikimedia.org/T407852 [13:15:37] dcaro: ack, thank you re: harbor [13:42:04] looking for reviews of https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/94 and 95 [14:01:28] LGTM [14:04:38] I'm looking at putting more weight on cloudcephosd1050 and then 1051, my understanding is that this is the correct thing to do? cookbook wmcs.ceph.osd.undrain_node --cluster-name eqiad1 --osd-hostname cloudcephosd1050 --task -id T405478 [14:04:39] T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478 [14:04:48] --task-id not --task -id, you get the idea [14:05:08] to undrain 2 osd, wait, then repeat [14:14:59] I've re-created my lima-kilo to make sure I didn't had any local quirks and now it doesn't start the minio pod for the loki-tracing instance, sigh [14:15:02] in the debug log I get: [14:15:02] loki-tracing-minio-post-job: Jobs active: 0, jobs failed: 0, jobs succeeded: 0 [14:15:16] but no obvious error or logs taht I can see [14:17:14] does it ring any bell by any chance? [14:18:18] godog: that command looks right to me, I would add --batch-size 1 if you haven't already started it. [14:19:10] andrewbogott: I haven't! jumping in a meeting rn then will kick off the cookbook, thank you [14:29:59] next up: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/96 [14:54:53] and now after a restart it doesn't even start the VM,... recreating [14:56:20] that's weird :/, I can restart it, though it takes a while to start all the pods correctly (it kinda tries to start everything at the same time, until things start in the right order) [14:57:05] it was failing to get the disk [14:57:06] of the VM [14:58:49] oh, never had that error [14:59:23] the user facing error was waiting for ssh to work [15:00:11] the log had something about the disk, but re-creating I think it has overriden it, so I lost it, sorry [15:04:19] dcaro: btw the prompt changed (was green earlier) and I also got a missing ~/bin in PATH so when logging into lima-kilo: [15:04:25] /home/volans.linux/.bashrc: line 9: kubectl: command not found [15:04:37] I didn't check it yet if there was any recent change related [15:41:27] patches I was talking about: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/96 and 97, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197308 [15:42:44] 👍 [15:44:42] godog: answering your question about preseed-test, I'm running the test itself on a trixie VM and trying to create a trixie VM (virt-on-virt). IIRC it worked properly when the guest was Bookworm [15:45:14] But this can wait until you have time to think about it :) [15:46:13] andrewbogott: ok! would you mind sharing the full output and possibly one with 'set -x' added to the top of test.sh ? [15:46:57] sure. I'll start that now but it'll take a while [15:47:13] sure np, I'll get going with undrain_node [15:48:46] ...or maybe not since it fails right away! 
[12:51:52] hmm.... it's weird, because the requests when doing `root@toolsbeta-prometheus-2:~# curl -H "X-Test: lerele" --key /etc/ssl/private/toolsbeta-k8s-prometheus.key --cert /etc/ssl/localcerts/toolsbeta-k8s-prometheus.crt --insecure -v https://toolsbeta-test-k8s-control-11.toolsbeta.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/loki/pods/loki-tools-backend-0:3100/proxy/metrics` don't come from the control node's regular ip (172.16....) but from a pod-network ip configured there (192.168....)
[12:51:54] https://www.irccloud.com/pastebin/CqeJVs92/
[12:52:50] and it's natting from one network to the other
[12:53:44] https://www.irccloud.com/pastebin/gWOKdtVx/
[13:00:34] opened T407852 to keep track
[13:00:35] T407852: [infra,logging] prometheus failing to fetch the metrics endpoint - https://phabricator.wikimedia.org/T407852
[13:15:37] dcaro: ack, thank you re: harbor
[13:42:04] looking for reviews of https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/94 and 95
[14:01:28] LGTM
[14:04:38] I'm looking at putting more weight on cloudcephosd1050 and then 1051, my understanding is that this is the correct thing to do? cookbook wmcs.ceph.osd.undrain_node --cluster-name eqiad1 --osd-hostname cloudcephosd1050 --task -id T405478
[14:04:39] T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478
[14:04:48] --task-id not --task -id, you get the idea
[14:05:08] to undrain 2 osds, wait, then repeat
[14:14:59] I've re-created my lima-kilo to make sure I didn't have any local quirks and now it doesn't start the minio pod for the loki-tracing instance, sigh
[14:15:02] in the debug log I get:
[14:15:02] loki-tracing-minio-post-job: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
[14:15:16] but no obvious error or logs that I can see
[14:17:14] does it ring any bells by any chance?
[14:18:18] godog: that command looks right to me, I would add --batch-size 1 if you haven't already started it.
[14:19:10] andrewbogott: I haven't! jumping in a meeting rn then will kick off the cookbook, thank you
[14:29:59] next up: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/96
[14:54:53] and now after a restart it doesn't even start the VM,... recreating
[14:56:20] that's weird :/, I can restart it, though it takes a while to start all the pods correctly (it kinda tries to start everything at the same time, until things start in the right order)
[14:57:05] it was failing to get the disk
[14:57:06] of the VM
[14:58:49] oh, never had that error
[14:59:23] the user-facing error was waiting for ssh to work
[15:00:11] the log had something about the disk, but re-creating I think has overridden it, so I lost it, sorry
[15:04:19] dcaro: btw the prompt changed (was green earlier) and I also got a missing ~/bin in PATH, so when logging into lima-kilo:
[15:04:25] /home/volans.linux/.bashrc: line 9: kubectl: command not found
[15:04:37] I didn't check yet if there was any related recent change
[15:41:27] patches I was talking about: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/96 and 97, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197308
[15:42:44] 👍
[15:44:42] godog: answering your question about preseed-test, I'm running the test itself on a trixie VM and trying to create a trixie VM (virt-on-virt). IIRC it worked properly when the guest was Bookworm
[15:45:14] But this can wait until you have time to think about it :)
[15:46:13] andrewbogott: ok! would you mind sharing the full output and possibly one with 'set -x' added to the top of test.sh ?
[15:46:57] sure. I'll start that now but it'll take a while
[15:47:13] sure np, I'll get going with undrain_node
[15:48:46] ...or maybe not since it fails right away!
[15:49:26] I think I'm wrong and it's the 4-drive thing that breaks it, not the trixie thing
[15:49:41] Sorry, I remember that it worked for me once on Friday, I think that was with all defaults (which would be 2 drives)
[15:49:42] https://phabricator.wikimedia.org/P84205
[15:50:36] bonus:
[15:50:40] https://www.irccloud.com/pastebin/aJNQYOj8/
[15:51:32] andrewbogott: ack thank you, my local system is trixie now and I'll try to reproduce
[15:51:40] thx
[16:02:14] did any of you do anything to cloudcontrol1007?
[16:07:23] not me
[16:07:26] what's it doing?
[16:08:21] `JobUnavailable wmcs (maintain_dbusers_eqiad cloud critical eqiad prometheus)` triggered for a second, it has happened a couple of times before
[16:10:44] andrewbogott: fix is https://gitlab.wikimedia.org/repos/sre/preseed-test/-/merge_requests/6
[16:11:18] andrewbogott: also I confirm sudo is not involved, I'm running all commands in the README without it
[16:12:20] oh, cool
[16:14:00] oh, it's doing things now!
[16:15:00] sweet
[16:17:06] undrain_node is running under screen with my user on cloudcumin1001, I take it I should leave it there to do its thing and it will put all osds in service over time
[16:20:47] that's the idea! There are some timeouts so it might just time out midway through, but it hasn't done that for me lately.
[16:21:11] ok thank you
[16:22:23] I mention it because we're working with new servers that are much bigger than the ones we've tested with the most
[16:22:37] so things that took a few hours on old servers now take all day
[16:24:38] * godog nods
[16:26:48] running a debian VM on another debian VM which is itself running on hardware emulation is... not fast
[16:27:03] But we've got to keep those hardware vendors in business somehow
[16:31:06] * dhinus off
[16:46:47] andrewbogott: for toolforge, I think this is the current raw usage:
[16:46:49] https://usercontent.irccloud-cdn.com/file/uczfNbFx/image.png
[16:47:07] thank you!
[16:48:37] That's actual, not requested?
[16:48:41] yep
[16:48:53] using: `(sum (irate(node_cpu_seconds_total{project="tools",mode!~"idle|guest|guest_nice", instance=~"tools-k8s-worker-.*"}[$__rate_interval])))`
[16:48:57] So about 4 GB of RAM per core
[16:49:06] and ` sum(node_memory_MemTotal_bytes{project="tools",instance=~".*-k8s-worker-.*"} - node_memory_MemAvailable_bytes{project="tools",instance=~".*-k8s-worker-.*"})`
[16:49:32] a bit more yes
[16:52:33] those cpus are probably hyperthreaded cores though, right? So it's actually max 44 physical cores...
[17:04:49] I'd say so, have not checked
[17:05:55] I put those numbers on the task but I'm still confused about what 'config C' is normally
[17:41:33] d.hinus: Nice find on that MTU mismatch. I had totally written that problem off as github edge traffic restrictions.
[17:57:19] d.caro: https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/cinder-csi-plugin/using-cinder-csi-plugin.md is the cinder plugin that Magnum provisions.
[17:59:37] 👍 thanks!
[18:01:48] * dcaro off
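As a footnote to the capacity numbers above: the roughly 4 GB of RAM per busy core figure can be computed directly by dividing the two queries quoted at 16:48/16:49. A sketch, using the same selectors and with Grafana's $__rate_interval swapped for a fixed 5m window:

    # bytes of used RAM per busy CPU core across the tools k8s workers
    # (same selectors as the two queries above, $__rate_interval -> 5m)
    sum(node_memory_MemTotal_bytes{project="tools",instance=~".*-k8s-worker-.*"}
        - node_memory_MemAvailable_bytes{project="tools",instance=~".*-k8s-worker-.*"})
    /
    sum(irate(node_cpu_seconds_total{project="tools",mode!~"idle|guest|guest_nice",instance=~"tools-k8s-worker-.*"}[5m]))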