[06:39:17] <godog>	 greetings
[07:36:16] <godog>	 I'm reading up on toolforge services wrt the POC and was wondering: I see docker-registry and harbor mentioned, are both used atm? my understanding is that harbor is a docker registry
[07:55:31] <dcaro>	 docker-registry is older, and currently hosts the basic images to bootstrap k8s (maybe not all-all, it's been a while since we double checked), and the pre-built images because those were built before harbor was a thing. Then harbor hosts the custom components (jobs-api, builds-api, ...) and the user-built images
[07:55:56] <dcaro>	 (harbor was preferrable for users to have multitenancy, that docker-registry does not provide)
[07:56:17] <godog>	 got it, thank you dcaro ! appreciate it
[07:56:28] <dcaro>	 btw. got paged for a second there about nfs, all good right?
[07:57:19] <godog>	 yes, sorry about that dcaro, my bad I didn't realize a quick reboot/resize pages
[07:57:40] <dcaro>	 np, it will depend if tools-checker catches it on the fly or not :)
[07:58:02] <godog>	 heheh true
[07:58:40] <godog>	 hah toolschecker is on icinga, I just saw
[07:58:44] <godog>	 on -cloud-feed that is
[08:26:45] <dcaro>	 yep, there's a task somewhere to move it to prometheus somehow
[08:26:59] <dcaro>	 quick review, adding logs-api to prometheus targets https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197587
[08:28:43] <taavi>	 +1
[08:31:52] <dcaro>	 thanks
[08:32:42] <taavi>	 opened T407837 to track that metricsinfra-thanos-fe puppet failure, seemingly there's been a refactor of the puppet code that didn't realize we're also using that class
[08:32:43] <stashbot>	 T407837: metricsinfra-thanos-fe-2 Puppet failure - https://phabricator.wikimedia.org/T407837
[08:34:54] <dcaro>	 ack
[08:48:04] <dhinus>	 morning
[08:51:28] <godog>	 I'm confident enough to nuke tools-nfs-2 at this point, objections? I'll shut it down first and nuke it later in the week
[08:52:32] <dhinus>	 +1 for shutting down
[08:52:56] <godog>	 cheers
[08:54:40] <volans>	 only if you do it with --on-fire
[08:54:44] <dcaro>	 +1 for shut it too
[08:55:06] <dcaro>	 huh, do we have an actual shredding service in the dc?
[08:55:18] <volans>	 for disks, yes
[08:55:36] <dcaro>	 ^.^
[08:56:00] <volans>	 https://drive.google.com/drive/folders/1VQbm1IwDUn-zk22TGa3yf6tlETqdXjg5
[08:56:03] <volans>	  ;)
[08:56:09] <volans>	 an example from the past
[08:56:56] <godog>	 oddly satisfying
[08:57:41] <godog>	 iirc it is like a van that shows up
[08:57:54] <volans>	 yes I think so
[09:01:44] * dcaro starting to imagine them in suits with sunglasses and a single earplug, "flashing your hard drives 24/7"
[09:02:29] <volans>	 lol
[09:03:09] <godog>	 haha!
[09:21:44] * godog errand and lunch
[10:04:49] <dcaro>	 hmm, loki is not being scraped correctly, it's failing to fetch stats
[10:04:52] <dcaro>	 looking
[10:21:51] <dcaro>	 hmpf. it does not seem to be so simple, as we have the network policies blocking things
[10:26:36] <dcaro>	 godog: forgot, harbor also hosts the charts we generate, not only images
[10:44:01] <volans>	 maybe we can fix it with the same thing I'm doing for loki-tracing? (still WIP but making progress)
[10:50:37] <dcaro>	 maybe, looking if it's just a network policy missing or something
[10:53:00] <dcaro>	 hmm, it works on lima-kilo
[10:53:05] <dcaro>	 dcaro@toolslocal$ curl --cert k8s.cert --key k8s.key --insecure https://127.0.0.1:6443/api/v1/namespaces/loki/pods/loki-tools-0:3100/proxy/metrics
[10:53:35] <dcaro>	 maybe because the api service is running on the same nod?
[10:55:47] <dcaro>	 but I can reproduce in toolsbeta :), that gives me a safe playground
[10:55:53] <volans>	 :D
[10:56:16] <dcaro>	 anyhow, after lunch stuff 👍
[10:56:18] * dcaro lunch
[11:28:11] <volans>	 while playing with loki-tracing I encountered the following error: "admission webhook "registry-admission.tools.wmcloud.org" denied the request: The following container images .... ", for which I *think* I need this MR: https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/31
[11:28:35] <volans>	 if anyone could have a look and tell me if it makes any sense or I'm doing something else wrong
[11:28:40] * volans lunch
[11:32:02] <taavi>	 volans: uh, what is it trying to pull from an external registry?
[11:34:58] <volans>	 Image=docker.io/nginxinc/nginx-unprivileged:1.28-alpine when I enabled the gateway via config (ony for the -tracing one, not for the -tools one) to expose it outside k8s
[11:35:06] <volans>	 *only
[11:36:18] <volans>	 config in https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/b59c626362599a1815b53a0f8dfc92f81bc752f4/components/logging/values/loki-tracing/common.yaml.gotmpl#L3
[11:37:05] * volans lunch for real now :D
[11:44:45] <taavi>	 iirc we already have a nginx image in our registry, we should use that instead of relying on docker.io
[12:13:14] <volans>	 that would be great, but how to do that while using the upstream chart?
[12:13:31] <volans>	 I just enabled the gateway in the loki's config
[12:13:44] <volans>	 let me check if that's configurable
[12:14:07] <taavi>	 https://artifacthub.io/packages/helm/grafana/loki?modal=values&path=gateway.image.registry says by using gateway.image.* values
[12:15:56] * volans looking at:
[12:15:57] <volans>	 https://github.com/grafana/loki/blob/main/production/helm/loki/templates/gateway/deployment-gateway-nginx.yaml#L59
[12:17:45] <volans>	 taavi: do you have handy the address I should use for our image by any chance?
[12:18:29] <taavi>	 https://docker-registry.toolforge.org/#!/taglist/nginx
[12:19:31] <volans>	 thx
[12:30:11] <dcaro>	 hmmm... something is going on with irc
[12:30:34] <dcaro>	 now just I see volans comments, I saw one of taavi's only before and I was a bit confused
[12:31:51] <volans>	 does docker-registry.svc.toolforge.org works also from outside? (like in lima-kilo)
[12:34:40] <dcaro>	 yes
[12:34:50] <dcaro>	 (only pull)
[12:35:21] <volans>	 btw not sure if related to the change from yesterday alloy fails locally and looking at logs complains about the comments in the file
[12:35:26] <volans>	  /etc/alloy/config.alloy:101:5: illegal character U+0023 '#'
[12:35:31] <volans>	 # there's ~40 pods per worker, so worst case scenario a pod gets 10 logs/s ensured
[12:38:11] <dcaro>	 yep, should be fixed
[12:38:33] <volans>	 thx, rebasing
[12:40:36] <dcaro>	 okok, I found a networkpolicy that allows the prometheus traffic in
[12:40:45] <dcaro>	 but it requires me to list the ips of the control nodes
[12:40:47] <dcaro>	 :/
[12:41:07] <dcaro>	 any ideas how to do so dynamically in the networkpolicy?
[12:43:08] <dcaro>	 this is what I have
[12:43:10] <dcaro>	 https://www.irccloud.com/pastebin/JjV8HO9n/
[12:43:37] <dcaro>	 oh, maybe the kube-system filter might work, my issue was the component label and I did not try that again after adding backend there
[12:49:41] <dcaro>	 nope, kube-system namespace is not enough :/
[12:51:52] <dcaro>	 hmm.... it's weird, because the requests when doing `root@toolsbeta-prometheus-2:~# curl -H "X-Test: lerele" --key /etc/ssl/private/toolsbeta-k8s-prometheus.key --cert /etc/ssl/localcerts/toolsbeta-k8s-prometheus.crt --insecure -v https://toolsbeta-test-k8s-control-11.toolsbeta.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/loki/pods/loki-tools-backend-0:3100/proxy/metrics`, don't come from the control node regular ip (172.16....) 
[12:51:52] <dcaro>	 but from a pod-network ip configured there (192.168....)
[12:51:54] <dcaro>	 https://www.irccloud.com/pastebin/CqeJVs92/
[12:52:50] <dcaro>	 and it's natting from one network to the other
[12:53:44] <dcaro>	 https://www.irccloud.com/pastebin/gWOKdtVx/
[13:00:34] <dcaro>	 opened T407852 to keep track
[13:00:35] <stashbot>	 T407852: [infra,logging] prometheus failing to fetch the metrics endpoint - https://phabricator.wikimedia.org/T407852
[13:15:37] <godog>	 dcaro: ack, thank you re: harbor
[13:42:04] <taavi>	 looking for reviews of https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/94 and 95
[14:01:28] <godog>	 LGTM
[14:04:38] <godog>	 I'm looking at putting more weight on cloudcephosd1050 and then 1051, my understanding is that this is the correct thing to do? cookbook wmcs.ceph.osd.undrain_node --cluster-name eqiad1 --osd-hostname cloudcephosd1050 --task -id T405478
[14:04:39] <stashbot>	 T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478
[14:04:48] <godog>	 --task-id not --task -id, you get the idea
[14:05:08] <godog>	 to undrain 2 osd, wait, then repeat
[14:14:59] <volans>	 I've re-created my lima-kilo to make sure I didn't had any local quirks and now it doesn't start the minio pod for the loki-tracing instance, sigh
[14:15:02] <volans>	 in the debug log I get:
[14:15:02] <volans>	 loki-tracing-minio-post-job: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
[14:15:16] <volans>	 but no obvious error or logs taht I can see
[14:17:14] <volans>	 does it ring any bell by any chance?
[14:18:18] <andrewbogott>	 godog: that command looks right to me, I would add --batch-size 1 if you haven't already started it.
[14:19:10] <godog>	 andrewbogott: I haven't! jumping in a meeting rn then will kick off the cookbook, thank you
[14:29:59] <taavi>	 next up: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/96
[14:54:53] <volans>	 and now after a restart it doesn't even start the VM,... recreating
[14:56:20] <dcaro>	 that's weird :/, I can restart it, though it takes a while to start all the pods correctly (it kinda tries to start everything at the same time, until things start in the right order)
[14:57:05] <volans>	 it was failing to get the disk
[14:57:06] <volans>	 of the VM
[14:58:49] <dcaro>	 oh, never had that error
[14:59:23] <volans>	 the user facing error was waiting for ssh to work
[15:00:11] <volans>	 the log had something about the disk, but re-creating I think it has overriden it, so I lost it, sorry
[15:04:19] <volans>	 dcaro: btw the prompt changed (was green earlier) and I also got a missing ~/bin in PATH so when logging into lima-kilo:
[15:04:25] <volans>	  /home/volans.linux/.bashrc: line 9: kubectl: command not found
[15:04:37] <volans>	 I didn't check it yet if there was any recent change related
[15:41:27] <taavi>	 patches I was talking about: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/96 and 97, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197308
[15:42:44] <dcaro>	 👍
[15:44:42] <andrewbogott>	 godog: answering your question about preseed-test, I'm running the test itself on a trixie VM and trying to create a trixie VM (virt-on-virt). IIRC it worked properly when the guest was Bookworm
[15:45:14] <andrewbogott>	 But this can wait until you have time to think about it :)
[15:46:13] <godog>	 andrewbogott: ok! would you mind sharing the full output and possibly one with 'set -x' added to the top of test.sh ?
[15:46:57] <andrewbogott>	 sure. I'll start that now but it'll take a while
[15:47:13] <godog>	 sure np, I'll get going with undrain_node 
[15:48:46] <andrewbogott>	 ...or maybe not since it fails right away!
[15:49:26] <andrewbogott>	 I think I'm wrong and it's the 4-drive thing that breaks it, not the trixie thing
[15:49:41] <andrewbogott>	 Sorry, I remember that it worked for me once on Friday, I think that was with all defaults (which would be 2 drives)
[15:49:42] <andrewbogott>	 https://phabricator.wikimedia.org/P84205
[15:50:36] <andrewbogott>	 bonus:
[15:50:40] <andrewbogott>	 https://www.irccloud.com/pastebin/aJNQYOj8/
[15:51:32] <godog>	 andrewbogott: ack thank you, my local system is trixie now and I'll try to reproduce
[15:51:40] <andrewbogott>	 thx
[16:02:14] <dcaro>	 did any of you do anything to cloudcontrol1007?
[16:07:23] <andrewbogott>	 not me
[16:07:26] <andrewbogott>	 what's it doing?
[16:08:21] <dcaro>	 `JobUnavailable wmcs (maintain_dbusers_eqiad cloud critical eqiad prometheus)` triggered for a second, it has happened a couple times before
[16:10:44] <godog>	 andrewbogott: fix is https://gitlab.wikimedia.org/repos/sre/preseed-test/-/merge_requests/6
[16:11:18] <godog>	 andrewbogott: also I confirm sudo is not involved, I'm running all commands in the README without it
[16:12:20] <andrewbogott>	 oh, cool
[16:14:00] <andrewbogott>	 oh, it's doing things now!
[16:15:00] <godog>	 sweet
[16:17:06] <godog>	 undrain_node is running under screen with my user on cloudcumin1001, I take it I should leave it there do to its thing and will put all osds in service over time
[16:20:47] <andrewbogott>	 that's the idea! There are some timeouts so it might just time out midway through but it hasn't done that for me lately.
[16:21:11] <godog>	 ok thank you
[16:22:23] <andrewbogott>	 I mention because we're working with new servers that are much bigger than the ones we've tested with the most
[16:22:37] <andrewbogott>	 so things that took a few hours on old servers now take all day
[16:24:38] * godog nods
[16:26:48] <andrewbogott>	 running a debian VM on another devian VM which is itself running on hardware emulation is... not fast
[16:27:03] <andrewbogott>	 But we've got to keep those hardware vendors in business somehow
[16:31:06] * dhinus off
[16:46:47] <dcaro>	 andrewbogott: for toolforge, I think this is the current raw usage:
[16:46:49] <dcaro>	 https://usercontent.irccloud-cdn.com/file/uczfNbFx/image.png
[16:47:07] <andrewbogott>	 thank you!
[16:48:37] <andrewbogott>	 That's actual, not requested?
[16:48:41] <dcaro>	 yep
[16:48:53] <dcaro>	 using: `(sum (irate(node_cpu_seconds_total{project="tools",mode!~"idle|guest|guest_nice", instance=~"tools-k8s-worker-.*"}[$__rate_interval])))`
[16:48:57] <andrewbogott>	 So about 4 GB of RAM per core
[16:49:06] <dcaro>	 and ` sum(node_memory_MemTotal_bytes{project="tools",instance=~".*-k8s-worker-.*"} - node_memory_MemAvailable_bytes{project="tools",instance=~".*-k8s-worker-.*"})`
[16:49:32] <dcaro>	 a bit more yes
[16:52:33] <andrewbogott>	 those cpus are probably hyperthreaded cores though, right? So it's actually max 44 physical cores...
[17:04:49] <dcaro>	 I'd say so, have not checked
[17:05:55] <andrewbogott>	 I put those numbers on the task but I'm still confused about what 'config C' is normally
[17:41:33] <bd808>	 d.hinus: Nice find on that MTU mismatch. I had totally written that problem off as github edge traffic restrictions.
[17:57:19] <bd808>	 d.caro: https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/cinder-csi-plugin/using-cinder-csi-plugin.md is the cinder plugin that Magnum provisions.
[17:59:37] <dcaro>	 👍 thanks!
[18:01:48] * dcaro off