[06:55:22] greetings
[07:25:58] FYI minio is now "source-only" and will not provide binaries: https://github.com/minio/minio/blob/master/README.md#source-only-distribution
[07:27:17] https://github.com/minio/minio/issues/21647 for the rants ;)
[07:27:32] lol
[07:40:57] I'm about to add more osds from cloudcephosd1050, this time with batch size 2, unless there are objections
[07:41:08] cookbook wmcs.ceph.osd.undrain_node --cluster-name eqiad1 --osd-hostname cloudcephosd1050 --task-id T405478 --batch-size 2
[07:41:08] T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478
[07:51:02] I'll take that as a "no objections"
[07:51:45] I've not enough context to have objections, sorry :D
[07:52:24] all good volans, no worries
[08:01:21] morning
[08:03:46] hey
[08:04:33] all good so far re: undrain_node with batch-size 2, network peaked briefly at ~20gbit between f4 and c8 and that's it afaics
[08:04:55] https://grafana.wikimedia.org/goto/25chSHRDg?orgId=1
[08:07:25] bbiab
[08:09:45] nice
[08:11:34] we might want to change the `ods internal/external` graphs, might be easier to have two, one for hosts with 25G and one for ones with 10G, so the limits can be seen nicely
[08:13:28] (that's a note for myself)
[08:16:52] ok! makes sense
[08:29:47] hello
[09:17:06] quick review? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198002 I'm removing an old key that's unused now
[09:17:17] (personal ssh key)
[09:18:27] +1ed
[09:19:57] still looking for a stamp https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/96
[09:20:18] thanks!
[09:20:27] oh yes, started looking into that one
[09:22:21] got lost trying to figure out where the vars came from
[09:23:43] looking at the rendered state instead, LGTM, +1d
[09:25:43] oh, the other change now does not show the results of the run :/, I'll rerun the pipeline
[09:25:52] `You should run a new pipeline, because the target branch has changed for this merge request.`
[09:26:36] hmm
[09:26:38] https://usercontent.irccloud-cdn.com/file/c38oDv7E/image.png
[09:26:56] the "run pipeline" button now says "add a to-do item", how do I trigger the pipeline?
[09:28:00] ok... had to go to the side panel, build -> pipelines -> run new pipeline, and select the branch
[09:39:14] taavi: +1d the other one too
[09:39:23] was there any others?
[09:39:35] *were
[09:40:07] thanks, you already gave a +1 on the other patch. I'll merge those together later in the day
[10:01:35] taavi: loki-tools.loki.svc.cluster.local should work also locally on lima-kilo if I become tf-test, right?
[10:02:15] because I'm getting curl could not resolve host
[10:35:19] volans: $CLUSTER.local addresses are kubernetes internal ones, so they will only work from pods inside the cluster
[10:36:52] volans: note that in toolsbeta/tools that will be loki-tools-reader (for queries)
[10:36:58] and loki-tools-writer for writing
[10:37:27] I was adding some diagrams here today https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Logs_Service
[10:37:46] s/reader/read s/writer/write
[10:38:53] some of that seems to be duplicating https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Logging
[10:39:47] dcaro: yes, hence my question, when I've become tf-test I'm inside k8s, no?
[10:39:56] so shouldn't that work?
[10:40:15] no, `become` is a fancy `sudo` alias, you're still on the lima-kilo VM itself
[10:40:40] easiest way to get a shell inside k8s is to `toolforge webservice shell`
[10:40:44] you can run `toolforge webservice shell`
[10:40:45] yep
[10:40:46] that
[10:40:55] ahhh ok
[10:41:38] thx
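To make the distinction concrete, a minimal sketch of the two cases; the loki-tools-read service name and the /ready path are illustrative, not confirmed from this setup:

    # `become` is sudo under the hood: you get the tool's environment but stay
    # on the lima-kilo VM, where cluster-internal DNS names do not resolve.
    become tf-test
    curl http://loki-tools.loki.svc.cluster.local    # fails: could not resolve host

    # `toolforge webservice shell` opens a shell in a pod inside the cluster,
    # where *.svc.cluster.local service names resolve normally.
    toolforge webservice shell
    curl http://loki-tools-read.loki.svc.cluster.local/ready    # illustrative name/path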
[10:50:56] oh, I think I killed tools prometheus xd
[10:50:57] sorry
[10:51:16] and it's back up, nice
[10:57:11] * dcaro lunch
[14:06:14] godog: given that preseed-test can't reproduce my issue... do you (or does anyone) have thoughts about how to figure out what's happening with my grub partition? Best I can tell the grub rescue> isn't going to tell me anything
[14:06:40] It's looking for a uuid-named partition but 'ls' only shows human-named partitions; I'm not sure if that's a clue or if that's totally expected.
[14:09:23] andrewbogott: mmhh I don't have any thoughts atm, though I do have some time now to take a look
[14:11:31] I think the console is at the grub prompt right now
[14:11:43] also... here's the state of partitioning before reboot:
[14:11:49] andrewbogott: did I get it right from the task that a reimage via cookbook goes back to square one (i.e. grub doesn't know how to boot)?
[14:11:58] https://usercontent.irccloud-cdn.com/file/0M6ZV0Vk/grubinstall1.png
[14:12:03] https://usercontent.irccloud-cdn.com/file/c2bpFdib/grubinstall2.png
[14:12:08] yes I'm on console and grub is at its rescue prompt
[14:12:52] I'm not sure I understand your question, but I've tried this 3-4 times from the cookbook and the results are the same. I'm using a slightly modified partman recipe for the last attempt but the grub rescue prompt was the same every time
[14:13:26] ok thank you, yes that answers my question
[14:13:27] I'm confused by grub rescue wanting a partition ID that didn't seem to exist during install...
[14:13:51] but I'm not sure grub is using the same ID scheme
[14:14:03] since grub tries to be OS-neutral I think it's off in its own world
[14:17:56] mmhh ok I'll kick off a reimage
[14:18:32] you might want to go back to the normal partition recipe? It's very possible that my changes have somehow made it worse even though the symptom is identical.
[14:18:52] good point yeah, I'll revert to the standard recipe
[14:18:59] also: I'm not picking on you in particular to fix this; I'm just out of ideas and you touched the recipe most recently :)
[14:19:15] andrewbogott: do you happen to have the cookbook reimage command line handy
[14:19:26] sure no worries, all good
[14:19:35] andrew@cumin2002:~$ sudo cookbook --no-locks sre.hosts.reimage --os trixie cloudcontrol2010-dev --new
[14:19:44] thank you
[14:20:02] --no-locks is for when I get impatient and ctrl-c mid-cookbook
[14:21:07] btw ceph seems happy with cloudcephosd1050
[14:22:00] indeed
[14:22:01] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198082
[14:22:22] +1
[14:22:36] I'm never 100% sure where to merge those patches before reinstalling. install2004 maybe?
[14:23:16] yes iirc that's correct
[14:23:34] um... 2005
[14:23:47] I guess install* is probably the right answer :)
[14:23:51] want me to do all that?
[14:24:12] thank you, I'll run puppet there
[14:46:12] andrewbogott: d-i is at the 'update-grub' stage and it is taking much longer than I expected, did it do the same for you?
[14:46:51] I don't think I had expectations for how long it would take
[14:47:10] but maybe! There were certainly some messages it throws about efi which I pasted in the task
[14:47:21] ok thank you
[14:47:53] I think you can see the grub install logs on tab 4?
[14:49:04] really I want the installer to pause at the end, right before reboot, so we can see what we've got
[14:49:10] not sure if that's possible
[14:59:11] ok time is up for me at least now, I don't have any good leads atm
[14:59:49] may try again tomorrow, though andrewbogott I logged off the console, you can test/poke at will if you wish
[15:00:07] godog: did it land at grub rescue?
[15:00:17] andrewbogott: yes it did
[15:00:28] Welcome to GRUB! error: disk `lvmid/8T1qbg-8jkl-YiRW-sUpS-GW2a-F9cz-JNP1PG/S9ST3M-jR71-cmDR-EJmn- X4Kg-pXc9-WmTorL' not found.
[15:01:09] is that a space in the name?
[15:05:57] godog: ok, same then. Thanks for looking, have a good night!
[15:14:43] what's the easy way to check the current live full config of a pod in lima-kilo?
[15:20:00] found what I needed
[15:23:43] xd, ack
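For the record, a few standard ways to dump a pod's live, fully rendered config from the API server; the loki-tracing resource names here are illustrative:

    # full live object as the API server sees it
    kubectl get deployment loki-tracing-gateway -o yaml
    kubectl get pod loki-tracing-gateway-abc123 -o yaml    # illustrative pod name

    # human-readable summary, including volumes, mounts, and recent events
    kubectl describe pod loki-tracing-gateway-abc123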
[15:44:41] andrewbogott: cheers, I'll poke at it one more time for today
[16:05:50] ok, I'm clearly missing something obvious wrt loki-tracing
[16:06:40] from kubectl get deployment I see a volume with the secret being created "secretName": "loki-tracing-basic-auth"
[16:07:09] but then the nginx config in the gateway has auth_basic_user_file /etc/nginx/secrets/.htpasswd;
[16:07:32] but there is no /etc/nginx/secrets/ directory created
[16:13:28] is a volume added to the k8s pod?
[16:14:19] one that would appear in df like a normal volume?
[16:14:27] only /run/secrets/kubernetes.io/serviceaccount
[16:14:49] i mean in the k8s pod yaml :-)
[16:15:11] yes
[16:15:12] "name": "auth",
[16:15:12] "secret": {
[16:15:12] "defaultMode": 420,
[16:15:12] "secretName": "loki-tracing-basic-auth"
[16:15:15] }
[16:15:22] if I got what you mean :)
[16:16:21] it's under items.0.spec.template.spec.volumes
[16:16:49] there should be a corresponding block under the container config
[16:17:27] correct
[16:17:28] { "mountPath": "/etc/nginx/secrets", "name": "auth"
[16:17:28] },
[16:17:39] under volumeMounts
[16:18:03] and another one for nginx, possible that they conflict?
[16:18:10] "mountPath": "/etc/nginx", "name": "config"
[16:18:11] you have those, the pod starts, but then that directory does not exist?
[16:18:15] that could be it.
[16:18:16] correct
[16:18:35] I wonder if the nginx one shadows the secret one
[16:19:01] but all this is from upstream, not my work
[16:19:06] I just "enabled" bits via config
[16:19:54] https://github.com/grafana/loki/blob/main/production/helm/loki/templates/gateway/deployment-gateway-nginx.yaml#L93
[16:20:58] I followed more or less https://github.com/grafana/loki/blob/main/docs/sources/setup/install/helm/deployment-guides/aws.md
[16:24:43] in case you want to check my code that's https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/commits/loki-tracing
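A few checks that could narrow down whether the secret volume is actually being mounted; the namespace and resource names are taken from the discussion above and may not match the real deployment:

    # does the secret object exist at all in the namespace?
    kubectl -n loki-tracing get secret loki-tracing-basic-auth

    # what does the running gateway container actually see?
    kubectl -n loki-tracing exec deploy/loki-tracing-gateway -- ls -la /etc/nginx/secrets

    # compare the declared mounts against the live spec
    kubectl -n loki-tracing get deploy loki-tracing-gateway \
        -o jsonpath='{.spec.template.spec.containers[0].volumeMounts}'

Note that nested volumeMounts (one volume at /etc/nginx, another at /etc/nginx/secrets) are generally supported by Kubernetes, so outright shadowing would be surprising; a missing secret object, or the mount sitting on a different container, are more common culprits.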
[16:38:02] I'm starting to think that it might be worth it to create a component repo, and instead of trying to twist the loki chart to create custom resources, just create a new chart with the plain templates
[16:38:54] I'm also thinking of moving the current logging component for tools to something like that, it also helps keep track of the versioning better
[16:39:33] you mean not using upstream at all?
[16:40:36] nono, just not using upstream to create the ingress, and just add a plain deployment + whatever needed
[16:41:17] instead of trying to get the upstream chart to create the ingress through their configuration settings
[16:41:28] but they have everything already done (nginx proxy, basic auth, using a secret)
[16:41:45] it's a supported use case described in their guide to deploy to AWS for example
[16:42:50] yep, though I'm not sure they support deploying two instances inside a toolforge cluster xd
[16:43:36] just saying that at some point, if you find yourself spending too much time debugging the upstream chart to deploy the objects that are not core to the product, maybe better to do something like https://gitlab.wikimedia.org/repos/cloud/toolforge/wmcs-k8s-metrics and just plainly define them in their own templates
[16:44:41] give my noob level in k8s magic I'm not sure if it's a me-problem or an upstream limit :)
[16:44:44] *given
[16:45:00] fair enough :)
[17:00:11] * dhinus off
[17:31:26] * dcaro off
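As a postscript to the plain-templates idea from 16:38: a minimal sketch of what a hand-rolled gateway deployment could look like, in the style of the wmcs-k8s-metrics repo linked above. All names, the image, and the port are illustrative, not the upstream chart's output:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: loki-tracing-gateway
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: loki-tracing-gateway
      template:
        metadata:
          labels:
            app: loki-tracing-gateway
        spec:
          containers:
            - name: nginx
              image: nginx:1.27    # illustrative
              ports:
                - containerPort: 8080
              volumeMounts:
                - name: config
                  mountPath: /etc/nginx    # nginx.conf rendered from a ConfigMap
                - name: auth
                  mountPath: /etc/nginx/secrets    # .htpasswd for basic auth
                  readOnly: true
          volumes:
            - name: config
              configMap:
                name: loki-tracing-gateway-config
            - name: auth
              secret:
                secretName: loki-tracing-basic-auth

Owning templates like this trades the upstream chart's defaults for full control over exactly which objects get created, which is the tradeoff weighed in the conversation above.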