[06:01:41] ERROR: Configuration file not available.
[06:01:41] ERROR: /home/somebody/.s3cfg: None
[06:01:41] klausman when deploying in staging. Also noticed extra / in s3 URL: [2025-07-09 06:00:50] Downloading using s3cmd: s3://wmf-ml-models/mint/20250514081434//nllb/nllb200-600M.tgz
[06:01:41] URL thing, I'll fix in the next patch.
[06:29:48] Interesting. We're getting s3 URL, which can be probably only possible when s3 config is correct.
[06:41:39] kart_: o/ I think we discussed this in the patch IIRC, s3cmd needs `s3cmd --configure` otherwise it fails
[06:41:54] or a proper config file rendered somewhere
[06:53:49] elukey: isn't that interactive command?
[06:54:13] I think issue seems script not finding it. Looking at it again.
[06:55:20] kart_: I don't recall if it was interactive, but I thought you and Tobias tested the command :D
[06:57:06] ok I see, you do cat > ~/.s3cfg < is it being created?
[06:59:14] Yes, but s3cmd is not finding it and then it is listing s3 bucket.
[06:59:42] Yes, we tested with stat machine
[07:01:02] okok perfect, can you try to redeploy so I can inspect the pod?
[07:01:37] otherwise I can do it
[07:04:40] I see you have already your hands full with another deployment, lemme try it
[07:05:53] I was about to say that ;)
[07:06:36] We can probably use --config configpath to make sure it is correct. I can improve that part, mostly ready.
[07:09:28] I cannot see if the container has the home config via kubectl, the container is not available
[07:09:31] mmmm
[07:10:38] because one obvious things that may happen is that ~/.s3cfg is not under /home/somebody
[07:12:06] another improvement, while I am reading the patch
[07:12:07] host_base = https://thanos-swift.discovery.wmnet
[07:12:07] host_bucket = https://thanos-swift.discovery.wmnet
[07:12:20] these two would need to be configurable, if we move away from thanos etc..
[07:12:28] we could use env variables as well
[07:13:26] the other thing that I don't recall is if we tried to use s3cmd with -c /some/path directly
[07:13:33] rather than relying on the home dir's config
[07:13:59] Yes. That's bit difficult to guess, so we need to set it as well.
[07:14:07] Sometime it maybe /root
[07:14:51] s3cmd --config "$CONFIG_PATH" get "$url" "$dest_path"
[07:15:24] export HOME="${HOME:-/root}"
[07:15:25] CONFIG_PATH="${CONFIG_PATH:-$HOME/.s3cfg}"
[07:15:25] etc
[07:19:53] kart_: so IIUC with -c it works without requiring anything in the home dir right? If so I'd simply add the config file under something like /etc/s3cfg via deployment-charts
[07:20:11] and the entrypoint.sh should just point to that file (configurable via env var if possible)
[07:26:35] I think that will be better. Meanwhile, I've submitted possible fix, https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/1167448
[07:27:08] Not sure how entrypoint.sh will deal with that file though.
[07:29:12] I have some interviews this morning but if you want I can try to file a patch for the /etc/etc.. approach in deployment-charts, should be really simple
[07:29:57] That would be helpful.
[07:30:15] Thanks :)
[07:30:26] np, I'll ping you when ready :)
[07:30:42] cool
[07:30:59] ack
[07:31:03] (reading backlog)
[07:31:12] also, morning :)
[07:35:55] I agree with Luca that we should maybe make the cfgfile sit in /etc by default, and make it configurable by env var.
[07:36:48] Though it's a bit puzzling why s3cmd can't find it.
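[Editor's note] The thread above converges on passing the config to s3cmd explicitly instead of relying on $HOME. A minimal sketch of what that entrypoint logic could look like, assuming a config file mounted at /etc/s3cfg via deployment-charts; S3_CONFIG_PATH, MODEL_URL and DEST_PATH are illustrative names, not the actual chart or script values:

#!/bin/bash
# Sketch only: resolve the s3cmd config from an env var (default /etc/s3cfg,
# i.e. the file that would be shipped via deployment-charts) and always pass
# it explicitly, so nothing depends on $HOME resolution inside the container.
set -euo pipefail

S3_CONFIG_PATH="${S3_CONFIG_PATH:-/etc/s3cfg}"

if [ ! -r "$S3_CONFIG_PATH" ]; then
    echo "s3cmd config not readable at $S3_CONFIG_PATH" >&2
    exit 1
fi

# MODEL_URL / DEST_PATH stand in for the values the real script already computes.
s3cmd --config "$S3_CONFIG_PATH" get "$MODEL_URL" "$DEST_PATH"

With this shape, the cat > ~/.s3cfg step and the HOME fallback become unnecessary, and the host_base / host_bucket values quoted at 07:12:07 would live in the mounted file (templated by the chart, e.g. from env variables) rather than being written into the home directory at runtime.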
[07:37:39] o/ good morning
[07:43:22] klausman: I suspect that the entrypoint.sh is not executed as user somebody, but we'd need to check that
[07:45:21] Moin klausman
[07:45:40] I've to go for some work and lunch, should be catch up after that..
[08:31:37] morning!
[08:32:21] * aiko have an appointment, back in 1h
[09:16:57] o/ fyi: https://phabricator.wikimedia.org/T399066
[09:18:55] ^--- DPE SRE is helping setup permissions
[09:18:55] morning morning
[09:23:29] klausman: qq - are ml-serve1012+ the new hosts with the GPUs?
[09:24:31] because the BIOS config is totally different from the others
[09:24:38] * elukey cries in a corner
[09:26:07] Yes, they are the 8xGPU machines
[09:26:22] * elukey cries in a corner
[09:26:38] Is it at least the same BIOS manufacturer?
[09:28:18] not sure, but via redfish the options are different, so I imagine we'll have to figure out how/why and adapt provisioning for it
[09:31:26] I saw in the SRE meeting notes that you already have contact with some Redfish person at SMC, maybe they can help?
[09:34:01] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10987482 (elukey) I checked the BIOS configs via Redfish and they are different from what we expect, the cookbook fails since we expect `BootModeSelect` to be present...
[09:35:06] sure I can try to reach out, but these are very custom hosts so I think we'll need to adapt discovering the right config for our process. I think they do work only with UEFI, so this may be the big difference
[09:35:42] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10987484 (klausman) >>! In T393948#10985332, @Jclark-ctr wrote: > @klausman Will this be legacy or uefi? it is reachable We don't have a particular preference for...
[09:35:45] Yes, I replied on the above bug re: EFI
[09:36:35] what is the timeline for these nodes? I mean, how urgent are they etc..
[09:39:17] Not needed-next-week.
[09:39:42] sure but what is the timeline? :)
[09:40:03] there are quite a few things to do to make them working, this is why I am asking
[09:40:09] we'd need a bit of planning
[09:40:18] I think if they're in the cluster and serving by mid-August, that's enough
[09:42:19] let's aim for Sept, and the serving part may be also delayed a little too
[09:42:30] other than make them working up to being able to reimage
[09:42:47] we'll also have to package and deploy the new GPU plugin, test it etc..
[09:42:55] ack, i'd rather have the machines later if that means re-imaging is rock solid.
[09:49:41] John just told me that the new hosts have a ton of nvme drives, so they must work with UEFI
[10:00:36] * aiko back!
[10:08:40] klausman, elukey: o/ should we merge this https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1165850 and deploy/test in experimental ns in staging first?
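[Editor's note] On the suspicion at 07:43:22 that entrypoint.sh may not run as user somebody: when a pod stays up long enough to exec into (which, per 07:09:28, was not the case here), this can be checked directly with kubectl. The namespace, pod and container names below are placeholders:

# Which UID/GID does the entrypoint actually run as, and where does $HOME point?
kubectl -n <namespace> exec <pod> -c <container> -- id
kubectl -n <namespace> exec <pod> -c <container> -- sh -c 'echo "HOME=$HOME"; ls -l "$HOME/.s3cfg" || true'

# The effective runAsUser/runAsGroup can also be read off the pod spec without exec:
kubectl -n <namespace> get pod <pod> -o jsonpath='{.spec.securityContext}{"\n"}{.spec.containers[*].securityContext}{"\n"}'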
[10:17:05] when we confirm the queue proxy image is working properly in staging, we can proceed with the user/credential+queue proxy changes in prod cluster
[10:29:50] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1167565 need review for the edit check image update whenever anyone has time :)
[10:52:07] aiko: the change is deployed on ml-staging, in theory you should see a diff for every namespace now
[10:52:42] mmmm no uff
[10:53:53] ah wow the pods are recycling themselves
[10:54:33] Image: docker-registry.discovery.wmnet/knative-serving-queue:1.7.2-7
[10:55:42] aiko, klausman - ok so the knative control plane reschedules all the pods as soon as the new setting is deployed via admin_ng
[10:56:16] that is good, since we don't have to deploy manually, but somehow bad since we'll likely have to depool the prod clusters one at the time when we apply the change (to be on the safe side)
[10:56:47] does it make sense?
[11:08:44] yeah, makes sense.
[11:09:09] On one hand the auto-deploy is kinda annoying since it restarts the world, OTOH, we have the upside that we're less likely to miss some service
[11:12:01] we'll have to redeploy again for the storage initializer's credentials though
[11:12:52] Ah, since the env is not re-evaluated. Bummer
[11:14:57] yep
[11:15:16] but if we depool we can probably have a quicker pace in re-deploying
[11:15:55] Yeah, also if we wait a few moments after depooling, all the autoscaling should reduce the # of running pods for lack of traffic
[11:16:18] Not increasing speed a ton, but every bit helps
[11:23:02] elukey: the drain for codfw-prod would be `confctl --object-type discovery select 'dnsdisc=inference,name=codfw' set/pooled=false` right?
[11:27:47] klausman: should we also attempt https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/1167448 before moving to conf file based approach?
[11:34:49] meta-l,meta-lI'll add some thoughts to the patch on Gerrit.
[11:34:59] oops, where did that meta-l come from...
[12:00:47] :)
[12:30:31] klausman, elukey: ack! so are we gonna do codfw-prod first? depool -> apply the knative change -> wait for all the pods rescheduled -> deploy credential change namespace by namespace
[12:30:38] are these steps correct?
[12:30:56] we should also repool at the end ;) but yes
[12:33:06] right!
[12:41:03] klausman: I'll redeploy and see what the logs say about the s3 config file path
[12:41:17] Roger
[13:28:11] klausman: yep I think it should work!
[13:28:25] aiko: let's test staging with httpbb first, just to make sure
[13:30:22] (currently in a meeting, but will keep an eye here)
[14:39:41] elukey: yes good point!
[14:40:19] https://www.irccloud.com/pastebin/VjjQx3VY/
[14:41:32] all right it looks good! Just to be sure, could you check if the queue proxy's logs are still good on a couple pods?
[14:41:38] just to be sure that we are not logging horrors
[14:49:52] https://phabricator.wikimedia.org/P78843 checked a couple pods, only saw logs of starting queue-proxy, nothing else
[14:51:46] quiet is good :)
[14:52:09] all right so we can in theory proceed with depooling codfw in prod and apply the helmfile change in admin_ng
[14:52:28] that will take a bit though, so time-wise it may make sense to start the work tomorrow morning
[14:54:06] that makes sense! tomorrow morning works for me
[14:57:19] klausman: would you be available then? maybe we can start around 10am?
[14:57:33] sounds good!
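[Editor's note] A rough sketch of the codfw rollout sequence agreed at 12:30:31 (plus the repool at the end). The confctl selector is the one quoted verbatim at 11:23:02; the deployment-charts path, helmfile environment name and the example namespace/pod names are assumptions, not verified values:

# 1. Drain codfw so knative can reschedule pods without live traffic.
confctl --object-type discovery select 'dnsdisc=inference,name=codfw' set/pooled=false

# 2. Apply the queue-proxy change via admin_ng; the control plane then
#    recycles all pods on its own (path and environment name assumed).
cd /srv/deployment-charts/helmfile.d/admin_ng
helmfile -e ml-serve-codfw -i apply

# 3. Spot-check the new queue-proxy image and its logs on a couple of pods.
kubectl -n <namespace> get pod <pod> -o jsonpath='{.spec.containers[?(@.name=="queue-proxy")].image}{"\n"}'
kubectl -n <namespace> logs <pod> -c queue-proxy --tail=50

# 4. Deploy the storage-initializer credential change namespace by namespace
#    with the usual per-service helmfile deploy, then repool.
confctl --object-type discovery select 'dnsdisc=inference,name=codfw' set/pooled=true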
[14:59:39] alrightyyy let's do it :D
[18:28:31] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Build Peacock Model retraining pipeline - https://phabricator.wikimedia.org/T393103#10989486 (ppelberg)