[08:44:44] morning :)
[08:55:21] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 (kevinbazira)
[08:56:11] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 (kevinbazira) @kostajh, we published datasets for all 21/22 models that passed the evaluation in this round.
[09:03:14] \o G'day
[09:16:23] Machine-Learning-Team: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (klausman) Yes, my plan was to elaborate on my write-up a bit (it's mostly for sorting my thoughts), and then use the template you mentioned and develop that into something like the API GW SLO (with plenty of SRE input).
[09:30:29] hi folks! Didn't write anything this morning, forgot to check the chan :)
[09:31:23] \o
[09:50:16] aiko: o/ I created https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/909196 as an initial test for the AMD GPUs on DSE
[09:50:42] I basically followed https://github.com/RadeonOpenCompute/k8s-device-plugin/blob/master/example/pod/alexnet-gpu.yaml, suggested by AMD upstream
[09:50:56] we can expand the tests with pytorch etc.. of course
[09:51:05] but it seems a good first use case, lemme know :)
[09:57:40] ok so I have filed the code changes to hopefully get GPUs working on DSE
[09:57:47] we'll see how it goes
[10:05:13] I've had a look, and it LGTM, will add a +2 in a bit
[10:05:48] The baremetal vs. in-a-privileged-pod thing is a bit quirky, but likely the best approach.
[10:18:28] klausman: what do you mean by "quirky"?
[10:18:51] In the sense that it's not what upstream tells people to do
[10:19:30] I.e. it's a quirk we have compared to others. I borrowed the term from the Linux kernel, where some devices work _mostly_ like all the others in their class (e.g. USB Audio), but have quirks.
[10:20:59] my understanding of what upstream proposes is that they want to allow people to deploy daemons via k8s, which is definitely more appropriate if you don't have control over the bare-metal nodes
[10:21:10] but allowing root + privileged pods seemed a little risky
[10:21:31] we just need to run a daemon basically
[10:21:32] Ack. Plus, some people may have potentially-evil tenants
[10:22:47] the explanation is in https://v1-23.docs.kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
[10:23:19] so the device plugin is a little Go gRPC service that contacts the kubelet, and that the kubelet can contact in turn (via Unix sockets)
[10:23:47] I can see why upstream builds things as a daemonset, a lot of users don't have access to the k8s nodes
[10:25:12] IIUC, this is purely for discovery of resources, the pod with GPU code would still talk to the GPU directly (after the kubelet has allocated the GPU to it), right?
[10:27:17] this is the unclear part, I don't have a 100% solid idea about what happens (namely whether docker/kubelet act as a proxy for these devices)
[10:27:59] the /dev/kfd device needs the user to be in a specific group, "render", to access it, so if the container accesses it directly we may see some issues
[10:28:18] (we use a special udev rule for stat100x to allow analytics users to be automatically in "render")
[10:28:23] Yeah, I have wondered about the specifics of GPU-pod comms for a while
[10:28:54] My best guess is that the kubelet upon allocation makes a magic device visible in the pod.
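A minimal sketch of the kind of GPU smoke-test pod discussed above, expressed with the official kubernetes Python client rather than the upstream alexnet-gpu.yaml; it assumes the AMD device plugin is already running on the node and advertising the "amd.com/gpu" resource, and the pod name, image, and command are placeholders for illustration:

    from kubernetes import client, config

    def create_gpu_test_pod():
        # talk to the cluster via the local kubeconfig
        # (use config.load_incluster_config() when running inside a pod)
        config.load_kube_config()
        pod = client.V1Pod(
            metadata=client.V1ObjectMeta(name="amd-gpu-smoke-test"),  # placeholder name
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="gpu-test",
                        image="rocm/tensorflow:latest",  # placeholder image
                        command=["python3", "-c",
                                 "import tensorflow as tf; "
                                 "print(tf.config.list_physical_devices('GPU'))"],
                        resources=client.V1ResourceRequirements(
                            # the scheduler only places this pod on a node where the
                            # device plugin has advertised a free amd.com/gpu device
                            limits={"amd.com/gpu": "1"},
                        ),
                    )
                ],
            ),
        )
        client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

    if __name__ == "__main__":
        create_gpu_test_pod()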
[10:29:30] Of course the user mapping inside the pod is under the control of the kubelet, so it could manage the permissions accordingly
[10:30:00] it is probably docker at this stage that "owns" the device and maps it to the container, we'll see with what perms
[10:30:10] maybe we'll need some special settings in its config
[10:32:20] going afk for lunch!
[10:32:30] \o
[10:45:14] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (BTullis)
[12:23:24] <- Lunch & errands
[13:22:40] elukey: the patch looks good! one q - I saw you removed the pkg version for tensorflow-rocm. In the first patch you specified version 2.11.0.540, isn't it needed?
[13:27:01] aiko: ah snap, you are right, I forgot to re-add it
[13:27:13] it works for now since it is the latest version, but I need to fix it :(
[13:27:14] thanks!
[13:32:59] aiko: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/909256
[13:51:19] I was just deploying updates to codfw and it turns out orespoolcounter2004 is down, known issue?
[13:51:52] ah, looking at Icinga, it's been down for 3d 23h 49m 11s actually
[13:54:24] ouch, thanks for the ping, didn't really see it
[14:01:24] elukey: one q about page_change - I checked eqiad.rc1.mediawiki.page_change and it seems all the events there are dummy events. when you tested it before, did you use real events?
[14:02:23] aiko: you need to use the codfw ones, mediawiki is active in codfw right now so eqiad doesn't get anything (I fell into the same trap and wondered what was happening)
[14:03:00] elukey: ooh thanks!
[14:09:58] Good morning all
[14:11:13] o/
[14:16:58] \o
[14:48:50] heyo chris
[15:17:21] klausman: for CapEx I was thinking of buying a minimum of 4 ml-serve worker nodes for each DC
[15:17:33] not sure if we'd need to go up to 8
[15:18:16] Yes, four sounds like a good number. I don't see us needing eight all that soon (... he said, not knowing how wrong he was :))
[15:18:48] I have the same doubt, but we have to plan for a year, and it's very difficult to figure out what will be needed
[15:31:49] Well, our best guesstimate will have to do.
[15:53:47] very interesting link: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-sharing.html#comparison-time-slicing-and-multi-instance-gpu
[15:54:07] so IIUC nvidia offers two ways of multiplexing a GPU across multiple containers
[15:54:25] 1) MIG, namely splitting the GPU into sub-GPUs, with safe memory boundaries etc.. (max 7 IIUC)
[15:55:02] 2) Time slicing (see https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-sharing.html#understanding-time-slicing-gpus), way less safe but you can potentially scale up beyond the 7 sub-GPUs
[15:55:10] or a combination of both
[15:55:23] MIG seems to require certain Nvidia GPUs (like the A100) that cost a fortune
[15:55:26] Have you spotted any mention of AMD/ROCm?
[15:56:19] not really, I found these concepts only in nvidia docs
[15:56:50] Hrm. Well, at least it's something that is in k8s and not the driver. Vaguely more likely to be available for AMD as well (eventually)
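One way to see how a given multiplexing choice (plain device plugin, MIG sub-GPUs, or time-sliced replicas) surfaces in k8s is to look at what each node advertises as allocatable resources, since the resource names and counts change with the plugin configuration. A minimal sketch with the kubernetes Python client, assuming the usual amd.com/ and nvidia.com/ resource-name prefixes:

    from kubernetes import client, config

    def list_gpu_resources():
        config.load_kube_config()
        for node in client.CoreV1Api().list_node().items:
            allocatable = node.status.allocatable or {}
            # device-plugin-provided resources show up alongside cpu/memory;
            # the prefixes below are the usual ones, treat them as assumptions
            gpus = {name: qty for name, qty in allocatable.items()
                    if name.startswith(("amd.com/", "nvidia.com/"))}
            if gpus:
                print(node.metadata.name, gpus)

    if __name__ == "__main__":
        list_gpu_resources()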
[15:56:56] there is also https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
[15:57:50] yes, time slicing should be something that AMD could add to their k8s device plugin, not sure how hard it is
[15:58:06] and also not sure how well it performs
[15:58:55] the nvidia plugin has some requirements that make me a little hesitant - https://github.com/NVIDIA/k8s-device-plugin#prerequisites
[16:00:48] https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing also explains a little
[16:01:02] it mentions CUDA time slicing, plus:
[16:01:10] "However, nothing special is done to isolate workloads that are granted replicas from the same underlying GPU, and each workload has access to the GPU memory and runs in the same fault-domain as of all the others (meaning if one workload crashes, they all do)."
[16:01:23] I did some research over the weekend and haven't found a single cloud GPU provider (Paperspace, LambdaLabs, etc.) that uses AMD GPUs
[16:01:45] Hrm. I wouldn't so much mind pods seeing each other's memory, but "one crash and they all go" is very annoying
[16:02:33] yes definitely, very brittle
[16:04:41] I think we need to order one NVIDIA GPU right at the start of the next FY and test the multiplexing
[16:05:05] then we can make an informed decision about what we order later in the year
[16:20:59] sure, but the work to add NVIDIA support is really big
[16:21:07] at least in our prod infrastructure
[16:21:26] or one of us could get the GPU and test it somehow
[16:22:04] my only worry is that we'll import nvidia configs and drivers + k8s configs etc.. (at least a quarter of work, if not more)
[16:22:11] to then just drop it
[16:22:17] because it was only good on paper
[16:22:27] (rather than focusing on other projects)
[16:22:57] for example, not really sure how to approach the import of non-open-source / binary-only software into our infra
[16:23:12] (multiple angles, SRE/community/etc..)
[16:30:15] https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/program/schedule/
[16:30:22] known faces in the talks :)
[16:55:22] The potentially "wasted" effort on NV GPUs worries me as well. Naturally, there is a similar-but-smaller risk on the AMD side (we pour work into getting it to work but it ultimately fails), but I think putting in the work there is more justified, as it's the more acceptable approach from an OSS/Open Knowledge point of view.
[17:03:47] going afk folks, have a nice rest of the day :)
[17:03:55] \o
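For the time-slicing case discussed above, a hedged sketch of how over-subscription would look from the workload side: assuming the NVIDIA device plugin were configured to advertise, say, 4 replicas of nvidia.com/gpu on a node with a single physical card, a deployment like the one below would get all four pods scheduled onto that one GPU; per the README caveat quoted earlier, they would share memory and fault domain. Names, image, and namespace are placeholders:

    from kubernetes import client, config

    def create_time_sliced_deployment():
        config.load_kube_config()
        labels = {"app": "gpu-shared-demo"}  # placeholder label
        deployment = client.V1Deployment(
            metadata=client.V1ObjectMeta(name="gpu-shared-demo"),
            spec=client.V1DeploymentSpec(
                # 4 replicas, each asking for one "GPU": with time-slicing enabled
                # they can all land on a single physical card; without it, only as
                # many pods as there are physical GPUs would ever be scheduled
                replicas=4,
                selector=client.V1LabelSelector(match_labels=labels),
                template=client.V1PodTemplateSpec(
                    metadata=client.V1ObjectMeta(labels=labels),
                    spec=client.V1PodSpec(
                        containers=[
                            client.V1Container(
                                name="worker",
                                image="nvidia/cuda:12.1.0-base-ubuntu22.04",  # placeholder image
                                command=["sleep", "infinity"],
                                resources=client.V1ResourceRequirements(
                                    limits={"nvidia.com/gpu": "1"},
                                ),
                            )
                        ],
                    ),
                ),
            ),
        )
        client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)

    if __name__ == "__main__":
        create_time_sliced_deployment()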