[07:39:48] jelto: me and dcaro. We are traveling today
[07:40:07] RE: kubecon
[09:50:17] hi folks
[09:50:56] qq - I see that docker-pkg supports known_uid_mappings, but IIUC there is nothing for gids. Would that be an OK feature to add, or is there anything against it?
[09:52:07] I am asking since I am testing the AMD GPU plugin, which attaches a /dev/kfd device to the containers to access the GPUs, usually with perms root:render
[09:52:28] so for example I'd need the render group to have a fixed gid, to add users to it etc. (like in my test case, nobody)
[09:53:26] (still testing it so it may not work, but in case I'll also file a change for docker-pkg)
[09:59:34] elukey: the reason why we have the uid mappings is that k8s won't understand non-numeric users
[10:00:10] you should be able to get away with just adding the group with an explicit group id if this is just a one-off
[10:00:17] else we can look at docker-pkg
[10:00:58] joe: ack, I am trying with an explicit groupadd render + usermod -a -G render nobody to see how it goes
[10:01:59] (of course AMD's images all run as root to bypass the issue)
[10:03:18] joe: ah btw, not sure if you saw this beauty - https://github.com/NVIDIA/k8s-device-plugin#prerequisites
[10:03:28] I posted it while you were away, really amusing
[10:03:42] I mean
[10:03:47] nvidia, linux and k8s
[10:04:00] just add a jvm daemon for fun
[10:04:23] "nvidia-container-runtime configured as the default low-level runtime"
[10:04:36] I swear I didn't see this before making the joke
[10:05:01] the main problem is that nvidia's device plugin for k8s can share a single GPU across multiple containers (time sharing), while AMD's only allows attaching a GPU to a single container, which keeps it until it is deleted :(
[10:05:45] I am really against nvidia in general, but we (as ML) are planning to buy GPUs and currently nvidia is superior on paper
[10:06:11] the list of prerequisites scares me a lot, especially because those are all binary-only things
[10:06:37] anyway, I am open to brainstorms and suggestions at the moment
[10:06:49] I think the problematic part is indeed introducing a proprietary dependency
[10:08:24] also we are planning to move away from docker, no idea how that would play with the nvidia-* prerequisites lock-in
[13:58:38] If dealing with nvidia GPUs in k8s is anything like dealing with nvidia GPUs on a Linux laptop, I would encourage us to explore the AMD or Intel options first.
[14:22:56] intel really isn't an option
[14:25:40] I realize this is not a production use case, but I've been using an nvidia GPU with nvidia-container-runtime for a while and it's not that bad (except the proprietary blob stuff), but I haven't tried slicing the card (I don't think my GPU supports it, actually)
[14:28:17] claime: if you have time during the next weeks to test the time slicing (in theory a lot of cards support it, it is a CUDA/software thing)
[14:28:24] I'd be super interested
[14:28:30] even a quick test etc.
[14:28:50] or we could set up some time to work on it if you are willing to :)
[14:30:32] It's not a k8s setup though, so idk how slicing works on plain docker
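For the time-slicing experiment above: nvidia's k8s-device-plugin takes a sharing config (covered in the README linked at 10:03). Below is a sketch of its shape, built as a plain dict and dumped with PyYAML; the field names follow the plugin docs but may vary between plugin versions.

```python
# Sketch of the nvidia k8s-device-plugin time-slicing config, based on the
# plugin's documentation; exact fields may differ between plugin versions.
# With replicas=4, one physical GPU is advertised as four schedulable
# nvidia.com/gpu resources, time-shared across containers.
import yaml

time_slicing_config = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [
                {"name": "nvidia.com/gpu", "replicas": 4},
            ],
        },
    },
}

print(yaml.safe_dump(time_slicing_config, sort_keys=False))
```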
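By contrast, AMD's device plugin only hands out whole GPUs. A rough sketch with the kubernetes Python client of what requesting one looks like; the image name and namespace are placeholders.

```python
# Sketch, assuming AMD's k8s device plugin is deployed: it advertises the
# amd.com/gpu extended resource and takes care of exposing /dev/kfd and
# /dev/dri to the container. Image name and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rocm-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="rocm-smoke-test",
                image="example-registry/rocm-base:latest",  # placeholder image
                # One whole GPU; it stays attached until the pod is deleted,
                # unlike nvidia's time-sliced sharing.
                resources=client.V1ResourceRequirements(limits={"amd.com/gpu": "1"}),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Once the plugin DaemonSet is running, amd.com/gpu should also show up in the node's allocatable resources.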
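And back on the earlier gid question: a hypothetical sketch of how a gid table symmetric to docker-pkg's known_uid_mappings could be used when emitting Dockerfile lines. The mapping name, the gid value, and the helper are all invented for illustration and are not docker-pkg's actual API.

```python
# Hypothetical sketch, NOT docker-pkg's real API: pin a symbolic group to a
# fixed gid the way known_uid_mappings pins users, so that device perms such
# as root:render on /dev/kfd stay numeric and stable across image rebuilds.
KNOWN_GID_MAPPINGS = {"render": 109}  # gid value invented for illustration


def groupadd_fragment(group: str, user: str) -> str:
    """Emit a Dockerfile RUN fragment that creates `group` with a pinned gid
    and appends `user` to it (usermod -a -G takes the group, then the user)."""
    gid = KNOWN_GID_MAPPINGS[group]
    return f"RUN groupadd -g {gid} {group} && usermod -a -G {group} {user}"


print(groupadd_fragment("render", "nobody"))
# -> RUN groupadd -g 109 render && usermod -a -G render nobody
```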
[14:32:39] joe: is that because their cards are not performant?
[14:34:13] jhathaway: last I checked there was no support in TF
[14:34:31] maybe that changed
[14:34:35] ah, TF == TensorFlow I presume?
[14:34:40] yep
[14:38:35] I don't know how well it works, but there does seem to be active development on a plugin, https://github.com/intel/intel-extension-for-tensorflow
[14:39:57] https://blog.tensorflow.org/2021/06/pluggabledevice-device-plugins-for-TensorFlow.html
[14:42:10] heh yes, 2021 qualifies as "after I last checked"
[14:42:41] :)
[14:42:55] very interesting, I didn't know about the plugin
[14:48:57] ROCm/AMD does it in a different way: https://pypi.org/project/tensorflow-rocm/
[14:49:30] so they forked TensorFlow to be able to add all the AMD libraries into it; not sure if they are thinking about using a plugin etc.
[14:50:01] but the downside is that if people want to use PyTorch, we'd be out of the game with Intel (IIRC there was no option for it)
[14:50:57] same thing for other frameworks etc.
[14:51:34] AMD has a nice library to "translate" CUDA to their stack, which is promising, but sadly every project starts with nvidia first, then AMD only if really needed/asked
[14:52:09] I see this, https://github.com/intel/intel-extension-for-pytorch, but I don't know how it compares to the AMD support
[14:53:18] I'd be really happy if this became a new standard
[14:53:32] Well, that's what having a sub-10% market share across the board gets you, unfortunately
[14:53:41] (for discrete GPUs)
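For reference, the PluggableDevice route means the Intel plugin hooks into stock TensorFlow rather than forking it. A sketch of what that should look like from user code, assuming intel-extension-for-tensorflow is installed; the plugin registers an "XPU" device type.

```python
# Sketch, assuming `pip install intel-extension-for-tensorflow[xpu]`: the
# plugin registers an "XPU" PluggableDevice with stock TensorFlow, so the
# normal device APIs apply and no TF fork is required.
import tensorflow as tf

# An XPU PhysicalDevice should appear next to CPU (and any CUDA GPUs).
print(tf.config.list_physical_devices())

# Placement works like any other device; supported ops dispatch to the plugin.
with tf.device("/XPU:0"):
    a = tf.random.uniform((1024, 1024))
    b = tf.random.uniform((1024, 1024))
    c = tf.matmul(a, b)
print(c.device)
```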
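The tensorflow-rocm fork, by contrast, keeps the stock API and ships the AMD kernels in-tree, so only the installed wheel differs:

```python
# Sketch, assuming `pip install tensorflow-rocm`: the fork bundles the
# AMD/ROCm kernels, so ROCm devices show up as ordinary 'GPU' devices and
# the import line is unchanged.
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))
```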
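And on the PyTorch side of the Intel question, a sketch assuming intel-extension-for-pytorch is installed on a machine with an Intel GPU; importing the extension registers an 'xpu' device type and the torch.xpu namespace.

```python
# Sketch, assuming `pip install intel-extension-for-pytorch` with an Intel
# GPU present: the side-effect import registers the 'xpu' device type,
# analogous to 'cuda'.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (side-effect import)

x = torch.randn(8, 8)
if torch.xpu.is_available():  # namespace added by the extension
    x = x.to("xpu")
print(x.device)
```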