[07:39:48] jelto: me and dcaro. We are traveling today
[07:40:07] RE: kubecon
[09:50:17] hi folks
[09:50:56] qq - I see that docker-pkg supports known_uid_mappings, but IIUC there is nothing for gids. Would that be an OK feature to add, or is there anything against it?
[09:52:07] I am asking since I am testing the AMD GPU plugin, which attaches a /dev/kfd device to the containers to access the GPUs, usually with perms root:render
[09:52:28] so for example I'd need the render group to have a fixed gid, to add users to it etc. (like in my test case, nobody)
[09:53:26] (still testing it so it may not work, but in case I'll also file a change for docker-pkg)
[09:59:34] elukey: the reason why we have the uid mappings is that k8s won't understand non-numeric users
[10:00:10] you should be able to get away with just adding the group with an explicit group id if this is just a one-off
[10:00:17] else we can look at docker-pkg
[10:00:58] joe: ack, I am trying with an explicit groupadd render + usermod -a -G render nobody to see how it goes
[10:01:59] (of course AMD's images all run as root to bypass the issue)
[10:03:18] joe: ah btw, not sure if you saw this beauty - https://github.com/NVIDIA/k8s-device-plugin#prerequisites
[10:03:28] I posted it while you were away, really amusing
[10:03:42] I mean
[10:03:47] nvidia, linux and k8s
[10:04:00] just add a jvm daemon for fun
[10:04:23] "nvidia-container-runtime configured as the default low-level runtime"
[10:04:36] I swear I didn't see this before making the joke
[10:05:01] the main problem is that nvidia's device plugin for k8s can share a single GPU across multiple containers (time sharing), while AMD's only allows attaching a GPU to a single container, which keeps it until it is deleted :(
[10:05:45] I am really against nvidia in general, but we (as ML) are planning to buy GPUs and currently nvidia is superior on paper
[10:06:11] the list of prerequisites scares me a lot, especially because those are all binary-only things
[10:06:37] anyway, I am open to brainstorms and suggestions at the moment
[10:06:49] I think the problematic part is indeed introducing a proprietary dependency
[10:08:24] also we are planning to move away from docker, no idea how that would play with the nvidia-* prerequisites lock-in
[13:58:38] If dealing with nvidia GPUs in k8s is anything like dealing with nvidia GPUs on a Linux laptop, I would encourage us to explore the AMD or Intel options first.
[14:22:56] intel really isn't an option
[14:25:40] I realize this is not a production use case, but I've been using an nvidia GPU with nvidia-container-runtime for a while and it's not that bad (except the proprietary blob stuff), but I haven't tried slicing the card (I don't think my GPU supports it, actually)
[14:28:17] claime: if you have time during the next weeks to test the time slicing (in theory a lot of cards support it, it is a CUDA/software thing)
[14:28:24] I'd be super interested
[14:28:30] even a quick test etc.
[14:28:50] or we could set up some time to work on it if you are willing to :)
[14:30:32] It's not a k8s setup though, so idk how slicing works on plain docker
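For the time-slicing experiment above: nvidia's k8s-device-plugin takes a sharing config (covered in the README linked at 10:03). Below is a sketch of its shape, built as a plain dict and dumped with PyYAML; the field names follow the plugin docs but may vary between plugin versions.

```python
# Sketch of the nvidia k8s-device-plugin time-slicing config, based on the
# plugin's documentation; exact fields may differ between plugin versions.
# With replicas=4, one physical GPU is advertised as four schedulable
# nvidia.com/gpu resources, time-shared across containers.
import yaml

time_slicing_config = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [
                {"name": "nvidia.com/gpu", "replicas": 4},
            ],
        },
    },
}

print(yaml.safe_dump(time_slicing_config, sort_keys=False))
```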
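By contrast, AMD's device plugin only hands out whole GPUs. A rough sketch with the kubernetes Python client of what requesting one looks like; the image name and namespace are placeholders.

```python
# Sketch, assuming AMD's k8s device plugin is deployed: it advertises the
# amd.com/gpu extended resource and takes care of exposing /dev/kfd and
# /dev/dri to the container. Image name and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rocm-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="rocm-smoke-test",
                image="example-registry/rocm-base:latest",  # placeholder image
                # One whole GPU; it stays attached until the pod is deleted,
                # unlike nvidia's time-sliced sharing.
                resources=client.V1ResourceRequirements(limits={"amd.com/gpu": "1"}),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Once the plugin DaemonSet is running, amd.com/gpu should also show up in the node's allocatable resources.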
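And back on the earlier gid question: a hypothetical sketch of how a gid table symmetric to docker-pkg's known_uid_mappings could be used when emitting Dockerfile lines. The mapping name, the gid value, and the helper are all invented for illustration and are not docker-pkg's actual API.

```python
# Hypothetical sketch, NOT docker-pkg's real API: pin a symbolic group to a
# fixed gid the way known_uid_mappings pins users, so that device perms such
# as root:render on /dev/kfd stay numeric and stable across image rebuilds.
KNOWN_GID_MAPPINGS = {"render": 109}  # gid value invented for illustration


def groupadd_fragment(group: str, user: str) -> str:
    """Emit a Dockerfile RUN fragment that creates `group` with a pinned gid
    and appends `user` to it (usermod -a -G takes the group, then the user)."""
    gid = KNOWN_GID_MAPPINGS[group]
    return f"RUN groupadd -g {gid} {group} && usermod -a -G {group} {user}"


print(groupadd_fragment("render", "nobody"))
# -> RUN groupadd -g 109 render && usermod -a -G render nobody
```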
[14:32:39] joe: is that because their cards are not performant?
[14:34:13] jhathaway: last I checked there was no support in TF
[14:34:31] maybe that changed
[14:34:35] ah, TF == TensorFlow I presume?
[14:34:40] yep
[14:38:35] I don't know how well it works, but there does seem to be active development on a plugin, https://github.com/intel/intel-extension-for-tensorflow
[14:39:57] https://blog.tensorflow.org/2021/06/pluggabledevice-device-plugins-for-TensorFlow.html
[14:42:10] heh yes, 2021 qualifies as "after I last checked"
[14:42:41] :)
[14:42:55] very interesting, I didn't know about the plugin
[14:48:57] ROCm/AMD does it in a different way: https://pypi.org/project/tensorflow-rocm/
[14:49:30] so they forked TensorFlow to be able to add all the AMD libraries into it; not sure if they are thinking about using a plugin etc.
[14:50:01] but the downside is that if people want to use PyTorch, we'd be out of the game with Intel (IIRC there was no option for it)
[14:50:57] same thing for other frameworks etc.
[14:51:34] AMD has a nice library to "translate" CUDA to their stack, which is promising, but sadly every project starts with nvidia first, then AMD only if really needed/asked
[14:52:09] I see this, https://github.com/intel/intel-extension-for-pytorch, but I don't know how it compares to the AMD support
[14:53:18] I'd be really happy if this became a new standard
[14:53:32] Well, that's what having a sub-10% market share across the board gets you, unfortunately
[14:53:41] (for discrete GPUs)
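For reference, the PluggableDevice route means the Intel plugin hooks into stock TensorFlow rather than forking it. A sketch of what that should look like from user code, assuming intel-extension-for-tensorflow is installed; the plugin registers an "XPU" device type.

```python
# Sketch, assuming `pip install intel-extension-for-tensorflow[xpu]`: the
# plugin registers an "XPU" PluggableDevice with stock TensorFlow, so the
# normal device APIs apply and no TF fork is required.
import tensorflow as tf

# An XPU PhysicalDevice should appear next to CPU (and any CUDA GPUs).
print(tf.config.list_physical_devices())

# Placement works like any other device; supported ops dispatch to the plugin.
with tf.device("/XPU:0"):
    a = tf.random.uniform((1024, 1024))
    b = tf.random.uniform((1024, 1024))
    c = tf.matmul(a, b)
print(c.device)
```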
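The tensorflow-rocm fork, by contrast, keeps the stock API and ships the AMD kernels in-tree, so only the installed wheel differs:

```python
# Sketch, assuming `pip install tensorflow-rocm`: the fork bundles the
# AMD/ROCm kernels, so ROCm devices show up as ordinary 'GPU' devices and
# the import line is unchanged.
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))
```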
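And on the PyTorch side of the Intel question, a sketch assuming intel-extension-for-pytorch is installed on a machine with an Intel GPU; importing the extension registers an 'xpu' device type and the torch.xpu namespace.

```python
# Sketch, assuming `pip install intel-extension-for-pytorch` with an Intel
# GPU present: the side-effect import registers the 'xpu' device type,
# analogous to 'cuda'.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (side-effect import)

x = torch.randn(8, 8)
if torch.xpu.is_available():  # namespace added by the extension
    x = x.to("xpu")
print(x.device)
```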