[11:26:30] hello folks, I am reading https://github.com/RadeonOpenCompute/k8s-device-plugin since after the k8s 1.23 migration we may be able to test it
[11:26:48] the "--allow-privileged=true" part is not great afaics
[11:27:52] the other main question mark is whether more than one pod is able to use the GPU at the same time
[11:28:11] it seems as if the GPU is attached to one pod at a time and that's it
[11:29:00] Nvidia has https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html which seems a little more advanced, but it is Nvidia
[11:29:57] https://github.com/NVIDIA/k8s-device-plugin is their repo
[11:30:02] (TIL - nvidia-docker)
[11:30:41] https://github.com/NVIDIA/k8s-device-plugin#configure-docker !!
[11:32:01] what a mess
[11:59:14] NVIDIA definitely allows sharing a GPU across multiple containers. See https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi
[11:59:33] I don't know about AMD though
[12:01:12] but the multi-instance GPU partitioning thing from NVIDIA appears interesting. The fact that it appears largely transparent to the workloads (if they are CUDA) is pretty nice. But it's NVIDIA... :-(
[12:02:52] thanks for the link!
[13:24:26] akosiaris: thx for the +1 on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/886329 next step is to figure out which approach we want to move forward with between that one and the one you experimented with
[13:25:32] as there is no need (at least for now) to add the community if we don't do any router-side action on it
[14:18:07] yeah, we need to dig a bit deeper into it and figure out how bad that unpredictability (for us) of those pod routes really is
[14:26:44] akosiaris: without BFD it looks sketchy, as maintenance on one node could impact a different one
[14:27:02] and we are back on the BFD path...
[14:27:04] lol
[14:27:24] it was nice closing that task the other day, maybe we should re-open it
[14:27:49] why would maint on one node impact a different node though?
[15:14:24] akosiaris: both node1 and node2 advertise node2's pod range, so if the router picks node1 (which we can't easily control) to reach node2's pods, any work on node1 will impact node2's traffic
[15:14:49] it would also mean sub-optimal flows, as node1 and node2 don't even have to be in the same row
[15:15:46] they do have to be in the same row, but point taken
[15:15:55] it's graceful-restart all over again
[15:16:16] I liked the quote "looks like it was designed by cowboys" btw
[15:16:52] Maybe I should revive that BFD PR and see if I can even remotely make it happen upstream
[15:22:53] ah right, yeah
[15:23:51] akosiaris: it would be nice to have BFD, but now, comparing the two approaches, it might be easier to just have the CRs redistribute the same row prefixes
[17:54:39] XioNoX: re: netflow, I like those visualizations!
[17:55:33] I experimented briefly with linkerd locally and it had a really nice dashboard as well. istio not so much. it's... adequate
[17:59:22] the advantage of netflow is that we could integrate it with all our other servers
[18:02:55] install ipt-netflow not only on k8s servers but on the whole fleet; we already have a pipeline ready with https://phabricator.wikimedia.org/T263277 so the last step would be to add metadata from the k8s API
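
(Editor's sketch, not from the chat: how a pod would consume a GPU through one of the device plugins discussed around 11:26-11:59. The pod name, namespace, and image are placeholders; the extended resource names "amd.com/gpu" / "nvidia.com/gpu" are what the respective plugins register, and without MIG or time-slicing each unit is one pod's exclusive GPU, which is the sharing limitation raised above.)

```python
# Minimal sketch: request one AMD GPU via the device plugin's extended resource.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),  # placeholder name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="rocm-test",
                image="rocm/rocm-terminal",  # placeholder image
                command=["rocminfo"],
                resources=client.V1ResourceRequirements(
                    # the plugin advertises whole GPUs as an extended resource;
                    # each unit is bound exclusively to the requesting pod
                    limits={"amd.com/gpu": "1"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```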
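
(A toy illustration of the 15:14 routing concern, under assumed behaviour rather than our actual router configuration: if node1 and node2 both advertise node2's pod prefix and the router pins a single best path without BFD, draining node1 also breaks reachability of node2's pods. Prefix and node names are hypothetical.)

```python
# Toy model: one prefix, two advertising next-hops, one pinned best path.
pod_prefix_node2 = "10.64.75.0/24"  # hypothetical pod range hosted on node2

# both nodes advertise node2's pod prefix; the router keeps a single best path
advertisements = {pod_prefix_node2: ["node1", "node2"]}
best_path = {p: sorted(nhs)[0] for p, nhs in advertisements.items()}  # tie-break we can't easily control

def reachable(prefix: str, drained_nodes: set) -> bool:
    # traffic only flows if the chosen next-hop is still forwarding
    return best_path[prefix] not in drained_nodes

print(reachable(pod_prefix_node2, drained_nodes=set()))       # True
print(reachable(pod_prefix_node2, drained_nodes={"node1"}))   # False: node2's pods become unreachable
```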
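
(A hedged sketch of the 18:02 "add metadata from the k8s API" step, not the actual T263277 pipeline: build an IP-to-pod map with the Python kubernetes client and attach it to flow records exported by ipt-netflow. The flow record fields src_addr/dst_addr and the sample IPs are assumptions for illustration.)

```python
# Sketch: enrich netflow records with pod/namespace/node metadata from the k8s API.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# IP -> pod metadata map; a real enricher would keep this fresh with a watch
# instead of a one-shot list.
ip_to_pod = {}
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    if pod.status.pod_ip:
        ip_to_pod[pod.status.pod_ip] = {
            "pod": pod.metadata.name,
            "namespace": pod.metadata.namespace,
            "node": pod.spec.node_name,
        }

def enrich(flow: dict) -> dict:
    """Attach k8s metadata to a flow record (hypothetical src_addr/dst_addr schema)."""
    flow["src_k8s"] = ip_to_pod.get(flow.get("src_addr"))
    flow["dst_k8s"] = ip_to_pod.get(flow.get("dst_addr"))
    return flow

print(enrich({"src_addr": "10.64.75.12", "dst_addr": "10.2.2.1", "bytes": 1234}))
```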