[08:09:30] https://github.com/NVIDIA/k8s-device-plugin#prerequisites
[08:09:33] amusing
[08:09:50] akosiaris: --^
[08:12:57] nvidia-container-runtime ?
[08:14:16] "The following steps need to be executed on all your GPU nodes. This README assumes that the NVIDIA drivers and the nvidia-container-toolkit have been pre-installed. It also assumes that you have configured the nvidia-container-runtime as the default low-level runtime to use."
[08:14:31] wait wtf
[08:14:34] at least they provide a link to an install guide
[08:14:42] I didn't check but I am pretty sure it is all binary only
[08:14:49] I am sure too
[08:14:58] Let me check something
[08:15:06] this quarter I'll test https://github.com/RadeonOpenCompute/k8s-device-plugin on dse, since we have AMD GPUs in there
[08:15:21] the main issue IIUC is that the GPU can be assigned to a single pod only
[08:15:38] nvidia is the only one that has https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing
[08:15:41] huh, yeah, I thought my unraid setup with an nvidia GPU was not using that, but it apparently is
[08:16:10] so I am currently very sad
[08:16:54] I am working with various people to give a heads up, since my team is thinking of using/proposing Nvidia hardware as an exception for Lift Wing, but so far none of the solutions seem really viable
[08:17:29] (also I don't like the nvidia route, but it is an sre/ml team decision in the end)
[08:20:08] it's gonna be pretty painful probably
[08:20:39] I've been running containers with GPU access for a while on my home setup, and concurrent access is a PITA.
[08:20:50] nvidia binary drivers were a mess for a desktop environment, even without needing new features/bug fixes for a production env
[08:34:03] claime: ah wow, this is a good insight, did you use time sharing by any chance?
[08:34:29] I see that gcloud uses something similar: https://cloud.google.com/kubernetes-engine/docs/concepts/timesharing-gpus
[08:34:39] elukey: No, but I once accidentally had two containers trying to use the GPU at the same time
[08:34:45] And I got rewarded with a kernel panic
[08:35:00] ahahahah yes
[08:35:02] lovely
[08:35:16] they warn about exactly the same problem
[08:35:31] "you can access concurrently but.. not our problem if something breaks"
[08:46:03] I had a few different problems
[08:46:19] Sometimes it would kernel panic, sometimes one of the containers, or both, would lock up completely
[08:46:35] Suffice to say, I stopped trying
[08:48:37] I wonder how people handle this on k8s clusters
[08:48:50] maybe I can ask for experiences on the kserve slack channel
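For reference, the CUDA time-slicing linked above is enabled via a config file/ConfigMap passed to the NVIDIA device plugin. A minimal sketch of such a config, assuming the default nvidia.com/gpu resource name and an arbitrary replica count of 4:

```yaml
# Sketch of a time-slicing config for the NVIDIA k8s-device-plugin
# (per the shared-access section linked above).
# The replica count of 4 is an arbitrary example, not a recommendation.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources
```

With something like this in place, up to 4 pods can land on the same physical GPU, but as the conversation above notes, there is no memory or fault isolation between them.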
[13:12:12] jayme: o/ I am reading https://istio.io/latest/docs/setup/upgrade/in-place/, and afaics in our case it will be sufficient to run istioctl upgrade -f etc.. for our clusters
[13:12:26] there is also the nice `istioctl x precheck`
[13:15:51] yeah, that was my understanding as well (after patching the yaml ofc)
[13:18:21] yes yes
[13:21:36] ack, last code change is in https://gerrit.wikimedia.org/r/c/operations/debs/istio/+/906571 (for istio-cni basically)
[13:21:45] then I can build all and test on ml-staging-codfw
[13:33:44] cool, +1 :)
[14:17:25] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/906590
[14:17:33] ready to test on ml-staging-codfw :)
[14:17:46] the rest can probably be done next week
[14:18:31] I can do only ml-staging-codfw, if we think that some people could use istioctl over the weekend
[14:21:34] thanks :)
[14:26:00] jayme: ml-staging-codfw upgrade done, the istioctl upgrade thingy worked nicely
[14:26:45] nice
[14:27:01] I'll do staging-codfw as well
[14:29:14] do you think that I'd need to restart the kubelets after upgrading istio-cni on those?
[14:29:46] no, the cni binaries are not long running
[14:30:13] ack yes
[14:35:29] elukey: there is now an ingressgateway running on the control-plane. Looks like *something* changed
[14:36:58] jayme: what do you mean?
[14:37:07] inside istiod?
[14:37:32] no, probably in the k8s manifests that get created
[14:37:43] ah okok
[14:37:49] the ingressgateway daemonset ignoring the master taint
[14:38:03] ah, the k8s control plane, ok
[14:38:08] now it is clearer, sorry :D
[14:38:16] oops, yes. Sorry
[14:38:55] on ml-staging-codfw both istiod and the ingress gw are deployed only on the workers
[14:40:20] are you running the ingress gw as a daemonset there as well?
[14:40:50] yeah, I use the same config.yaml
[14:41:52] hmmm...no, I think you don't :-D
[14:43:11] root@deploy2002:~# kubectl get daemonset -n istio-system
[14:43:11] NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
[14:43:14] istio-ingressgateway 2 2 2 2 2 45d
[14:43:26] why not? :)
[14:43:56] because the config.yaml differs big time...at least I recall it does
[14:44:29] between staging and prod? It used to differ when we were upgrading to 1.23
[14:44:43] but then it got folded into a single one
[14:44:46] no, between wikikube and ml I mean
[14:45:25] you can check our configs, for that part they are the same
[14:45:37] hmm...okay
[14:45:58] I believe you for now :-p
[14:45:58] you have a `path: spec.template.spec.tolerations` that I don't, though
[14:46:12] this is the diff afaics
[14:46:24] yeah, I like it when you trust me like this :D
[14:48:16] afaics all metrics are showing up correctly
[14:48:35] (one thing that I didn't check for the 1.15.3 upgrade and that I discovered the hard way)
[14:51:42] hm...with the toleration it makes sense that it is scheduled on the control-planes as well...but it should have been already with 1.15.3-2
[14:52:42] maybe it was my bad and I deployed a not-really-ready version
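Regarding the `path: spec.template.spec.tolerations` diff discussed above: in an IstioOperator config.yaml that kind of difference is expressed as a k8s overlay on the ingressgateway component. A rough sketch of what such an overlay could look like (the taint key and effect below are assumptions, not taken from the actual wikikube config):

```yaml
# Sketch of an IstioOperator overlay that lets the istio-ingressgateway
# DaemonSet tolerate the control-plane taint. The taint key/effect values
# are assumptions; the real config.yaml may use different values.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true
        k8s:
          overlays:
            - apiVersion: apps/v1
              kind: DaemonSet
              name: istio-ingressgateway
              patches:
                - path: spec.template.spec.tolerations
                  value:
                    - key: node-role.kubernetes.io/master
                      operator: Exists
                      effect: NoSchedule
```

A toleration like this is what would let the DaemonSet pods ignore the control-plane taint, which matches the scheduling difference observed between the two clusters above.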