[06:30:54] Machine-Learning-Team, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (elukey) This is what I have used to test the GPU: ` apiVersion: v1 kind: Pod metadata: name: alexnet-tf-gpu-pod labels: purpose: demo-tf-amdgpu spec:...
[07:05:00] Hi folks! I have some errands to do this morning, I'll be spotty-connected, if you need me ping and I'll answer asap :)
[07:19:50] Good morning!
[08:20:55] \o
[08:21:27] elukey: btw, after the GPU swap, ml-serve1001 booted up with ferm not coming up correctly, so I (re)started it and the alerts cleared
[08:40:33] (CR) Ilias Sarantopoulos: feat: reduce llm memory footprint (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/926507 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[08:45:11] klausman: ack nice!
[08:45:21] sometimes it happens due to failures in dns resolution IIRC
[08:45:58] yeah, if ferm can't resolve the names in the rules file it just dies. A bug, IMO, maybe I'll look into sending a patch upstream.
[08:46:18] (Though I am much more of an nftables guy these days)
[08:48:35] klausman: there is a chance that the bug is already fixed in bookworm or newer versions of ferm
[08:49:37] Also possible. Though I suspect it's not specifically a bug in ferm itself. iptables can be given DNS names, and will resolve them itself at rule-creation time. I suspect that fails, and ferm just treats that failure in a generic way. I dunno if it even can be distinguished from other iptables failures.
[08:50:32] The right thing™ might be to change the startup requirements for ferm. There is a net.online-like target in systemd that only completes after systemd is of the opinion that the machine is online properly (e.g. DHCP has completed).
[08:50:46] But that's something to research for future-me :)
[08:52:38] you can also ask Moritz if the bug is already raised, check debian bugs etc.
[08:52:45] anyway, back to errand mode, bbl!
[08:52:48] yeah, of course, I'd do that first
[08:52:52] ttyl
[09:00:15] (CR) Kevin Bazira: [C: +1] feat: reduce llm memory footprint (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/926507 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[10:11:48] (CR) Ilias Sarantopoulos: [C: +2] feat: reduce llm memory footprint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/926507 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[10:17:18] (Merged) jenkins-bot: feat: reduce llm memory footprint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/926507 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[10:35:27] Machine-Learning-Team, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (isarantopoulos) This is great! Launching the experimental namespace is probably the best/easiest thing to do. Will try to use the amd gpu asap
[11:03:08] elukey: added the rate limit class addition to deployment for today (in ~2h)
[11:18:33] scratch that, I am not sure where my change would go in the various deployment windows
[11:29:14] scratch the scratch, it's on :)
[12:42:51] elukey: 250k qph rate limit for class wme is deployed and confirmed working
[12:43:53] taking a break and going for an eagle stomp, bbiab
[12:58:15] klausman: ack nice!
[12:58:59] klausman: can you update the slack thread with WME and the wikitech docs about the various classes?
[12:59:30] what we have stated in wikitech and api.wikimedia.org is not really the truth (about 200k req/hour)
[13:00:57] isaranto: shall we test the gpus with bloom-3b??
[13:01:46] elukey: yep that's what I thought -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/927620
[13:02:43] when I do helmfile template locally I don't see resources anywhere in the InferenceService produced..
[13:03:47] I'm not talking only about gpu but also about cpu and memory. however the deployed ones we have work fine. Perhaps I am missing something...
[13:04:07] what command do you use?
[13:04:14] in the template we have
[13:04:15] resources:
[13:04:16] {{- toYaml $resources_config | nindent 8 }}
[13:04:20] so it should work
[13:04:54] yeah I saw that..
[13:04:56] `helmfile -e ml-serve-codfw template`
[13:06:16] and if you try with helm template -f etc... ?
[13:06:24] I usually test with that one
[13:08:03] anyway, I applied the change manually on the current bloom-3b isvc in ml-serve-eqiad
[13:08:09] so we can see how it goes before merging changes
[13:08:20] works great :)
[13:08:30] `helm template ../../../charts/kserve-inference -f values.yaml`
[13:09:52] I thought we should add 2 deployments - one with gpu and one without so we can easily test side by side
[13:10:07] definitely eys
[13:10:10] *yes
[13:10:27] is the gpu toleration added in the node?
[13:10:37] what do you mean?
[13:11:02] w8
[13:12:33] hm perhaps I was confused. I think I saw sth but it was for autoscaling
[13:13:53] ok found it, nothing to do with autoscaling, it was about node scheduling https://kserve.github.io/website/0.8/modelserving/nodescheduling/inferenceservicenodescheduling/#node-selector
[13:14:24] we don't have labels on gpu nodes yet
[13:14:36] ack
[13:14:43] seems like the gpu pod started!
[13:15:30] but it doesn't work, can't see logs etc..
[13:15:33] very weird
[13:16:30] mmm
[13:16:31] Warning Evicted 93s kubelet The node was low on resource: ephemeral-storage. Container kserve-container was using 84Ki, which exceeds its request of 0
[13:17:12] hm
[13:17:23] lol I was gonna say it is working but I sent a request to staging
[13:25:59] I got a response from the pod but then it crashed
[13:31:07] it is weird since there seems to be an issue with limit ranges and ephemeral storage
[13:31:33] The node was low on resource: ephemeral-storage. Container istio-proxy was using 10828Ki, which exceeds its request of 0. Container queue-proxy was using 48Ki, which exceeds its request of 0. Container kserve-container was using 84Ki, which exceeds its request of 0.
[13:31:45] never seen this before
[13:41:11] aiko: o/ changeprop deployed in staging!
[13:46:30] elukey: thanks!! :)
[13:49:05] elukey: wiki and slack done re: ratelimits
[13:50:04] klausman: api.wikimedia.org too?
[13:50:21] which page there?
[13:50:49] https://api.wikimedia.org/wiki/API_reference/Service/Lift_Wing -> Authenticated Requests
[13:50:58] Ah, no. Fixing
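A minimal sketch of the direction discussed above (13:16-13:31): when the node runs low on ephemeral-storage, the kubelet evicts containers that exceed their request first, and a request of 0 makes the isvc pod an immediate candidate; scheduling onto the GPU likewise needs the GPU exposed as a resource. The fragment below shows what the predictor container's rendered resources could look like; the amd.com/gpu resource name (it depends on the device plugin installed on the GPU nodes), the key layout and all numbers are illustrative assumptions, not the actual chart output.

```yaml
# Hypothetical rendered resources for the kserve-container predictor; the
# values-file keys that produce this in the kserve-inference chart may differ.
resources:
  requests:
    cpu: "4"
    memory: 16Gi
    ephemeral-storage: 10Gi   # non-zero request, so the pod is no longer the
                              # first eviction candidate when the node runs low
    amd.com/gpu: "1"          # resource name advertised by the AMD device plugin
  limits:
    cpu: "4"
    memory: 16Gi
    ephemeral-storage: 10Gi
    amd.com/gpu: "1"          # extended resources need requests == limits
```

The 13:31 event also lists istio-proxy and queue-proxy exceeding a request of 0; their requests typically come from the mesh and Knative configuration rather than from the InferenceService spec, so they may need a separate change.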
[13:52:44] klausman: what wikitech page did you update? I checked https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage#Authentication but it shows the old procedure
[13:52:52] I mean without the new stuff
[13:53:51] isaranto: how big is bloom-3b?
[13:53:55] in size I mean
[13:54:14] because I tried to set 10G as ephemeral storage and I got OSError: [Errno 28] No space left on device
[13:54:17] :D
[13:54:37] https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_assign_a_client_to_rate_limit_tier
[13:54:42] I'll fix the LW page too
[13:54:46] super
[13:55:22] elukey: approx 12GB
[13:55:56] elukey: on the LW page I went with 200k instead of 250k, since our config is set to the lower value
[13:56:09] "our config" = "our part of the API GW config"
[13:56:11] +1
[13:57:04] isaranto: I have no idea why I have to set them now and not before, I guess that there is a horror of logs or similar
[13:59:33] same here
[14:00:03] also I remembered there are 2 things I need to change in python code for gpu usage
[14:00:18] ok now I see logs
[14:00:19] actually best thing is to check for gpu and if it doesn't exist go for cpu
[14:06:42] Machine-Learning-Team, API Platform: Investigate ad-hoc traffic class for API GW rate limits applied to Inference services as used by WME - https://phabricator.wikimedia.org/T338121 (elukey) a: klausman
[15:01:22] isaranto: so bloom-560 works
[15:01:41] great!
[15:01:55] lemme check latencies
[15:02:15] and I can see the GPU in the container
[15:02:21] but I am not sure if it uses it by default
[15:02:34] ah yes, as I said above it won't use it
[15:02:41] I have to make 2 code changes first
[15:02:43] ahh ok right sorry
[15:02:56] I forgot as well
[15:03:29] bloom-3b for some reason, on another node, works fine
[15:03:36] so adding the GPU makes things weird for it
[15:10:17] it is weird, especially since it is not being used
[15:16:04] (CR) Elukey: [C: +1] revert-risk: handle unsupported edit types for wikidata model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/924912 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[15:23:34] https://blog.koehntopp.info/2020/08/31/on-touching-candles.html Since we've been talking about SLOs recently. A blog post by a good friend of mine, about how some things are learned, and why error budgets (aka: the _other_ part of the 100%) are very important.
[15:34:26] (PS1) Ilias Sarantopoulos: fix: add missing requirements for falcon-7b model and enable GPU support [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861)
[15:44:34] (PS2) Ilias Sarantopoulos: fix: add missing requirements for falcon-7b model and enable GPU support [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861)
[15:47:40] (CR) Elukey: fix: add missing requirements for falcon-7b model and enable GPU support (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[15:49:08] Machine-Learning-Team, Patch-For-Review, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (isarantopoulos) Semantics in pytorch are a bit weird related to rocm: https://pytorch.org/docs/stable/notes/hip.html So as far as I underst...
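A minimal sketch of the "check for gpu and if it doesn't exist go for cpu" change mentioned at 14:00, assuming the image ships a ROCm build of PyTorch: per the HIP semantics notes linked in the task comment above, ROCm is surfaced through the existing torch.cuda API, so a single check covers both the GPU node and CPU-only pods. The tiny model below is only a stand-in for the real bloom/falcon loading code.

```python
import torch

# ROCm builds of PyTorch reuse the torch.cuda namespace (HIP semantics),
# so this check is true on the AMD GPU nodes and false on CPU-only pods.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"running on {device}")

# Stand-in for the real model load; the point is that both the model and
# the request tensors are moved to the selected device.
model = torch.nn.Linear(8, 2).to(device)
batch = torch.randn(4, 8, device=device)
print(model(batch).shape)
```

Doing the check once at model-load time keeps the per-request path identical for the CPU and GPU deployments.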
[15:52:41] (PS3) Ilias Sarantopoulos: fix: add missing requirements for falcon-7b model and enable GPU support [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861)
[15:54:06] (CR) Ilias Sarantopoulos: fix: add missing requirements for falcon-7b model and enable GPU support (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[15:54:53] (CR) Elukey: [C: +1] fix: add missing requirements for falcon-7b model and enable GPU support (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[15:55:06] hopefully the image that will be built with the above patch will be for rocm. I am thinking of creating a single LLM image we can use for now having two variants (one cpu and one gpu)
[15:56:48] I'm logging off folks, more stuff tomorrow!
[15:57:11] ack!
[15:58:46] going afk as well, o/