[06:30:54] Machine-Learning-Team, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (elukey) This is what I have used to test the GPU: ` apiVersion: v1 kind: Pod metadata: name: alexnet-tf-gpu-pod labels: purpose: demo-tf-amdgpu spec:...
[07:05:00] Hi folks! I have some errands to do this morning, I'll be spotty-connected, if you need me ping and I'll answer asap :)
[07:19:50] Good morning!
[08:20:55] \o
[08:21:27] elukey: btw, after the GPU swap, ml-serve1001 booted up with ferm not coming up correctly, so I (re)started it and the alerts cleared
[08:40:33] (CR) Ilias Sarantopoulos: feat: reduce llm memory footprint (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/926507 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[08:45:11] klausman: ack nice!
[08:45:21] sometimes it happens due to failures in dns resolution IIRC
[08:45:58] yeah, if ferm can't resolve the names in the rules file it just dies. A bug, IMO, maybe I'll look into sending a patch upstream.
[08:46:18] (Though I am much more of an nftables guy these days)
[08:48:35] klausman: there is a chance that the bug is already fixed in bookworm or newer versions of ferm
[08:49:37] Also possible. Though I suspect it's not specifically a bug in ferm itself. iptables can be given DNS names, and will resolve them itself at rule-creation time. I suspect that fails, and ferm just treats that failure in a generic way. I dunno if it even can be distinguished from other iptables failures.
[08:50:32] The right thing™ might be to change the startup requirements for ferm. There is a net.online-like target in systemd that only completes after systemd is of the opinion that the machine is online properly (e.g. DHCP has completed).
[08:50:46] But that's something to research for future-me :)
[08:52:38] you can also ask Moritz if the bug is already raised, check debian bugs etc.
[08:52:45] anyway, back to errand mode, bbl!
[08:52:48] yeah, of course, I'd do that first
[08:52:52] ttyl
[09:00:15] (CR) Kevin Bazira: [C: +1] feat: reduce llm memory footprint (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/926507 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[10:11:48] (CR) Ilias Sarantopoulos: [C: +2] feat: reduce llm memory footprint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/926507 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[10:17:18] (Merged) jenkins-bot: feat: reduce llm memory footprint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/926507 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[10:35:27] Machine-Learning-Team, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (isarantopoulos) This is great! Launching the experimental namespace is probably the best/easiest thing to do. Will try to use the amd gpu asap
[11:03:08] elukey: added the rate limit class addition to deployment for today (in ~2h)
[11:18:33] scratch that, I am not sure where my change would go in the various deployment windows
[11:29:14] scratch the scratch, it's on :)
[12:42:51] elukey: 250k qph rate limit for class wme is deployed and confirmed working
[12:43:53] taking a break and going for an eagle stomp, bbiab
[12:58:15] klausman: ack nice!
[12:58:59] klausman: can you update the slack thread with WME and the wikitech docs about the various classes?
[12:59:30] what we have stated in wikitech and api.wikimedia.org is not really the truth (about 200k req/hour)
[13:00:57] isaranto: shall we test the gpus with bloom-3b??
[13:01:46] elukey: yep that's what I thought -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/927620
[13:02:43] when I do helmfile template locally I don't see resources anywhere in the InferenceService produced..
[13:03:47] I'm not talking only about gpu but also about cpu and memory. however the deployed ones we have work fine. Perhaps I am missing something...
[13:04:07] what command do you use?
[13:04:14] in the template we have
[13:04:15] resources:
[13:04:16] {{- toYaml $resources_config | nindent 8 }}
[13:04:20] so it should work
[13:04:54] yeah I saw that..
[13:04:56] `helmfile -e ml-serve-codfw template`
[13:06:16] and if you try with helm template -f etc... ?
[13:06:24] I usually test with that one
[13:08:03] anyway, I applied the change manually on the current bloom-3b isvc in ml-serve-eqiad
[13:08:09] so we can see how it goes before merging changes
[13:08:20] works great :)
[13:08:30] `helm template ../../../charts/kserve-inference -f values.yaml`
[13:09:52] I thought we should add 2 deployments - one with gpu and one without so we can easily test side by side
[13:10:07] definitely eys
[13:10:10] *yes
[13:10:27] is the gpu toleration added in the node?
[13:10:37] what do you mean?
[13:11:02] w8
[13:12:33] hm perhaps I was confused. I think I saw sth but it was for autoscaling
[13:13:53] ok found it, nothing to do with autoscaling, it was about node scheduling https://kserve.github.io/website/0.8/modelserving/nodescheduling/inferenceservicenodescheduling/#node-selector
[13:14:24] we don't have labels on gpu nodes yet
[13:14:36] ack
[13:14:43] seems like the gpu pod started!
[13:15:30] but it doesn't work, can't see logs etc..
[13:15:33] very weird
[13:16:30] mmm
[13:16:31] Warning Evicted 93s kubelet The node was low on resource: ephemeral-storage. Container kserve-container was using 84Ki, which exceeds its request of 0
[13:17:12] hm
[13:17:23] lol I was gonna say it is working but I sent a request to staging
[13:25:59] I got a response from the pod but then it crashed
[13:31:07] it is weird since there seems to be an issue with limit ranges and ephemeral storage
[13:31:33] The node was low on resource: ephemeral-storage. Container istio-proxy was using 10828Ki, which exceeds its request of 0. Container queue-proxy was using 48Ki, which exceeds its request of 0. Container kserve-container was using 84Ki, which exceeds its request of 0.
[13:31:45] never seen this before
[13:41:11] aiko: o/ changeprop deployed in staging!
[13:46:30] elukey: thanks!! :)
[13:49:05] elukey: wiki and slack done re: ratelimits
[13:50:04] klausman: api.wikimedia.org too?
[13:50:21] which page there?
[13:50:49] https://api.wikimedia.org/wiki/API_reference/Service/Lift_Wing -> Authenticated Requests
[13:50:58] Ah, no. Fixing
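A minimal sketch of the direction discussed above (13:16-13:31): when the node runs low on ephemeral-storage, the kubelet evicts containers that exceed their request first, and a request of 0 makes the isvc pod an immediate candidate; scheduling onto the GPU likewise needs the GPU exposed as a resource. The fragment below shows what the predictor container's rendered resources could look like; the amd.com/gpu resource name (it depends on the device plugin installed on the GPU nodes), the key layout and all numbers are illustrative assumptions, not the actual chart output.

```yaml
# Hypothetical rendered resources for the kserve-container predictor; the
# values-file keys that produce this in the kserve-inference chart may differ.
resources:
  requests:
    cpu: "4"
    memory: 16Gi
    ephemeral-storage: 10Gi   # non-zero request, so the pod is no longer the
                              # first eviction candidate when the node runs low
    amd.com/gpu: "1"          # resource name advertised by the AMD device plugin
  limits:
    cpu: "4"
    memory: 16Gi
    ephemeral-storage: 10Gi
    amd.com/gpu: "1"          # extended resources need requests == limits
```

The 13:31 event also lists istio-proxy and queue-proxy exceeding a request of 0; their requests typically come from the mesh and Knative configuration rather than from the InferenceService spec, so they may need a separate change.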
[13:52:44] klausman: what wikitech page did you update? I checked https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage#Authentication but it shows the old procedure
[13:52:52] I mean without the new stuff
[13:53:51] isaranto: how big is bloom-3b?
[13:53:55] in size I mean
[13:54:14] because I tried to set 10G as ephemeral storage and I got OSError: [Errno 28] No space left on device
[13:54:17] :D
[13:54:37] https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_assign_a_client_to_rate_limit_tier
[13:54:42] I'll fix the LW page too
[13:54:46] super
[13:55:22] elukey: approx 12GB
[13:55:56] elukey: on the LW page I went with 200k instead of 250k, since our config is set to the lower value
[13:56:09] "our config" = "our part of the API GW config"
[13:56:11] +1
[13:57:04] isaranto: I have no idea why I have to set them now and not before, I guess that there is a horror of logs or similar
[13:59:33] same here
[14:00:03] also I remembered there are 2 things I need to change in python code for gpu usage
[14:00:18] ok now I see logs
[14:00:19] actually best thing is to check for gpu and if it doesn't exist go for cpu
[14:06:42] Machine-Learning-Team, API Platform: Investigate ad-hoc traffic class for API GW rate limits applied to Inference services as used by WME - https://phabricator.wikimedia.org/T338121 (elukey) a: klausman
[15:01:22] isaranto: so bloom-560 works
[15:01:41] great!
[15:01:55] lemme check latencies
[15:02:15] and I can see the GPU in the container
[15:02:21] but I am not sure if it uses it by default
[15:02:34] ah yes, as I said above it won't use it
[15:02:41] I have to make 2 code changes first
[15:02:43] ahh ok right sorry
[15:02:56] I forgot as well
[15:03:29] bloom-3b for some reason, on another node, works fine
[15:03:36] so adding the GPU makes things weird for it
[15:10:17] it is weird, especially since it is not being used
[15:16:04] (CR) Elukey: [C: +1] revert-risk: handle unsupported edit types for wikidata model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/924912 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[15:23:34] https://blog.koehntopp.info/2020/08/31/on-touching-candles.html Since we've been talking about SLOs recently. A blog post by a good friend of mine, about how some things are learned, and why error budgets (aka: the _other_ part of the 100%) are very important.
[15:34:26] (PS1) Ilias Sarantopoulos: fix: add missing requirements for falcon-7b model and enable GPU support [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861)
[15:44:34] (PS2) Ilias Sarantopoulos: fix: add missing requirements for falcon-7b model and enable GPU support [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861)
[15:47:40] (CR) Elukey: fix: add missing requirements for falcon-7b model and enable GPU support (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[15:49:08] Machine-Learning-Team, Patch-For-Review, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (isarantopoulos) Semantics in pytorch are a bit weird related to rocm: https://pytorch.org/docs/stable/notes/hip.html So as far as I underst...
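A minimal sketch of the "check for gpu and if it doesn't exist go for cpu" change mentioned at 14:00, assuming the image ships a ROCm build of PyTorch: per the HIP semantics notes linked in the task comment above, ROCm is surfaced through the existing torch.cuda API, so a single check covers both the GPU node and CPU-only pods. The tiny model below is only a stand-in for the real bloom/falcon loading code.

```python
import torch

# ROCm builds of PyTorch reuse the torch.cuda namespace (HIP semantics),
# so this check is true on the AMD GPU nodes and false on CPU-only pods.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"running on {device}")

# Stand-in for the real model load; the point is that both the model and
# the request tensors are moved to the selected device.
model = torch.nn.Linear(8, 2).to(device)
batch = torch.randn(4, 8, device=device)
print(model(batch).shape)
```

Doing the check once at model-load time keeps the per-request path identical for the CPU and GPU deployments.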
[15:52:41] (PS3) Ilias Sarantopoulos: fix: add missing requirements for falcon-7b model and enable GPU support [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861)
[15:54:06] (CR) Ilias Sarantopoulos: fix: add missing requirements for falcon-7b model and enable GPU support (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[15:54:53] (CR) Elukey: [C: +1] fix: add missing requirements for falcon-7b model and enable GPU support (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/927733 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[15:55:06] hopefully the image that will be built with the above patch will be for rocm. I am thinking of creating a single LLM image we can use for now having two variants (one cpu and one gpu)
[15:56:48] I'm logging off folks, more stuff tomorrow!
[15:57:11] ack!
[15:58:46] going afk as well, o/