[07:00:10] <isaranto>	 Good morning!
[07:01:58] <wikibugs>	 (03CR) 10Kevin Bazira: major: modernize the codebase, keep only translation recommendations (032 comments) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1052445 (https://phabricator.wikimedia.org/T369484) (owner: 10Santhosh)
[07:05:45] <wikibugs>	 06Machine-Learning-Team: Fix articletopic-outlink CrashLoopBackOff issue - https://phabricator.wikimedia.org/T370408#10001384 (10kevinbazira) 05Open→03Resolved p:05Triage→03High
[07:07:56] <isaranto>	 kevinbazira: o/ good morning
[07:08:29] <kevinbazira>	 isaranto: o/ morning morning
[07:09:06] <isaranto>	 let's resolve the CrashLoopBackOff task after it is deployed in prod. on the LLM stuff do you want to explore building and running image on the ml-testing machine?
[07:09:25] <isaranto>	 if you want to read or explore other stuff first feel free
[07:10:16] <isaranto>	 I was thinking that it is important we all do some reading and exploration separately so that we can bring in ideas. We just need to be in sync of what we're doing
[07:12:28] <kevinbazira>	 yes, using ml-testing I am going to first run the HF server the way it is documented in the KServe docs then I'll compare with how we are running it in LW.
[07:14:18] <isaranto>	 cool, thank you!I'm not aware of the amount of resources ml-testing has so curious to see what kind of model size we can test there
[07:15:30] <kevinbazira>	 yeah. they are not as much as we have on staging but I'll use the small LLMs first and share updates ...
[07:24:12] <wikibugs>	 06Machine-Learning-Team: Gemma2 in staging: HIP out of memory - https://phabricator.wikimedia.org/T370615 (10achou) 03NEW
[07:26:12] <aiko>	 ---^ we could discuss this later in the standup
[07:26:22] <aiko>	 good morning! o/
[07:28:09] <isaranto>	 o/ aiko: nice finding! I was seeing some timeouts (requests were just hanging) but I didn't see this error in the logs. yes lets discuss later how to tackle this
[07:35:54] <aiko>	 o/ ml-testing? which machine is that?
[07:35:59] <aiko>	 I'm also going to do some reading and exploration before the standup :)
[07:43:21] <isaranto>	 it is the revamped ml-sandbox!
[07:43:46] <isaranto>	 `ml-testing.machine-learning.eqiad1.wikimedia.cloud`
[08:05:21] <aiko>	 ohhhhh cool!
[08:21:16] <klausman>	 We had to retire the old Buster VM, so I created a new Bookworm one (the homedirs remained intact).
[09:24:21] <kevinbazira>	 on ml-testing: I've run the HuggingFace server based on KServe docs for the bert model. had to free up about 6GB of space to install the necessary requirements and run the model-server.
[09:40:51] <klausman>	 isaranto: I'll bounce the gemma2 pod in staging, see what happens.
[09:45:19] <isaranto>	 klausman: go ahead! 
[09:46:14] <isaranto>	 I'll be watching for its resources, so that we can organize the work around lower memory usage
[09:47:27] <isaranto>	 after discussing with Aiko, I'm going to build a variant of our image that works on apple silicon so that we can work locally 
[09:47:34] <klausman>	 ack.
[09:58:54] <klausman>	 So I see about a req/s on that service
[09:59:03] <klausman>	 from stat1010
[10:00:16] <isaranto>	 hmm I see the requests as well
[10:00:59] <klausman>	 I think it's something Diego is running
[10:01:04] <isaranto>	 I'm going to ping diego to check
[10:01:55] <isaranto>	 If we want to do work without these requests we can change the name of the service and/or model. For now I think it is ok, we can just monitor it
[10:02:24] <klausman>	 The only process on stat1010 that talks to inference-staging.svc.codfw.wmnet:30443 is owned by user dsaez :)
[10:02:31] <isaranto>	 one thing that came to mind is that we can add a flag in the deployment so that we log the payload
[10:02:38] <isaranto>	 ack!
[10:03:46] <isaranto>	 for now also logging the prompt size of the request will be a good way to understand latency vs request size
[10:04:24] <klausman>	 we should limit that logging though, otherwise someone will send us a 10GiB prompt and eat all diskspace :)
[10:05:07] <isaranto>	 true! 
[10:05:45] <klausman>	 as fro the model working in principle: I just ran a quer with the prompt "Wikipedia is" and alimit of 50 tokens, and it completed in ~10s
[10:07:10] <klausman>	 You can now also see the GPU usage in the Grafana DB
[10:07:30] <klausman>	 and the power consumption O.O
[10:08:31] <isaranto>	 niiiice, thanks Tobias!
[10:09:17] <klausman>	 going for a quick lunch bbiab
[10:35:24] * isaranto afk lunch!
[10:46:00] <aiko>	 Is it possible to increase the disk space for docker images on ml-testing? I got "no space left on device" when pulling the hf image from wmf registry
[10:46:11] <aiko>	  I already used docker system prune to clean up unused stuff, but seems still not enough space
[10:46:28] <klausman>	 is this on /srv?
[10:47:36] <aiko>	 mm I'm not sure
[10:49:27] <klausman>	 I'll add another 80G
[10:49:32] <klausman>	 er 20G, total of 80G
[10:49:47] <klausman>	 But I have to reboot the VM for that
[10:50:44] <klausman>	 kevinbazira: aiko: are you ok with me rebooting ml-testing?
[10:50:54] <aiko>	 we originally have 60G?
[10:51:43] <klausman>	 yes
[10:51:50] <klausman>	 20 for / and 60 for /srv
[10:52:13] <aiko>	 looking at ~$ sudo docker system df
[10:52:13] <aiko>	 TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
[10:52:13] <aiko>	 Images          10        2         2.268GB   1.575GB (69%)
[10:52:13] <aiko>	 Containers      2         2         21.31GB   0B (0%)
[10:52:13] <aiko>	 Local Volumes   15        2         16.32GB   16.32GB (100%)
[10:52:13] <aiko>	 Build Cache     25        0         0B        0B 
[10:52:20] <aiko>	 weird the images don't occupy a lot of space..
[10:52:36] <klausman>	 ah, let me check a setting that may be wrong
[10:53:18] <aiko>	 ack
[10:57:20] * aiko a quick lunch!
[11:01:37] <klausman>	 aiko: when you're back: what image were you pulling?
[11:04:41] <klausman>	 I'll presume  docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-huggingface:stable
[11:08:53] <klausman>	 One of the problems is that the HF image has one layer (2418fb018330) that is really small in compressed rom 2.3G or so), but it unpacks to a lot more
[11:09:40] <klausman>	 on disk, it's 15.42GB after all is unpacked
[11:10:25] <klausman>	 I deleted the old minikube layers and got us 16G back, but I think the add'l 20G I added VM-side would be useful. That will require a reboot.
[11:14:59] <aiko>	 klausman: yes that's the one I was pulling
[11:15:22] <klausman>	 it's pulled now :)
[11:16:42] <klausman>	 elukey: your homedir on the former-sandbox-now-testing VM has a 1/2GiB minikube subdir. Can I delete that in the quest for more disk space?
[11:16:46] <aiko>	 nice thank you! :)
[11:17:26] <wikibugs>	 (03CR) 10Nik Gkountas: [C:03+1] Recommend articles to translate based on topic (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: 10Santhosh)
[11:17:38] <wikibugs>	 (03CR) 10Nik Gkountas: [C:03+1] "recheck" [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: 10Santhosh)
[11:18:05] <elukey>	 sure
[11:18:10] <klausman>	 ty!
[11:23:17] <wikibugs>	 (03PS4) 10Nik Gkountas: Recommend articles to translate based on topic [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: 10Santhosh)
[11:24:57] <wikibugs>	 (03CR) 10Nik Gkountas: [C:03+1] "Re-applying +1, after lint fix" [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: 10Santhosh)
[12:04:56] <isaranto>	 the gemma2 instance fell into the same state again, pod seems running fine, but requests get a timeout. It threw the same error about hip approx 2h ago, but since then there is not other reference in the logs
[12:06:32] <isaranto>	 https://phabricator.wikimedia.org/P66882
[12:07:37] <isaranto>	 before I make a PR to add logging I'm checking whether there is such an option in kserve/huggingfaceserver
[12:13:17] <isaranto>	 we can see temperature and power consumption going up while it is being used https://grafana.wikimedia.org/goto/ma26jouSg?orgId=1
[12:14:47] <isaranto>	 but I think it is clear that vram utilization hits 100%.
[12:15:30] <klausman>	 could be a memory leak on the GPU side. We fly pretty close to the maxmem of the GPU even without queries running.
[12:19:57] <isaranto>	 I plan to deploy the 9b model and we can observe what happens with GPU usage. wdyt?
[12:22:16] <klausman>	 sgtm
[12:23:14] <klausman>	 In theory, it should use 1/3 the Vram
[12:24:28] <isaranto>	 ok, I'm on it!
[12:26:08] <klausman>	 are ypu replacing the 27B one or running it in parallel?
[12:27:29] <isaranto>	 I was planning to just to replace the current one by editing the isvc
[12:27:41] <klausman>	 yeah, sg
[12:27:48] <klausman>	 did Deigo respond, btw?
[12:27:52] <isaranto>	 currently I'm downloading the model and uploading it to swift
[12:28:15] <isaranto>	 I didn't reach out yet! I figured it is ok to have requests coming in for the moment
[12:28:44] <klausman>	 ack
[12:28:53] <klausman>	 I think they may have stopped at some point
[12:29:01] <isaranto>	 but I will let him know that we're switching the model
[13:16:36] <isaranto>	 still uploading! will let you know once it is deployed
[13:18:55] <klausman>	 sure, no worries
[13:20:30] <isaranto>	 I was expecting to be able to see the inf service latencies for gemma2 in this dashboard https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services, however it doesn't appear in the model_name dropdown list. Any ideas?
[13:20:59] <isaranto>	 I see the other models  (e.g. the bert model is there). I am referring to codfw-ml-staging -> experimental ns
[13:25:31] <wikibugs>	 06Machine-Learning-Team: Fix articletopic-outlink CrashLoopBackOff issue - https://phabricator.wikimedia.org/T370408#10002240 (10isarantopoulos) 05Resolved→03Open Let's keep this Open until we deploy the new version with the fix to production
[13:30:19] <klausman>	 lemme have a look
[13:34:07] <isaranto>	 the 9b version is deployed. This uses 17.5GB VRAM. The  deployment attributes (hostname etc) are still the ones of 27b as I have edited it on the fly
[13:34:14] <klausman>	 ack
[13:34:43] <klausman>	 as for the dashboard, I am not sure why it isn't there. from Thanos, it seems that we don't have a request_preprocess_seconds_bucket metric for it
[13:35:26] <isaranto>	 klausman: don't mean to interrupt your work if you're doing sth else. I just noticed it. We can add it in our to-do list for the week
[13:35:34] <isaranto>	 thanks for looking into it though!
[13:35:58] <klausman>	 ah, well, does the gemma model even have a preprocess stage?
[13:36:34] <klausman>	 If I curl the metrics endpoint, I get basically no usage metrics
[13:40:17] <isaranto>	 I'm not sure about that. I'll have to look into the kserve code. The bert model which is deployed in the same way does show up, but it is using a diferent part of the codebase
[13:40:51] <klausman>	 I'll do some digging
[13:42:50] <isaranto>	 well iiuc this happens cause we are using the completions endpoint instead of the predict one
[13:42:59] <klausman>	 isaranto: it looks like for gemma2, the exported metric is revision_request_latencies_bucket 
[13:43:43] <klausman>	 The technical problem is that so far, we could enumerate all possible targets to display in the Grafana dashboard by wuerying request_preprocess_seconds_bucket and extracting the list form the values that "component" has
[13:47:28] <klausman>	 I am not sure what the easiest approach here is. We may need a separate dashboard for HF services
[13:49:34] <isaranto>	 by skimming through the code it seems that it is totally different. Let's add it to the backlog of tasks, but it would be better to invest some time in this after we are sure that we're going to proceed with the service as it is
[13:49:44] <isaranto>	 just to avoid redoing the dashboard if we change things
[13:51:11] <klausman>	 Aye.
[13:51:41] <klausman>	 making a separate (from Kserve) dashboard is probably a good idea anyway, since the numbers will not really be comparable between the two.
[13:54:35] <klausman>	 Making a separate dashboard shouldn't be too much work (he said...)
[13:54:59] <isaranto>	 ack!
[14:14:31] <klausman>	 I have to quickly run an errand, bbiab
[14:31:00] <aiko>	 I was downloading the llama3-8b model to ml-testing but there seems no enough space. Only 4.5G left for /srv and the model files need ~15G
[14:31:40] <isaranto>	 ouch!
[14:34:21] <kevinbazira>	 aiko: o/ yep, I ran into a similar issue. ended up using phi 1.5 as it is way smaller
[14:36:42] <klausman>	 aiko, I can add ~20G if I can reboot the machine, but I don't want to break anything you or Ilias and Kevin are doing
[14:38:01] <isaranto>	 I'm not using ml-testing nor plan to use it todya
[14:38:04] <isaranto>	 *today
[14:38:07] <klausman>	 ack.
[14:39:09] <aiko>	 np for me!
[14:39:27] <klausman>	 kevinbazira: any objects to rebooting ml-testing?
[14:40:09] <kevinbazira>	 klausman: o/ no objections
[14:40:22] <klausman>	 alright, will proceed in a moment
[14:43:35] <klausman>	 and done
[14:43:39] <klausman>	 24G free now
[14:44:22] <klausman>	 and now 28 after changing the reserved block count to 0 (which it should have been in the first place)
[14:44:39] <aiko>	 niceeee o/
[14:44:41] <klausman>	 I can get another 20G or so once we delete the sandbox
[14:44:50] <klausman>	 i.e. the old VM
[14:46:56] <chrisalbon>	 hey all!
[14:47:18] <isaranto>	 o/ Chris!
[14:47:59] <chrisalbon>	 I am dying to know how the sprint is going
[14:50:44] <klausman>	 Well, I just conjured 24GB of disk space out of thin air to enable more LLM testing on ml-testing :)
[14:51:02] <klausman>	 Ilias and I also poked and prodded gemma2 a bit, that's still ongoing
[14:56:42] <isaranto>	 we're all getting acquainted with the current status of things and prioritize next steps: developer experience(ml-testing/sandbox  and apple silicon builds), memory issues we are already experiencing, grafana GPU dashboards etc
[14:57:07] <isaranto>	 we are using the team meeting doc for the moment to sync during the standup and we'll create tasks from there
[15:08:15] <isaranto>	 I'm rechecking the torch version numbers in the PR I opened to upstream https://github.com/kserve/kserve/pull/3783
[15:10:47] <isaranto>	 I don't think we need to pin the torch version to 2.3.0. iirc I did this before we fixed the python site-packages as I thought that this was the reason
[15:32:19] <wikibugs>	 06Machine-Learning-Team: Run LLMs locally in ml-testing - https://phabricator.wikimedia.org/T370656 (10kevinbazira) 03NEW
[15:39:59] <wikibugs>	 06Machine-Learning-Team: Run LLMs locally in ml-testing - https://phabricator.wikimedia.org/T370656#10003077 (10kevinbazira) I managed to host the [[ https://huggingface.co/google-bert/bert-base-uncased | bert-base-uncased ]] model. Had to free up about 6GB of space to install the necessary requirements and run...
[15:42:42] <wikibugs>	 06Machine-Learning-Team: Run LLMs locally in ml-testing - https://phabricator.wikimedia.org/T370656#10003120 (10kevinbazira) I tried running the [[ https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 | Mistral-Nemo-Instruct ]] model, and it maxed out the 30GB RAM on ml-testing. Decided to focus on LLMs t...
[15:42:48] <aiko>	 https://phabricator.wikimedia.org/P66887
[15:43:22] <aiko>	 isaranto: o/ have you encountered this error before?  ----^
[15:45:37] <wikibugs>	 06Machine-Learning-Team: Run LLMs locally in ml-testing - https://phabricator.wikimedia.org/T370656#10003153 (10kevinbazira) Finally hosted Microsoft's [[ https://huggingface.co/microsoft/phi-1_5 | phi 1.5 ]] model and used it for a text generation task with the completions endpoint: `lang=bash, name=Terminal 1:...
[15:45:49] <isaranto>	 nope! 
[15:47:36] <aiko>	 okk
[15:50:50] <isaranto>	 aiko: I was thinking that we should change the entrypoint so that it can accept cmd arguments as we did in https://phabricator.wikimedia.org/T370408#9997946
[15:52:41] <isaranto>	 that way we could append other cmd args to docker run without rebuilding the image or manually overriding the entire entrypoint. For example we could add `--dtype bfloat16` or change the inference engine as we do in deployment charts
[15:52:43] <klausman>	 isaranto: https://grafana.wikimedia.org/goto/6TsJdAXIg?orgId=1 First draft :)
[15:53:19] <aiko>	 yep that sounds good 
[15:53:30] <isaranto>	 like this https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/ml-services/experimental/values-ml-staging-codfw.yaml#L49
[15:53:50] <isaranto>	 actually then we could even change the deployment charts to accept args instead of command
[15:56:41] <isaranto>	 that looks nice Tobias, thanks! from a first look values seem a bit off but I can revisit once we do some testing to cross check response latency  vs what the chart shows
[15:58:43] <aiko>	 wait so right now we didn't use the entrypoint.sh when starting the model server?  https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/huggingface_modelserver/entrypoint.sh 
[15:59:14] <aiko>	 is "command" in deployment charts like new entrypoint for the container?
[16:01:53] <isaranto>	 yes it overrides the entrypoint
[16:03:35] <aiko>	 I see
[16:05:05] <isaranto>	 I'll open a patch to accept arguments in the same way that we did for the other model servers it seems better so that we avoid having discrepancies between local deployments and k8s
[16:05:22] <isaranto>	 and then we can just specify additional args we need in dep-charts
[16:16:36] <wikibugs>	 06Machine-Learning-Team: [LLM] Allow additional cmd arguments in hf image - https://phabricator.wikimedia.org/T370670 (10isarantopoulos) 03NEW
[16:16:45] <wikibugs>	 06Machine-Learning-Team: [LLM] Run LLMs locally in ml-testing - https://phabricator.wikimedia.org/T370656#10003421 (10isarantopoulos)
[16:16:53] <wikibugs>	 06Machine-Learning-Team: [LLM] Gemma2 in staging: HIP out of memory - https://phabricator.wikimedia.org/T370615#10003423 (10isarantopoulos)
[16:17:04] <wikibugs>	 10Lift-Wing, 06Machine-Learning-Team: [LLM] Use huggingface text generation interface (TGI) on huggingface image. - https://phabricator.wikimedia.org/T370271#10003424 (10isarantopoulos)
[16:17:13] <wikibugs>	 10Lift-Wing, 06Machine-Learning-Team: [LLM] Use vllm for ROCm in huggingface image   - https://phabricator.wikimedia.org/T370149#10003426 (10isarantopoulos)
[16:18:52] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: huggingface: accept cmd args in docker entrypoint [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1055972 (https://phabricator.wikimedia.org/T370670)
[16:19:31] <isaranto>	 the above is WIP as I'm testing it now
[16:30:27] <aiko>	 yayy running the llama3 model server with docker on ml-testing works! 
[16:31:13] <isaranto>	 nice work every1!
[16:33:08] <aiko>	 it took 4 min to get a response XD 
[16:33:17] <isaranto>	 :D
[16:33:22] <klausman>	 and the VM is probably incandescent
[16:33:44] <isaranto>	 well that would be ok if you're not in a hurry :P
[16:34:56] <isaranto>	 btw llama 3.1 is coming out https://www.reddit.com/r/LocalLLaMA/comments/1e9hg7g/comment/leedpl3/
[16:36:29] <isaranto>	 and with that I thought that we need to update our readme: what to do when a new model comes out and we want to deploy it
[16:40:48] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: huggingface: accept cmd args in docker entrypoint [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1055972 (https://phabricator.wikimedia.org/T370670)
[16:42:52] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "Tested by appending dtype arg to the docker run command with gemma-9b-it:" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1055972 (https://phabricator.wikimedia.org/T370670) (owner: 10Ilias Sarantopoulos)
[16:50:08] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: huggingface: add blubber image for cpu/apple silicon [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1055977
[16:58:16] <aiko>	 ---^ I'll review them tomorrow the first thing!
[16:59:42] <aiko>	 next step I'm going to try building the image on ml-testing
[17:00:02] <isaranto>	 sure! just try to build it. It is WIP as I'll try to use the standard torch release instead of the cpu one so that we can utilize mps (the integrated gpu)
[17:00:04] <isaranto>	 ack
[17:00:33] <isaranto>	 I'll do that in the morning 
[17:00:44] <isaranto>	 I'm logging off for the day, talk to you tomorrow folks o/
[17:01:49] <aiko>	 see u Ilias! have a nice evening :)
[17:02:01] <isaranto>	 aiko: the cpu/apple silicon patch is the one that is WIP but it works. the other one is ready to review
[17:02:09] <isaranto>	 u2, have a nice evening!
[17:02:18] <aiko>	 ack!
[19:09:06] <wikibugs>	 06Machine-Learning-Team: [LLM] Run LLMs locally in ml-testing - https://phabricator.wikimedia.org/T370656#10004394 (10achou) Today I built and ran the llama3-8B-instruct model server on ml-testing. (Thank Tobias for conjuring 24GB of disk space!)   ` $ docker run --rm -v /srv/models/Meta-Llama-3-8B-Instruct:/mnt...