[03:06:36] morning folks :)
[04:00:42] Lift-Wing, Machine-Learning-Team: [LLM] quantization: allow loading model weights as int8/int4 with HF - https://phabricator.wikimedia.org/T377848#10389239 (achou) **AWQ** I finally managed to run AWQ quantized models properly! (Thanks to @MunizaA for pointing out that we need both `use_exllama_v2=True...
[04:14:27] kevinbazira: o/ I'm getting an error when importing optimum_benchmark after installing it: https://phabricator.wikimedia.org/P71638. Have you encountered this before?
[05:12:56] aiko: o/ I have not encountered that error before.
[05:13:03] let me try to reproduce it ...
[05:54:02] Aiko, hope the following steps will help: https://phabricator.wikimedia.org/P71638#287073
[06:53:20] (PS1) Kevin Bazira: article-country: reflect input language in the response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101375 (https://phabricator.wikimedia.org/T371897)
[08:10:48] hey folks!
[08:10:56] ml-lab1001's /srv partition is getting full
[08:10:57] elukey@ml-lab1001:/srv$ sudo du -hs *
[08:10:57] 233G hf-cache
[08:10:57] 96G home
[08:10:57] 1006M pytorch-rocm
[08:11:07] mostly it is that hf-cache dir
[08:18:52] Hello o/
[08:38:38] thanks Luca, will check the cache to see what we can clear from there
[08:44:08] aiko: great news on AWQ 🎉
[08:46:12] was that using 2 GPUs? in the memory table, is "17G + 13G" referring to usage per GPU or is it something different?
[09:07:47] isaranto: yep! it was using 2 GPUs (device_map="auto"). upstream seems to have fixed the issue we faced before
[09:09:02] thanks Kevin! I'll give it a try
[09:13:58] aiko: could you try using 1 GPU so that we know what happens when using 1 (which is what we will try on Lift Wing for now)
[09:15:06] you will have to set `export CUDA_VISIBLE_DEVICES=1`; there is a script on the ml-lab page https://wikitech.wikimedia.org/wiki/Machine_Learning/ML-Lab#Huggingface_Cache
[09:15:16] I'll check if it needs updating
[09:17:07] ack, I'll try it later!
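(Editor's note: a minimal sketch of the single-GPU test discussed above. Restricting visibility to one device before torch initializes leaves `device_map="auto"` only one GPU to place layers on. The model id and dtype below are assumptions, not the exact commands run on ml-lab.)

```python
# Sketch only: run the checkpoint on a single GPU, as suggested in the thread.
# CUDA_VISIBLE_DEVICES must be set before torch sees the devices.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-expanse-8b"  # placeholder; substitute the AWQ-quantized checkpoint used in T377848

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # with one visible GPU, everything lands on that device
)
```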
[09:41:23] klausman: any idea how we can overcome this: https://phabricator.wikimedia.org/T377848#10387148? The pod needs access to rocminfo. I was thinking that we could attach a volume with a host path to /opt/rocm, but ideally I'd like to work with upstream to allow manual configuration of the rocm architecture
[09:41:38] Morning!
[09:43:02] Is there a reason why pytorch 2.3.0/rocm60 are used?
[09:45:51] As for rocminfo, I think if we pulled in the binary and two .so's that would work, but I worry that would just result in the next step(s) needing more from /opt
[09:52:28] hey! I used pytorch 2.3.0/rocm60 because that is the image we have at the moment. I'll create a new image today in production-images.
[09:53:00] I think that would solve at least the second problem
[09:53:24] for the rocminfo thing I opened an issue on GH with what I think should be a viable solution https://github.com/ROCm/bitsandbytes/issues/53 but that will take a while and I don't know how to proceed for now
[09:55:12] As for mounting /opt into a pod, it is something Luca and I looked at when we first wanted to address the sheer size of the rocm drivers, and if I remember correctly, there were two options: hostPath, which is strongly warned against, and local PersistentVolumes, which have the problem that k8s would think that /opt on e.g. ml-serve2009 and 2010 are two different volumes that are not interchangeable
[09:55:57] I think the GH issue is the right way forward, hopefully that will get a quick response
[09:56:43] (I also have.... opinions about running a binary and grepping through its output to get info like that...)
[09:56:50] klausman, isaranto - I don't have the full context but we shouldn't have any /opt/rocm on the k8s workers' OS nowadays
[09:57:10] plus knative-serving prohibits the usage of hostPath IIRC
[09:57:40] Makes sense (getting /opt/rocm onto k8s workers would not be impossible, but I would rather avoid it)
[09:59:07] the issues about having the rocm drivers on a k8s worker are multiple (OS compatibility between k8s worker / pod, fixed version shared with all pods that may need a different one, etc.)
[10:00:21] Yeah, the pods-in-lockstep thing I had missed.
[10:01:13] We could also look at getting rocminfo into the image as a binary. But getting the transitive closure of all .so's might be a pain
[10:06:13] yes definitely, from https://packages.debian.org/bookworm/rocminfo it brings in a lot of things (libhsa-runtime is big)
[10:07:15] but Ilias' idea about the env var is good, upstream shouldn't really oppose it, maybe they just haven't tested it in an image like pytorch (where we have only the libs)
[10:22:03] If the AMD rocm packages shipped .a files, we could make a statically linked rocminfo, but alas, it is not so
[11:04:17] * klausman lunch
[11:04:56] isaranto: btw, do you think it would help if we offered a patch for the get-arch-from-env bug? I'd be willing to make one.
[11:05:44] btw I found the reference in the original bnb repo, for some reason search on GH wouldn't bring it up https://github.com/ROCm/bitsandbytes/blob/4aad810bc1d93c38a5316ec54c822cd12b1f1cd2/bitsandbytes/cuda_specs.py#L54
[11:06:00] I think it makes sense, I was thinking of doing it, but feel free to
[11:07:00] for now I was going to try another hacky thing Muniza suggested: to just create a file named rocminfo that returns the string we want and add that to PATH :)
[11:15:22] isaranto: o/ I added the results using 1 GPU: https://phabricator.wikimedia.org/T377848#10389237
[11:18:31] aiko: thanks! so when u use 1 GPU it is faster for the quantized but not for the vanilla model
[11:25:40] (PS1) Ilias Sarantopoulos: llm: add rocminfo executable for gfx90a [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101491 (https://phabricator.wikimedia.org/T377848)
[11:26:53] (PS2) Ilias Sarantopoulos: llm: add dummy rocminfo executable for gfx90a [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101491 (https://phabricator.wikimedia.org/T377848)
[11:27:43] klausman: lemme know what you think --^
[11:27:50] TODO: remove this :)
[11:38:00] (CR) Klausman: [C:+1] llm: add dummy rocminfo executable for gfx90a [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101491 (https://phabricator.wikimedia.org/T377848) (owner: Ilias Sarantopoulos)
[11:38:37] The other option (slightly more permanent...) would be to have a script that just echoes whatever env var we want to use
[11:38:46] But I am fine with hardcoding it for now
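(Editor's note: for context, the "dummy rocminfo" workaround amounts to putting a trivial executable named `rocminfo` on the image's PATH that prints the architecture string bitsandbytes' detection code greps for. A rough sketch, assuming the detection only needs to find the `gfx90a` token in the output; the actual patch in inference-services may well look different.)

```python
#!/usr/bin/env python3
# Hypothetical stand-in for /opt/rocm/bin/rocminfo inside the pytorch image.
# The ROCm fork of bitsandbytes shells out to rocminfo and searches its output
# for a gfx architecture name, so printing one matching line is assumed to be
# enough for detection to succeed on an MI250-class (gfx90a) GPU.
print("Name: gfx90a")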
[11:41:00] ack, thanks
[11:47:08] (CR) Ilias Sarantopoulos: [C:+2] llm: add dummy rocminfo executable for gfx90a [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101491 (https://phabricator.wikimedia.org/T377848) (owner: Ilias Sarantopoulos)
[11:49:57] (Merged) jenkins-bot: llm: add dummy rocminfo executable for gfx90a [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101491 (https://phabricator.wikimedia.org/T377848) (owner: Ilias Sarantopoulos)
[11:55:45] (PS1) Ilias Sarantopoulos: llm: fix hardcoded PATH variable [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101495
[11:56:09] seems like $PATH:/new/dir works in blubber. I should have tried it before I hardcoded the path
[12:20:08] Machine-Learning-Team, MediaWiki-extensions-ORES, Edit-Review-Improvements-RC-Page, MediaWiki-Recent-changes, Moderator-Tools-Team: [SPIKE] How could we add topic filtering to Recent Changes? - https://phabricator.wikimedia.org/T381569#10390073 (Samwalton9-WMF)
[12:25:31] ml-lab is totally full now. I don't have permissions to delete things from /srv/hf-cache, but I asked in slack so that we can just delete older models
[12:26:47] klausman: is there a way I can get permission for /srv/hf-cache?
[12:27:05] yep, trying to start the jupyter server throws: `OSError: [Errno 28] No space left on device`
[13:10:23] isaranto: sure, just a sec
[13:10:29] I created a new production image for torch 2.5.1 https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1101524
[13:13:15] I freed up some space via deduping, Kevin
[13:13:22] thanks Tobias!
[13:14:06] and the perms should be good now as well
[13:15:46] Dankee
[13:16:16] as for the torch 2.5.1 image, I need to verify that this is the one we need at the moment. I'll do that on ml-lab
[13:16:44] but we will definitely use it in any case. I will also create a patch to delete the older unused torch images
[13:17:05] danke Tobias! the jupyter server is able to start now.
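(Editor's note: as an aside on the cache cleanup, `huggingface_hub` can report and prune the model cache programmatically, which makes it easier to spot the older models worth deleting. A sketch, assuming the environment already points the HF cache at /srv/hf-cache:)

```python
# Sketch: list cached Hugging Face repos by size and optionally prune revisions.
from huggingface_hub import scan_cache_dir

cache = scan_cache_dir()  # honours HF_HOME / HF_HUB_CACHE, e.g. /srv/hf-cache
for repo in sorted(cache.repos, key=lambda r: r.size_on_disk, reverse=True):
    print(f"{repo.size_on_disk / 1e9:7.1f} GB  {repo.repo_id}")

# To actually free space, pass revision hashes taken from the scan above:
# strategy = cache.delete_revisions("<revision-hash>")
# print("would free", strategy.expected_freed_size_str)
# strategy.execute()
```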
[13:36:59] I have verified bitsandbytes with torch 2.5.1 so we are good to go with that one
[14:06:07] (CR) Klausman: [C:+1] llm: fix hardcoded PATH variable [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101495 (owner: Ilias Sarantopoulos)
[14:16:45] (CR) Ilias Sarantopoulos: [C:+2] llm: fix hardcoded PATH variable [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101495 (owner: Ilias Sarantopoulos)
[14:17:31] (Merged) jenkins-bot: llm: fix hardcoded PATH variable [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101495 (owner: Ilias Sarantopoulos)
[14:18:21] (PS2) Kevin Bazira: article-country: reflect input language in the response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101375 (https://phabricator.wikimedia.org/T371897)
[14:18:24] (CR) Ilias Sarantopoulos: [C:+1] article-country: reflect input language in the response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101375 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[14:20:37] (CR) Kevin Bazira: [C:+2] article-country: reflect input language in the response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101375 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[14:21:22] (Merged) jenkins-bot: article-country: reflect input language in the response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1101375 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[14:58:37] Good morning
[15:02:37] o/
[15:23:10] o/ the Flash Attention accelerated `aya-expanse-8b` model runs much faster than its vanilla counterpart: https://phabricator.wikimedia.org/P71641#287101
[15:30:01] TIL: FlashAttention-2 only supports `fp16` and `bf16` data types:
[15:30:01] https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2
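(Editor's note: for the record, enabling FlashAttention-2 in transformers is a load-time flag plus an explicit half-precision dtype, since FA2 does not accept fp32. A sketch, assuming the flash-attn wheel is installed and reusing the same assumed repo id as above; this is not the exact benchmark code from P71641.)

```python
# Sketch: load aya-expanse-8b with FlashAttention-2, which requires fp16/bf16.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/aya-expanse-8b",             # assumed repo id
    torch_dtype=torch.bfloat16,               # fp16 also works; fp32 would make FA2 refuse to load
    attn_implementation="flash_attention_2",  # needs the flash-attn package built for this GPU
    device_map="auto",
)
```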
[15:30:35] hey!
[15:32:37] kevinbazira: nice! I also saw that it was running much faster, which is great. Tomorrow we can discuss how we can create an environment to build and publish the wheels so that we can use this on Lift Wing
[15:33:01] can you write the update on the task so that it doesn't get lost from the paste https://phabricator.wikimedia.org/T371344
[15:33:03] ?
[15:35:17] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10390500 (kevinbazira) While running quantization experiments on the `aya-expanse-8b` model in T377848#10382809, the vanilla model had an inference speed of [[ https://phabri...
[15:35:45] Weebale!
[15:35:58] haha :D
[15:36:34] thanks for the pointer, Ilias!
[15:36:59] I'll look into building and publishing the wheels tomorrow :)
[15:48:58] I meant for us to discuss it and figure out a proper way to do it. We want to first build the wheels on ml-lab and deploy a model on Lift Wing (which is now failing), and then establish a proper way to do this with CI/CD
[15:49:20] we were discussing that gitlab would probably be the place to do that, but let's chat about it tomorrow
[15:58:51] okok
[16:51:23] klausman: if the production image is ok, could you merge it and build it when you have time? I don't have +2 on prod-images
[16:51:27] referring to https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1101524
[16:59:10] ack, will do after the SRE-meeting
[17:26:25] even tomorrow, thank u!
[17:54:25] going afk folks, cu tomorrow!
[18:19:56] (PS1) Nik Gkountas: collections: add recommendation to the list only if not already present [research/recommendation-api] - https://gerrit.wikimedia.org/r/1101569 (https://phabricator.wikimedia.org/T381777)
[18:34:16] isaranto: and published!
[18:39:49] (CR) Sbisson: "How do you envision this code evolving when we implement multiple selection and we have to return recommendations that are in some collect" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100512 (https://phabricator.wikimedia.org/T381366) (owner: Nik Gkountas)
[20:24:18] awesome, thanks Tobias!
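(Editor's note: the collections patch description above amounts to an append-if-absent check; a hypothetical sketch of that logic follows. The field names are made up, and the actual recommendation-api code will differ.)

```python
# Hypothetical sketch of "add recommendation to the list only if not already present".
def add_recommendation(recommendations: list[dict], candidate: dict) -> None:
    # De-duplicate on title; the real code may compare a different key.
    if not any(r.get("title") == candidate.get("title") for r in recommendations):
        recommendations.append(candidate)
```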