[01:18:31] (CR) Jsn.sherman: "@isarantopoulos@wikimedia.org is there any update on this?" [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132) (owner: Rockingpenny4)
[07:08:16] morning!
[07:29:37] morning!
[08:28:47] so I tested the low_cpu_mem_usage option in ml-testing using my kserve fork https://github.com/AikoChou/kserve/commit/4c3c289e551a89e885f6bd8850070ecf3fadd1cf
[08:28:59] and used docker stats to monitor the container's memory usage when loading a llama3 model
[08:29:35] I didn't see any improvement in the memory usage
[08:34:52] I had some issues pushing code to our wikimedia kserve fork so I didn't use that
[08:35:04] got "remote: Permission to wikimedia/kserve.git denied to AikoChou"
[08:35:30] hey, lemme look into that, perhaps u don't have permissions
[08:35:31] do I need to change any settings?
[08:35:52] okk
[08:37:02] I added the whole team, can u check again?
[08:37:32] alright let me check
[08:38:38] (PS12) Santhosh: major: modernize the codebase, keep only translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052445 (https://phabricator.wikimedia.org/T369484)
[08:39:14] yess it works now!
[08:39:38] thank you :D
[08:39:44] \o/
[08:40:00] (CR) Santhosh: "The pageviews returned by the API is 0 - randomly?" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052445 (https://phabricator.wikimedia.org/T369484) (owner: Santhosh)
[08:49:07] (CR) Santhosh: [C:+2] major: modernize the codebase, keep only translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052445 (https://phabricator.wikimedia.org/T369484) (owner: Santhosh)
[08:49:44] (Merged) jenkins-bot: major: modernize the codebase, keep only translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052445 (https://phabricator.wikimedia.org/T369484) (owner: Santhosh)
[08:56:31] I wanted to change the liftwing branch (switch the supported vllm version), but I have the PR open for upstream and the GH UI doesn't allow you to switch the branch, so I'll either have to create a new branch or change the branch we use in inf-services
[08:56:39] unless the PR gets merged soon
[09:28:28] I think it is ok, we can change the branch we use in inf-services
[09:28:52] or create a new branch if you prefer
[09:30:51] ok! let's switch the branch when we need it, I'm not making any changes for now so it is ok from my side
[09:31:09] I noticed that the postmerge pipeline failed on the patch that adds the cmd args to the image https://integration.wikimedia.org/ci/job/inference-services-pipeline-huggingface-publish/10/console
[09:31:41] and it failed when it tries to delete the image
[09:31:41] ```
[09:31:41] + docker rmi --force 330ad88509d9d3b3ec34c3f5bb01faeb980a907ad1c354a3513a30989dd9ce6e
[09:31:41] Error: No such image: 330ad88509d9d3b3ec34c3f5bb01faeb980a907ad1c354a3513a30989dd9ce6e
[09:31:41] ```
[09:32:18] but there is a new tag in the docker registry, I'm going to check if this is the image coming from this patch https://docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-huggingface/tags/
[09:35:10] dunno if there is any other way to check this without downloading and checking either the sha256 or inspecting the code in the image
[09:35:19] oh the new tag is coming from another patch https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1055977
[09:36:44] thanks! then the patch I want is already included in that one (it was merged a couple of hours earlier)
[09:37:07] so I'm testing it now on bert/experimental and will open a patch after I'm sure it works
[09:37:34] ack o/
[10:00:12] Machine-Learning-Team: [LLM] Run LLMs locally in ml-testing - https://phabricator.wikimedia.org/T370656#10009637 (kevinbazira) I have been containerizing and testing the KServe HuggingFace model-server with [[ https://doc.wikimedia.org/releng/blubber/ | blubber ]], similar to how we deploy it on LiftWing. I...
[10:23:12] (PS1) Kevin Bazira: huggingface: remove unused vllm package [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1056461 (https://phabricator.wikimedia.org/T370656)
[10:25:07] * aiko lunch!
[10:25:57] (CR) Ilias Sarantopoulos: [C:+1] huggingface: remove unused vllm package [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1056461 (https://phabricator.wikimedia.org/T370656) (owner: Kevin Bazira)
[10:27:15] (CR) Kevin Bazira: [C:+2] huggingface: remove unused vllm package [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1056461 (https://phabricator.wikimedia.org/T370656) (owner: Kevin Bazira)
[10:27:58] (Merged) jenkins-bot: huggingface: remove unused vllm package [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1056461 (https://phabricator.wikimedia.org/T370656) (owner: Kevin Bazira)
[10:29:45] (PS6) Nik Gkountas: Recommend articles to translate based on topic [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: Santhosh)
[10:30:03] * klausman lunch
[10:42:26] (PS7) Santhosh: Recommend articles to translate based on topic [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873)
[10:46:20] (CR) Santhosh: "I changed the AND operator from `&` to `+`. Anyway it require URL encoding." [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: Santhosh)
[10:59:23] * isaranto lunch
[12:22:04] Machine-Learning-Team: [LLM] Allow additional cmd arguments in hf image - https://phabricator.wikimedia.org/T370670#10010193 (isarantopoulos) With the merged patch we would have the below change in deployment charts: ` command: [ "./entrypoint.sh"] args: ["--dtype", "float32"] ` Tested the merged patch by...
[12:43:36] I'm testing the above change in the gemma deployment
[12:59:20] ok, this one works great. I tested the image with the latest change kevin pushed that removed the vllm image
[12:59:26] *package not image
[13:11:42] This is still 9B, right?
[13:19:46] yes
[13:19:56] Good morning all
[13:20:59] \o morning, Chris
[13:26:25] \o morning!
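A rough sketch of the docker stats polling mentioned at 08:28:59 above, i.e. watching a container's memory while the model server loads its model. The container name and the sampling interval are illustrative assumptions, not values taken from this log:
```python
import subprocess
import time

CONTAINER = "huggingface-server"  # hypothetical container name, adjust to the real one

def mem_usage(container: str) -> str:
    """Return the MemUsage column from `docker stats --no-stream` for one container."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", container],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    # Sample every 5 seconds while the model loads, then eyeball the peak.
    for _ in range(60):
        print(time.strftime("%H:%M:%S"), mem_usage(CONTAINER))
        time.sleep(5)
```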
[13:34:44] (PS1) Ilias Sarantopoulos: docs: add info how to use a newly released hf model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1056504
[13:43:12] isaranto: so I have looked at the metrics side some more, and there is an oddity with the HF server
[13:43:41] As I mentioned, the istio metrics are probably not suitable, since they clearly measure something that isn't end-to-end latency
[13:44:33] But if I scrape the HF server endpoint, I get https://phabricator.wikimedia.org/paste/edit/form/26/
[13:45:06] Note how at the bottom there are metric descriptors that might be useful, but they have no actual exemplars
[13:46:23] I guess you mean this https://phabricator.wikimedia.org/P66917 (you pasted the link to the paste form :) )
[13:46:28] oops
[13:46:36] https://phabricator.wikimedia.org/P66917 is indeed correct
[13:47:54] I dunno if the HF server binary/module needs to be told to export metrics. It would also be nice to be able to add a prefix to the metrics (like hf_server or sth)
[13:50:39] thanks! we'll have to look into that, I haven't checked what the status of exported metrics for the hf server is. The difference with the other model servers that we have is that we're using a different api
[13:51:42] https://github.com/kserve/kserve/tree/master/python/kserve this seems to indicate that it should just work, but I haven't dived into the code
[13:51:52] all the descriptors we can see at the bottom of the above paste refer to the predict/explain/preprocess api as they are defined in the standard kserve model servers. Although these exist in the hf server as well, we are using the api/completions endpoint instead
[15:06:50] so the test results for the low_cpu_mem_usage option show it does reduce memory usage when loading a model (3.4G vs 1.6G)
[15:07:03] however the improvement is not as significant as just setting the torch dtype to bfloat16 (3.4G vs 0.4G)
[15:07:38] I'm going to create a phab task and add the detailed results there
[15:10:01] nice!
[15:10:29] which model did u use? is the 0.4G usage with the low_cpu_mem_usage + bfloat16?
[15:12:04] gemma2-9b
[15:15:19] with or w/o the low_cpu_mem_usage it was ~0.4G in both cases
[15:19:02] the starting 3.4G is the thing that puzzles me. the model is approx 18GB in size. So the GPU VRAM that is used in the 9b model case is close to the 18GB mark https://grafana.wikimedia.org/goto/4fNttLuSR?orgId=1
[15:19:53] now that I look into the cpu memory of the kserve container it seems that this is also the case for the cpu memory https://grafana.wikimedia.org/goto/LIbhpLXSg?orgId=1
[15:22:36] I'm starting to wonder if the x2 memory usage is not the case anymore for transformers. I either missed this or indeed it could use half the memory
[15:47:51] yeah not sure why it took less than 18G on my laptop
[15:49:30] we'll figure it out!
[15:50:09] I opened a patch to update the gemma2 model image and also change the way we pass cmd args https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1056538
[15:50:52] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144#10011189 (KStoller-WMF)
[15:51:11] nice! looking
[15:52:29] I'll investigate the bert model failure a bit more, but maybe sth is off, since it worked when we last tried it locally https://phabricator.wikimedia.org/T370656#10003077
[16:00:06] what if you change dtype to float16? the above used float16 and you used float32
[16:00:14] don't know if it affects anything
[16:02:06] I just tested it with float16 and got the same error :(
[16:32:06] one thing I need to do this week that I still owe to folks is this https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ORES/+/1035044
[16:32:38] figure out how to make the same requests in the outlink-topic model using the revid instead of the article title
[16:32:54] so taking a little detour from LLMs
[16:40:01] (CR) Nik Gkountas: Recommend articles to translate based on topic (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: Santhosh)
[16:40:05] (CR) Nik Gkountas: [C:+2] Recommend articles to translate based on topic [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: Santhosh)
[16:40:43] (Merged) jenkins-bot: Recommend articles to translate based on topic [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: Santhosh)
[16:54:16] (PS1) Nik Gkountas: Add support for section translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056556 (https://phabricator.wikimedia.org/T370746)
[16:54:51] (CR) CI reject: [V:-1] Add support for section translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056556 (https://phabricator.wikimedia.org/T370746) (owner: Nik Gkountas)
[16:58:46] (PS2) Nik Gkountas: Add support for section translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056556 (https://phabricator.wikimedia.org/T370746)
[16:59:22] (CR) CI reject: [V:-1] Add support for section translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056556 (https://phabricator.wikimedia.org/T370746) (owner: Nik Gkountas)
[16:59:38] https://simonwillison.net/2024/Jul/24/mistral-large-2/ Oh look :)
[17:00:00] 123B params, ouch
[17:00:08] ouch!
[17:00:18] from the link: Mistral Large 2 is 123 billion parameters, "designed for single-node inference" (on a very expensive single node!) and has a 128,000 token context window, the same size as Llama 3.1.
[17:00:22] nice though!
[17:00:40] (PS3) Nik Gkountas: Add support for section translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056556 (https://phabricator.wikimedia.org/T370746)
[17:01:33] I would want `mistral_large_in_a_small_size` :P
[17:01:35] logging off for the day folks, have a nice evening/rest of day!
[17:01:45] \o
[17:11:11] bye Ilias o/
[17:37:56] Machine-Learning-Team: [LLM] Explore low_cpu_mem_usage option when loading model in transformers - https://phabricator.wikimedia.org/T370935 (achou) NEW
[17:51:56] Machine-Learning-Team: [LLM] Explore low_cpu_mem_usage option when loading model in transformers - https://phabricator.wikimedia.org/T370935#10012109 (achou) First I tested this option enabled in the kserve huggingface server (using this [[ https://github.com/wikimedia/kserve/commit/6754b18b40351bc35152f0a33...
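On the metrics discussion at 13:43-13:51 above: a minimal sketch of what an end-to-end latency histogram with an hf_server prefix could look like using prometheus_client. This is not how the kserve huggingface server is instrumented today; the metric name, the handler wrapper, and the port are assumptions for illustration only:
```python
import time
from prometheus_client import Histogram, start_http_server

# Hypothetical metric: end-to-end latency of the completions endpoint,
# prefixed with hf_server as floated above.
COMPLETIONS_LATENCY = Histogram(
    "hf_server_completions_duration_seconds",
    "End-to-end latency of completions requests",
)

def run_inference(request):
    time.sleep(0.1)  # stand-in for the actual model generation
    return {"choices": []}

def handle_completions(request):
    # Time the whole request, not just the istio hop in front of it.
    with COMPLETIONS_LATENCY.time():
        return run_inference(request)

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics on an arbitrary port
    handle_completions({"prompt": "hello"})
```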
[18:10:48] Machine-Learning-Team: [LLM] Explore low_cpu_mem_usage option when loading model in transformers - https://phabricator.wikimedia.org/T370935#10012177 (achou) Next I tested this option locally when loading the `gemma-2-9b-it` model in transformers using this simple [[ https://huggingface.co/google/gemma-2-9b-...
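The script referenced in that comment is truncated above, so what follows is only an approximate sketch of the loading comparison described in T370935, using the documented transformers options (low_cpu_mem_usage and torch_dtype); it is not the exact code that was run:
```python
import torch
from transformers import AutoModelForCausalLM

model_id = "google/gemma-2-9b-it"

# Baseline: default loading (weights end up in fp32 unless a torch_dtype is passed).
# model = AutoModelForCausalLM.from_pretrained(model_id)

# Option under test: load weights shard by shard to keep peak CPU memory lower.
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True)

# For comparison: casting to bfloat16 at load time, which gave the biggest
# reduction in the numbers reported earlier in the log.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True
)
```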