[06:13:51] (CR) Kevin Bazira: [C:+1] docs: add info how to use a newly released hf model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1056504 (owner: Ilias Sarantopoulos)
[06:24:59] (CR) Santhosh: [C:-1] "I had to do a minor fix to get it running." [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056556 (https://phabricator.wikimedia.org/T370746) (owner: Nik Gkountas)
[06:46:32] aloha!
[07:48:31] Instead of just writing a script to make requests to the huggingface images, I started adding a locust file we can use
[07:49:36] I thought it would be easier to reuse, although the input will probably be gibberish. I'm planning to provide variant input/output sizes
[07:50:56] Let me know if you have any ideas or suggestions on this
[08:26:39] Lift-Wing, Machine-Learning-Team: [LLM] add locust entry for huggingfaceserver - https://phabricator.wikimedia.org/T370992 (isarantopoulos) NEW
[08:43:15] (PS1) Ilias Sarantopoulos: (WIP) locust entry for hf [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1056872 (https://phabricator.wikimedia.org/T370992)
[08:45:12] (PS2) Ilias Sarantopoulos: (WIP) locust entry for hf [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1056872 (https://phabricator.wikimedia.org/T370992)
[09:24:53] morning, and sorry for missing the standup. I was in the middle of a reboot cycle and missed the calendar notification
[09:26:01] morning Tobias! no worries
[09:26:37] klausman: I'm looking into some VRAM usage spikes that took place while I was making requests; this could also be the cause of the outage of the 27b model
[09:26:48] https://grafana.wikimedia.org/goto/gOeLKyXIR?orgId=1
[09:26:57] nothing to do, just an fyi
[09:27:03] ack
[09:27:06] I'll probably redeploy a couple of times
[09:28:00] aiko: kevinbazira: this is the updated kserve roadmap with a lot of work planned for GenAI inference https://github.com/kserve/kserve/pull/3810
[09:28:36] isaranto: o/ thanks for sharing. let me have a look
[09:57:57] (CR) Nik Gkountas: Add support for section translation recommendations (5 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056556 (https://phabricator.wikimedia.org/T370746) (owner: Nik Gkountas)
[09:58:03] (PS4) Nik Gkountas: Add support for section translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056556 (https://phabricator.wikimedia.org/T370746)
[10:11:36] (CR) Santhosh: [C:+2] Add support for section translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056556 (https://phabricator.wikimedia.org/T370746) (owner: Nik Gkountas)
[10:12:53] (Merged) jenkins-bot: Add support for section translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056556 (https://phabricator.wikimedia.org/T370746) (owner: Nik Gkountas)
[10:21:05] * isaranto lunch
[10:24:29] * klausman too
[12:02:59] Lift-Wing, Machine-Learning-Team, Patch-For-Review: [LLM] add locust entry for huggingfaceserver - https://phabricator.wikimedia.org/T370992#10014330 (isarantopoulos) I've managed to run the locust tests from stat1008 for `gemma2-9b-it` using the following process: # Checkout the isvc repo and th...
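Note on the locust entry discussed above (T370992, 07:48-08:45): a minimal locustfile for a huggingfaceserver isvc could look roughly like the sketch below. The request path, model name, and payload sizes are illustrative assumptions, not the actual contents of change 1056872; the target host would be passed on the command line.

    # locustfile.py -- minimal load-test sketch for a huggingfaceserver isvc.
    # The request path, model name, and payload sizes below are illustrative
    # assumptions, not the actual contents of change 1056872.
    import random
    import string

    from locust import HttpUser, between, task


    def gibberish(n_words: int) -> str:
        """Build a throwaway prompt of n_words random five-letter words."""
        return " ".join(
            "".join(random.choices(string.ascii_lowercase, k=5))
            for _ in range(n_words)
        )


    class HuggingFaceUser(HttpUser):
        wait_time = between(1, 3)

        @task
        def completion(self):
            # Vary input and output sizes per request, as suggested above.
            payload = {
                "model": "gemma2-9b-it",  # assumed model name
                "prompt": gibberish(random.choice([16, 128, 512])),
                "max_tokens": random.choice([32, 128]),
            }
            self.client.post("/openai/v1/completions", json=payload)

It could then be run headless from stat1008 with something like: locust -f locustfile.py --host https://<isvc-endpoint> --headless -u 4 -r 1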
[13:23:27] (CR) Ilias Sarantopoulos: "Apologies for the late reply, I haven't managed to solve that yet :(" [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132) (owner: Rockingpenny4)
[13:29:07] Lift-Wing, Machine-Learning-Team: [articletopic-outlink] fetch data from mwapi using revid instead of article title - https://phabricator.wikimedia.org/T371021 (isarantopoulos) NEW
[13:29:59] I created a task for the outlink model preprocessing. I requested a bit more information on the reasoning that would justify this change and the work involved, and I'll update it once it is clearer
[13:43:17] I tried to use the memory util in huggingface spaces but can't get it to work for gemma2 https://huggingface.co/spaces/hf-accelerate/model-memory-usage
[13:57:11] the guide for inference on amd gpus in hf seems to be updated https://huggingface.co/docs/optimum/onnxruntime/usage_guides/amdgpu#accelerated-inference-on-amd-gpus although incomplete
[13:57:45] it uses optimum.onnxruntime, which would be tricky to test with our current implementation
[14:01:18] Good morning
[14:30:36] \o morning Chris
[14:43:00] so, putting the model on 2 gpus is as easy as assigning 2 gpus to the pod. I tried it and it works. However, running inference isn't that simple, as you can see in the related error https://phabricator.wikimedia.org/P66927
[14:43:56] by default everything (all tensors for model + input) is expected to be on the same device
[14:44:25] I'll open up a task with info so that we can add resources and track this properly
[14:45:32] it is probably something we'll do 2-3 steps down the road from now, but it seems that we'll need it
[14:45:38] thanks kevinbazira: for bringing this up today, it totally slipped my mind!
[14:55:08] Oh interesting!
[14:55:30] I definitely think it will be needed
[14:56:58] Perplexity runs on llama3 70b; we might get away with a smaller model, but I doubt a 7b model will do everything product and research want
[15:01:20] I'm confident we can manage to fit 70b; bringing down inference latency will be a different beast :)
[15:20:03] (PS1) KartikMistry: Fix accesslog file name [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056966
[15:24:28] (CR) Abijeet Patro: [C:+2] Fix accesslog file name [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056966 (owner: KartikMistry)
[15:25:12] (Merged) jenkins-bot: Fix accesslog file name [research/recommendation-api] - https://gerrit.wikimedia.org/r/1056966 (owner: KartikMistry)
[16:03:43] going afk folks o/
[16:07:55] \o
[16:59:48] o/
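Note on the model-memory-usage space (13:43): when the tool fails for a given checkpoint, a back-of-envelope estimate from the parameter count gets close enough for capacity planning. A minimal sketch, assuming fp16/bf16 weights and a rough 20% allowance for KV cache and activations:

    def estimate_vram_gb(n_params: float, bytes_per_param: int = 2) -> float:
        """Rough inference-time VRAM for the weights alone.

        bytes_per_param: 2 for fp16/bf16, 4 for fp32, 1 for int8.
        The 1.2 factor is a rule-of-thumb allowance for KV cache/activations.
        """
        return n_params * bytes_per_param * 1.2 / 1024**3

    # e.g. gemma2-9b in bf16: 9e9 params * 2 bytes * 1.2 -> roughly 20 GB
    print(f"{estimate_vram_gb(9e9):.1f} GB")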
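Note on the optimum.onnxruntime guide (13:57): the flow it describes boils down to roughly the sketch below. This is untested against our images; the model id is an example, ONNX export support varies by architecture, and the provider name follows the guide.

    # Sketch of the optimum.onnxruntime flow from the AMD GPU guide; untested
    # against our images. The model id is an example and ONNX export support
    # varies by architecture.
    from optimum.onnxruntime import ORTModelForCausalLM
    from transformers import AutoTokenizer

    model_id = "google/gemma-2-9b-it"  # example id
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # export=True converts the checkpoint to ONNX on the fly;
    # ROCMExecutionProvider targets AMD GPUs, per the guide.
    model = ORTModelForCausalLM.from_pretrained(
        model_id,
        export=True,
        provider="ROCMExecutionProvider",
    )

    # On ROCm builds of torch, the GPU is still addressed as "cuda".
    inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))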
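Note on the two-GPU inference error (14:43, P66927): with plain transformers, the usual way around the "all tensors on the same device" failure is to let accelerate shard the model via device_map="auto" and move the inputs to the device of the first shard. A minimal sketch, assuming transformers plus accelerate are available in the image; the model id is an example:

    # Sketch: sharding a model across the two GPUs assigned to the pod so
    # that inference avoids the "all tensors on the same device" error.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "google/gemma-2-27b-it"  # example id

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # device_map="auto" lets accelerate place layer groups on each GPU and
    # insert hooks that move hidden states between devices during forward().
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # Inputs still need to be moved explicitly to the device holding the
    # embedding layer, which is what the error in P66927 complains about.
    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With this layout only the input tensors need explicit placement; accelerate handles moving activations between the shards during generation.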