[05:21:12] Good morning!
[06:48:47] Machine-Learning-Team: Add pyopencl requirements to images that use resource_utils - https://phabricator.wikimedia.org/T360212#9678582 (isarantopoulos) Open→Resolved
[06:49:38] Machine-Learning-Team, Wikipedia-Android-App-Backlog: Investigate increased preprocessing latencies on LW of article-descriptions model - https://phabricator.wikimedia.org/T358195#9678584 (isarantopoulos) Open→Resolved
[08:13:20] morning folks :)
[08:26:19] hey aiko o/
[08:30:24] I think I have never built so many docker images in my life before :D
[09:30:40] Machine-Learning-Team, Patch-For-Review: Create a Pytorch base image - https://phabricator.wikimedia.org/T360638#9679204 (isarantopoulos) The above "issue" with numpy seems that it is not an issue after all. Numpy was removed as a requirement after torch 1.9 but they do maintain an aggressive warning as...
[11:26:46] I've made a "breakthrough" on the hf image size and the docker layers -> https://phabricator.wikimedia.org/T357986#9679664
[11:26:51] let me know what you think
[11:30:58] o/
[11:49:24] Hey Luca!
[11:50:18] I'm going afk folks, will be back to check later
[11:54:33] ttl!
[12:49:28] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9679664 (isarantopoulos) I've built an image by explicitly defining all the requirements (instead of pip resolving the dependencies by its own). This resulted in a reduced i...
[12:51:08] Lift-Wing, Machine-Learning-Team, ORES, ChangeProp, and 5 others: Selectively disable changeprop functionality that is no longer used - https://phabricator.wikimedia.org/T361483#9679703 (elukey) Hello! There `ores_cache` job should be defined but disabled in the running config, we don't use it a...
[13:01:09] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9679991 (elukey) I have seen the same behavior, namely pip trying to download the torch's cpu version and ending up only installing nvidia-related packages. I like the expli...
[13:10:03] Morning all
[13:11:45] Lift-Wing, Machine-Learning-Team, ORES, ChangeProp, and 5 others: Selectively disable changeprop functionality that is no longer used - https://phabricator.wikimedia.org/T361483#9680024 (akosiaris) >>! In T361483#9679703, @elukey wrote: > Hello! > > There `ores_cache` job should be defined but d...
[13:20:23] (PS1) AikoChou: revertrisk: error handling for batch requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1016341 (https://phabricator.wikimedia.org/T360406)
[13:22:50] Lift-Wing, Machine-Learning-Team, ORES, ChangeProp, and 5 others: Selectively disable changeprop functionality that is no longer used - https://phabricator.wikimedia.org/T361483#9680093 (elukey) >>! In T361483#9680024, @akosiaris wrote: >>>! In T361483#9679703, @elukey wrote: >> Hello! >> >> The...
[13:26:01] hi Chris o/
[13:52:50] My apologies for the last minute notice, my wife needs to go to the hospital (nothing super serious, she woke up with a painful ear infection) so I have to cancel today's team meeting and my 1:1s afterwards.
[13:55:58] chrisalbon: +1 np!
[13:56:34] Thanks elukey, sorry. I hate when I cancel
[13:57:15] np! Take care!
[14:02:07] np!
[14:47:46] (CR) Umherirrender: [C:-1] "looks good, small issues to improve" [extensions/ORES] - https://gerrit.wikimedia.org/r/1007862 (https://phabricator.wikimedia.org/T358831) (owner: MPGuy2824)
[14:50:57] (CR) Kevin Bazira: "The expected results are returned when I run the base_model:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1016341 (https://phabricator.wikimedia.org/T360406) (owner: AikoChou)
[15:43:22] o/
[15:43:24] back!
[15:46:19] o/
[15:52:24] isaranto: I modified the base image to install a couple of apt packages as well
[15:54:04] I'm looking at it now
[15:54:49] elukey: in my patch I have just copied what you have done, shall I keep also the comments etc? referring to this https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1015297/5/images/amd/pytorch21/Dockerfile.template
[15:55:15] sorry just fixed it, it didn't work at first :D
[15:55:37] yeah I think we should, it is sad that we have to repeat all that blurb though
[15:56:03] maybe we could have a base image that does everything, and smaller dockerfiles that just install pytorch?
[16:01:11] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9681165 (isarantopoulos) I believe we would have the same issue even with poetry in the following scenario: 1. We have torch-rocm installed in the base image 2. the pa...
[16:02:45] good idea to include the apt packages!
[16:04:37] > maybe we could have a base image that does everything, and smaller dockerfiles that just install pytorch?
[16:04:37] the smaller image could be named something similar to python-bookworm or liftwing-bookworm (?)
[16:07:04] (CR) AikoChou: "Thanks for testing it! Just to confirm, did you set USE_BATCHER=True when you ran the batch model server? It seems like the error is occur" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1016341 (https://phabricator.wikimedia.org/T360406) (owner: AikoChou)
[16:26:43] elukey: so I'll w8 for your patch to be merged and then we can merge the pytorch2.1 patch and the hf one, right?
[16:27:09] ack!
[16:27:34] regarding the base image we'd still have to add the symbolic links after we pip install torch though
[16:28:34] yeah
[16:28:43] we can live with copy atm
[16:28:51] I'll try to sort it out tomorrow
[16:28:59] going afk for today folks! Have a nice rest of the day :)
[16:30:56] ciao elukey, have a nice evening!
[16:32:57] (CR) Kevin Bazira: "I have run it with `USE_BATCHER=True` using the 2 scenarios below and still got the same error:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1016341 (https://phabricator.wikimedia.org/T360406) (owner: AikoChou)
[17:10:12] (CR) Ilias Sarantopoulos: "@kevinbazira you'd have to set the USE_BATCHER=true when starting the model server as it is an env var used" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1016341 (https://phabricator.wikimedia.org/T360406) (owner: AikoChou)
[17:41:22] (CR) Kevin Bazira: "Thank you for the clarification Ilias. I have used the commands below and batch_model run successfully:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1016341 (https://phabricator.wikimedia.org/T360406) (owner: AikoChou)
[18:04:14] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9681807 (isarantopoulos) I have built the "final" image that also includes `vllm==0.2.7` which will be used for GPU inference optimization. The final size is 11.4GB and cont...
[18:05:33] logging off folks, have a nice evening!