[09:10:24] morning :)
[10:22:55] Morning Aiko :)
[12:11:50] * klausman lunch
[13:07:19] hello folks!
[13:07:25] Ohai Luca
[13:19:47] elukey: I did some poking and prodding regarding image size. For one thing, the image built by the manywheel script is 28G(!) even with only two archs enabled. On a whim, I also checked whether the binaries for rocm can be stripped, or even upx-packed, but upx does not understand the binaries (they're not simple .so files). There is a CUDA-specific tool called nvprune which allows you to
[13:19:49] remove unneeded gpu support (arches) from the binaries for CUDA, but it's not compatible with rocm (and I doubt it would help with single- or dual-arch builds anyway).
[13:21:31] already too much work, I think you can update the task with pros/cons and we can wrap up
[13:21:34] what do you think?
[13:22:04] Yeah, sounds good.
[13:22:16] (I mean great work, I meant that it already seemed too big of a deal to be viable for our team)
[13:22:25] I understood :)
[13:22:30] okok :)
[13:22:31] There's still value in "here'
[13:22:42] *"here's a bunch of stuff that doesn't work for us"
[13:40:08] Good morning all
[13:40:19] heyo Chris
[13:40:26] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569#9654014 (klausman) During some experimentation with various approaches of generating the Docker images differently, and stripping out unneeded information, I have tried the followin...
[13:51:20] hello!
[13:58:58] ok so next week we should hopefully have the memory bump in place for all the docker registry hosts, plus the new pytorch base image to start experimenting with
[14:00:22] \o/
[14:58:45] wow, I just got an error when testing a model server locally "ImportError: cannot import name 'RayServeHandle' from 'ray.serve.handle'"
[14:59:00] looks like kserve has a compatibility issue with the latest ray-2.10.0, which was released 19 hours ago
[14:59:38] a temporary solution for us is to add `ray[serve]<=2.9.3,>=2.9.2` before kserve in the requirements.txt
[15:00:21] someone has already opened an issue upstream https://github.com/kserve/kserve/issues/3541
[15:04:15] nice!
[15:14:41] (PS3) AikoChou: revertrisk: improve error messages [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1011305 (https://phabricator.wikimedia.org/T351278)
[15:15:26] aiko: o/ qq if you have time
[15:15:35] do you test rr-ml locally by any chance?
[15:16:09] I am asking since we could set up a nice test with https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1013335
[15:16:49] we use a tool called docker-pkg to build the production-images, it is available via pypi and very easy to use (I can give you some examples)
[15:17:27] you'd be able to have the docker image with pytorch+rocm locally, and with that you should be able to use it in blubber to build rr-ml
[15:17:45] rr-ml in this case should be modified to not pip install pytorch
[15:18:05] if everything works fine we'd demonstrate that the base image with pytorch works fine with blubber
[15:18:09] does it make sense?
[15:18:16] (of course we can do it on monday)
[15:19:17] elukey: makes sense! I can test it
[15:22:38] can you provide me some examples to build prod-images?
[15:25:28] aiko: sure! So you'd need to check out the production-images repo, then you cherry pick my change on it.
You can then create a python venv and pip install 'docker-pkg' in it
[15:25:48] then change dir to the local production-images repo, and run something like
[15:26:14] `docker-pkg build images/ --select *pytorch*`
[15:26:31] when done you should see the docker image built locally
[15:28:25] I see
[15:44:41] Hmm I got a docker error when building the image
[15:44:46] docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
[15:44:50] 🤔
[15:45:24] you should have a docker build log in the dir where you run the command
[15:47:18] the log is empty, I think it didn't start the build. something is off with my docker settings
[15:51:50] https://phabricator.wikimedia.org/P58898
[15:54:30] aiko: yeah I think it tries to connect to your local docker via unix socket but it fails
[15:54:33] is docker up?
[15:57:40] yes it is up
[16:01:14] I'm gonna restart it and try again
[16:10:42] any luck? (sorry I was afk)
[16:12:03] no :( that's weirrrrrd
[16:18:49] aiko: mmm so do you have anything like /var/run/docker.sock ?
[16:21:30] there is also some info in
[16:21:31] https://forums.docker.com/t/docker-errors-dockerexception-error-while-fetching-server-api-version-connection-aborted-filenotfounderror-2-no-such-file-or-directory-error-in-python/135637/6
[16:22:02] maybe the unix socket is not allowed and/or requires perms on macos?
[16:23:40] thanks, reading
[16:23:46] I have .docker/run/docker.sock
[16:24:36] aiko: mmm maybe export DOCKER_HOST=unix:///home/USERNAME/.docker/desktop/docker.sock ?
[16:24:47] replacing USERNAME etc.
[16:27:25] ohhhhh it works
[16:27:33] \o/
[16:28:59] thankssss Luca \o/ I also needed to enable the "Allow the default Docker socket to be used" setting
[16:29:49] super :)
[16:29:53] thank you for the patience!
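A minimal sketch of the socket fix worked out above, assuming the Docker client reads the daemon address from the DOCKER_HOST environment variable when the default /var/run/docker.sock is missing. The helper name `find_docker_host` and the candidate list are hypothetical; the paths are the ones mentioned in the conversation, with the current user's home directory standing in for USERNAME.

```python
from pathlib import Path


def find_docker_host(candidates):
    """Return a unix:// URL for the first candidate socket path that exists, else None."""
    for path in candidates:
        if Path(path).exists():
            return f"unix://{path}"
    return None


# Hypothetical candidate list, built from the paths mentioned in the chat.
CANDIDATES = [
    "/var/run/docker.sock",
    str(Path.home() / ".docker/run/docker.sock"),
    str(Path.home() / ".docker/desktop/docker.sock"),
]

host = find_docker_host(CANDIDATES)
if host is None:
    print("no Docker socket found; is Docker running?")
else:
    # Paste this line into the shell before running docker-pkg.
    print(f"export DOCKER_HOST={host}")
```

The sketch only checks that a path exists; it does not verify that a daemon is listening on it or that the current user is allowed to use it.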
[16:32:17] ahhh ERROR: image docker-registry.wikimedia.org/amd-pytorch22 failed to build, see logs for details
[16:32:39] 2024-03-22 17:31:37 [docker-pkg-build] ERROR - Unexpected error building image docker-registry.wikimedia.org/amd-pytorch22:2.2.1rocm5.7-1: Image docker-registry.wikimedia.org/python3-bookworm not found (image.py:208)
[16:34:29] aiko: mmm if you docker pull it?
[16:34:41] maybe it wants it to be local
[16:37:58] I pulled it and tried to build again but it failed
[16:39:12] aiko: very weird, do you have the full stack trace by any chance?
[16:39:20] sorry I hoped for a smoother testing :(
[16:41:43] https://phabricator.wikimedia.org/P58899 here
[16:41:58] no worries :D
[16:45:35] very weird
[16:47:48] klausman: have you encountered a similar issue? I vaguely remember us debugging something similar
[16:47:58] when you worked on the kserve images
[16:48:03] but I may misremember
[16:49:54] I am trying to remember.
[16:50:13] I think my error was something else, but for verification, I am trying a build with your cherrypick right now
[16:51:41] aiko: if you run docker-pkg with --debug, does it tell you anything useful?
[16:52:00] `* Built image docker-registry.wikimedia.org/amd-pytorch22:2.2.1rocm5.7-1`
[16:52:02] worked fine
[16:52:14] I suspect something is not working on macos
[16:52:23] yep, that would be my guess as well.
[16:52:38] thanks for checking :)
[16:59:29] elukey: https://phabricator.wikimedia.org/P58900 no idea.. :(
[17:00:17] localhost:None seems a key aspect
[17:01:03] ah, nvm, that's normal when using the domain socket (unix:///)
[17:02:22] aiko: have you tried the --select variant of the command line that Luca mentioned?
[17:02:31] i.e. docker-pkg build images/ --select *pytorch*
[17:02:56] In my experience, the selection of what to build never quite works the way I think, unless I use --select
[17:03:42] that doesn't work.
I got zsh: no matches found: *pytorch*
[17:03:50] ah, put it in single quotes
[17:03:55] '*pytorch*'
[17:04:23] zsh is trying to be helpful: it thinks you're referring to a file, when it should just send the * to docker-pkg
[17:05:36] ohhhh thanksss! it's building...
[17:05:41] \o/
[17:05:48] let's see
[17:07:18] nice :)
[17:07:45] build succeeded!!
[17:07:48] yayyyy
[17:08:31] nice!
[17:08:45] Hooray!
[17:08:47] thanks klausman for spotting that
[17:09:03] we can pause the testing and do the rest on monday aiko
[17:09:17] next step is to see if the pytorch image works in blubber as expected
[17:09:34] with rr-ml (seems to be the most prominent use case)
[17:09:51] okay! I'll continue on Monday :)
[17:09:56] thankssss
[17:10:04] all right going afk for the weekend folks
[17:10:10] have a nice rest of the week :)
[17:10:21] have a nice weekend o/
[17:10:28] \o
[17:10:39] I'm going afk as well
[17:10:50] same, same :)
[17:16:26] bye folks!
[21:58:34] night all
[23:49:55] (PS1) Umherirrender: Add explicit parenthesis around mixed boolean operator [extensions/ORES] - https://gerrit.wikimedia.org/r/1013647
[23:53:38] (PS2) Umherirrender: Add explicit parentheses around mixed boolean operator [extensions/ORES] - https://gerrit.wikimedia.org/r/1013647
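The `zsh: no matches found: *pytorch*` error from the build discussion earlier in the afternoon happens because the shell tries to expand the glob before docker-pkg ever sees it. A small sketch, using only the Python standard library, of how quoting keeps the pattern literal. The argument list mirrors the command quoted in the chat; nothing is executed here, since docker-pkg may not be installed.

```python
import shlex

# Arguments exactly as docker-pkg should receive them (pattern left unexpanded).
argv = ["docker-pkg", "build", "images/", "--select", "*pytorch*"]

# shlex.join quotes any argument containing shell metacharacters such as '*',
# producing a command line that is safe to paste into zsh or bash.
command = shlex.join(argv)
print(command)
```

Passing `argv` as a list to `subprocess.run` (without `shell=True`) would sidestep the problem as well, since no shell is involved to expand the pattern.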