[06:56:29] (PS1) Kevin Bazira: RRLA: upgrade KI from v5 to v6 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742)
[06:58:13] (CR) CI reject: [V: -1] RRLA: upgrade KI from v5 to v6 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[07:29:49] (PS2) Kevin Bazira: RRLA: upgrade KI from v5 to v6 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742)
[08:06:57] Good morning!
[08:20:33] o/ morning :)
[08:20:53] (PS5) AikoChou: Add a util function to detect GPU in resource_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793)
[08:22:16] (CR) AikoChou: Add a util function to detect GPU in resource_utils module (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793) (owner: AikoChou)
[08:37:44] (CR) AikoChou: RRLA: upgrade KI from v5 to v6 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[08:49:15] (CR) Kevin Bazira: RRLA: upgrade KI from v5 to v6 (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010672 (https://phabricator.wikimedia.org/T355742) (owner: Kevin Bazira)
[09:12:42] (CR) Ilias Sarantopoulos: [C: +1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793) (owner: AikoChou)
[09:16:37] I'm deploying the change to ores-legacy in production - finally :)
[09:26:13] Morning!
[09:26:43] Machine-Learning-Team, ORES: Inconsistent data type for articlequality score predictions on ptwiki - https://phabricator.wikimedia.org/T358953#9625793 (isarantopoulos) @He7d3r I have deployed the fix in production and it is working as expected.
[09:26:47] o/ Tobias
[09:28:45] Of course the full build of the torch whl I started last night failed 20m in without me noticing %-)
[09:35:37] (CR) AikoChou: [C: +2] "Thanks for the review!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793) (owner: AikoChou)
[09:41:57] (Merged) jenkins-bot: Add a util function to detect GPU in resource_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1010515 (https://phabricator.wikimedia.org/T359793) (owner: AikoChou)
[10:27:11] isaranto: o/ https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1008858
[10:27:29] when you have a moment pls review it, thanksss :)
[10:28:09] Sure, will review in a bit!
[10:55:06] * aiko lunch!
[10:58:19] aiko: I reviewed - added just a suggestion
[10:58:23] nice work!
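(For illustration only: the GPU-detection helper merged above lives in Gerrit change 1010515; below is a minimal sketch of what such a helper could look like on a ROCm host. The function name, the /dev/kfd check, and the torch fallback are assumptions, not the merged code.)

```python
import os


def gpu_is_available() -> bool:
    """Hypothetical sketch of a GPU-detection helper (not the merged code)."""
    # /dev/kfd is the ROCm compute device node; its presence is a cheap signal
    # that an AMD GPU has been exposed to the container.
    if os.path.exists("/dev/kfd"):
        return True
    # Fall back to torch, which reports ROCm devices through the cuda namespace.
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False
```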
[11:28:19] (PS2) AikoChou: revertrisk-ml: add a RevertRiskMultilingualGPU object [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045)
[11:31:13] * klausman lunch
[11:31:48] (CR) AikoChou: revertrisk-ml: add a RevertRiskMultilingualGPU object (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[11:32:08] Machine-Learning-Team, ORES: Inconsistent data type for articlequality score predictions on ptwiki - https://phabricator.wikimedia.org/T358953#9625916 (He7d3r) That is great! Thank you! 😃
[11:41:02] (CR) Ilias Sarantopoulos: "Nice work! This approach with the new Class is quite clean." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[12:14:26] I'm in a loop of building and failing and rebuilding blubber images :(
[12:24:45] :( what is the message for the failure?
[12:30:25] tons of stuff!
[12:30:51] I couldn't get it to work with poetry and now I circled back to the start to build it with a requirements.txt
[12:31:38] so I'm rebuilding everything from the start. The issue is that I probably need to park this work and we can all coordinate on the issue with the pytorch images as this is going to be blocked anyway
[12:31:44] going for lunch!
[13:06:01] ok I will also look into it after I finish my current work!
[13:18:37] (PS3) AikoChou: revertrisk-ml: add a RevertRiskMultilingualGPU object [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045)
[13:26:03] So building the Torch whl for just gfx900 results in a 261M whl file, building with all GPUs supported is 507M. Build time is 1h40m vs. 3h2m
[13:26:38] (CR) AikoChou: revertrisk-ml: add a RevertRiskMultilingualGPU object (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1008858 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[13:34:32] Machine-Learning-Team, Research: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9626543 (kostajh)
[13:36:11] hello folks!
[13:40:05] I am deploying Dragonfly to prod
[13:43:08] ok so the config is rolling out via puppet, checked on ml-serve1001 and it looks good
[13:49:10] elukey: fun fact, I ran fdupes inside the pytorch build container, and...
[13:49:16] root@b0eecbd9946f:/# fdupes -r -n -1S -m .
[13:49:18] 17751 duplicate files (in 10896 sets), occupying 456.2 megabytes
[13:50:02] hey Luca!
[13:51:01] klausman: sure but what about the torch dir? (That seems to be the root dir, not sure what mess the build container has..)
[13:51:08] or maybe I didn't get your point sorry
[13:51:26] The question I have is: what from the build dir is _actually needed_
[13:51:49] There probably are a whole pile of .o files and the like there.
[13:51:51] in theory only the .whl file that is produce counts
[13:51:57] *produced
[13:52:10] That I got from 507M to 261M as mentioned
[13:52:22] Just by setting gfx900 as the target arch
[13:52:45] didn't see the msg, but it is in line with what I got for two archs
[13:52:56] do you have the rocm binaries inside the whl?
[13:53:03] because I didn't get them
[13:53:12] rocm-smi rocm_agent_enumerator rocminfo
[13:53:16] that was the confusing part, does it mean that it relies on the system ones?
[13:53:18] These are on tab completion
[13:53:40] lemme pastebin the contents of /opt/rocm-6.0.0/bin/
[13:53:48] no I mean when you install the .whl, in the site-packages dir of the venv, you should have torch/libs
[13:54:00] sec
[13:54:05] and inside that dir the upstream package deploys the rocm binaries
[13:54:17] that is basically what causes torch to weigh several GBs
[13:54:49] no torch binaries, but a whole lot of HIP stuff
[13:54:58] er, no rocm bins
[13:56:13] torch/lib/libtorch_hip.so is the biggest chunk in there, at 394'882'736 bytes
[13:56:19] and if you run some test py code, does torch recognize the gpu?
[13:56:28] Haven't tested that yet
[13:56:41] ah okok super, this is where I got stuck (didn't find where to test it)
[13:56:56] if it works, it must mean that torch uses the system libs
[13:57:05] I run a custom kernel on that machine, so there is some extra spiciness :)
[13:57:06] but then I wonder how the upstream wheels are built
[15:17:07] isaranto: https://github.com/pytorch/pytorch/blob/main/tools/amd_build/build_amd.py is what I was talking about to move CUDA -> HIP occurrences in pytorch
[15:17:44] ack!
[15:18:42] as we dig into this it gets more custom!
[15:21:21] elukey: one thing I can already see in the upstream whl that ours doesn't have is torch/lib/rocblas
[15:26:35] yes yes this is what I noticed as well
[15:26:55] basically it misses all the rocm .so libs right? Except a few
[15:27:05] The torch-2.2.1-rocm5.7 whl is 1.6 (packed), so more than 3x what we've built
[15:27:14] 1.6G*
[15:27:37] and it is compressed, uncompressed is way worse
[15:29:24] Yes, about 30 .so's are in the upstream whl but not in ours
[15:29:50] libdrm, libhip*, librocm* and a few small ones
[15:30:08] either there is some cmake magic that we don't see in pytorch's repo that bundles the lib, or something else happens in their CI pipeline when they produce the wheels
[15:31:14] I'll update the task today or tomorrow with our findings so far. And then I'll maybe poke upstream about this. If we're lucky, we find more things we can remove that they just ship for compat/convenience
[15:32:01] (and if we're insanely lucky, they'll make the whole package set more modular. not likely)
[15:41:10] something really interesting after reading https://phabricator.wikimedia.org/T264209 - The limitation on the layer size has another dimension, namely before pushing a layer docker compresses it
[15:41:24] meanwhile the layer size in docker history $image-id is the uncompressed one
[15:41:37] this may explain why sometimes we are able to push images with big layers
[15:41:52] of course there seems to be no way to get the compressed size of a layer sigh
[15:45:22] ah yes
[15:45:29] docker manifest inspect docker-registry.wikimedia.org/amd-gpu-tester:0.0.15-1
[15:45:37] this shows that the biggest layer size is ~1G
[15:45:42] meanwhile uncompressed is 10G
[15:46:03] ok that explains the mystery
[15:53:12] TIL about docker manifest inspect!
[15:54:33] `docker manifest inspect docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-llm:stable`
[15:54:38] 2GB layer!
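(A small note on the manifest trick above: the per-layer compressed sizes are right there in the JSON, so the biggest layer can be pulled out with jq, assuming jq is installed; sizes are in bytes.)

```sh
# Compressed (pushed) layer sizes come from the manifest; these are what the
# registry sees on push, not the uncompressed sizes shown by `docker history`.
docker manifest inspect docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-llm:stable \
  | jq '[.layers[].size] | max'
```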
[15:55:22] * elukey nods
[15:55:28] yes now it makes more sense
[16:01:22] klausman: https://github.com/pytorch/builder/blob/main/manywheel/build_rocm.sh should be how torch upstream builds the giant wheels
[16:01:35] see line 77, ROCM_SO_FILES
[16:06:24] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569#9627061 (elukey) I found this build script: https://github.com/pytorch/builder/blob/main/manywheel/build_rocm.sh It should be how upstream packages the giant wheel files, includin...
[16:16:44] aaah, those STATIC vars are probably part of the problem
[16:20:07] I'll definitely give that a spin tomorrow, see what the sizes are
[16:23:35] Do you folks have +2 access on puppet repo?
[16:23:46] if this is ok can you please merge it? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1010245
[16:25:12] Looking
[16:25:37] thank u Tobias! 🙏
[16:27:29] And merged.
[16:28:55] Danke schön!
[16:29:14] Gern geschehen :)
[16:30:25] ok so really nice news
[16:31:07] 🥁
[16:31:12] I tried to use `docker save` to create a .tar with the locally built docker image related to revert risk, that Aiko is working on
[16:31:31] then I gzipped it, to get a ballpark measure of the compressed size of the image
[16:31:36] and the result is 2.1GB
[16:31:48] meanwhile the .tar is ~10G
[16:32:34] so in theory, if what I did above makes sense, we are failing to push that image since the docker registry accepts up to 2GB
[16:33:05] so we "just" need to save 0.1G or so?
[16:33:23] now this could be something that we ask ServiceOps - can we increase the tmpfs size for nginx to 3-4GB, to have room for more space? With the caveat that we'll create a base image etc..
[16:33:27] + dragonfly
[16:33:42] klausman: 0.1G compressed, and I am not sure how much that corresponds to uncompressed
[16:34:24] in theory if we bump the registry vms to get +2GB of ram (I think doable) and if we expand tmpfs for nginx, we should be good for some time
[16:34:41] but of course it may be a never ending game, this is to unblock us temporarily
[16:34:45] does it make sense?
[16:35:13] Yeah, I think a short term bump (kicking the can down the road) is entirely fine.
[16:35:26] I am pretty sure RAM got cheaper since those VMs were sized initially ;)
[16:36:24] this would buy us time to test our images, and then to come up with requirements when serviceops rethink the registry (during the next quarters)
[16:36:57] ok I'll try to write up a proposal in the main task, let's keep going with our tests though
[16:38:23] Ack
[16:39:01] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9627239 (isarantopoulos) I managed to build the huggingface image with blubber and downloading the specified model from HF (example with [[ https://huggingface.co/google-ber...
[16:40:01] Machine-Learning-Team, ORES: Add httpbb tests for ores-legacy - https://phabricator.wikimedia.org/T359871#9627245 (isarantopoulos) Open→Resolved
[16:40:03] Machine-Learning-Team, ORES: Inconsistent data type for articlequality score predictions on ptwiki - https://phabricator.wikimedia.org/T358953#9627247 (isarantopoulos) Open→Resolved
[16:49:27] elukey: totally makes sense!
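(For reference, the `docker save` + gzip ballpark described at 16:31 above boils down to a couple of commands; the image tag below is a hypothetical stand-in, not the actual name of the revert-risk image.)

```sh
# Sketch of the ballpark estimate described above; the tag is hypothetical.
IMAGE="revertrisk-multilingual-gpu:local"
docker save "$IMAGE" -o /tmp/image.tar     # uncompressed image, ~10G in the test above
gzip -k /tmp/image.tar                     # keep the .tar, write /tmp/image.tar.gz
ls -lh /tmp/image.tar /tmp/image.tar.gz    # compare uncompressed vs compressed (~2.1G) sizes
```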
[16:53:52] +2 from me as well
[16:56:42] thanks for the reviews folks :)
[16:56:44] posted to https://phabricator.wikimedia.org/T359067#9627299
[16:57:25] Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067#9627299 (elukey) @akosiaris thanks a lot for all the details, really appreciated, now I have a better understanding of the problem :) I have a proposal to unblo...
[17:01:08] isaranto: readability in staging doesn't hang anymore :)
[17:01:53] great! sorry for missing that patch. was meaning to do it today but got caught with other stuff
[17:04:06] isaranto: please don't say sorry, you are doing 100 other things :D
[17:04:20] I said I'd do it!
[17:05:12] all right going afk for today folks!
[17:05:17] have a nice rest of the day!
[17:07:07] I need a fresh mind! I'm out for the day as well. cu tomorrow folks o/
[17:22:05] And me, three :) \o
[17:24:32] logging off as well. have a nice evening folks :)
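(Closing reference note on the torch build thread from 13:26 and 15:17 above: restricting the wheel to a single GPU architecture is what brought it down to ~261M. A rough sketch of that kind of build, run from a PyTorch source checkout, is below; the exact flags and steps used in our build container are assumptions and may differ from what was actually run.)

```sh
# Sketch only: build torch for a single ROCm architecture instead of all of them.
export USE_ROCM=1
export PYTORCH_ROCM_ARCH="gfx900"      # only the arch we actually run on
python3 tools/amd_build/build_amd.py   # rewrite CUDA sources to HIP (script linked above)
python3 setup.py bdist_wheel           # the wheel lands in dist/

# Quick smoke test of the resulting wheel (the open question from 13:56);
# ROCm devices show up through the torch.cuda namespace:
pip install dist/torch-*.whl
python3 -c 'import torch; print(torch.cuda.is_available())'
```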