[05:50:56] (PS6) MPGuy2824: Migrate usage of Database::select to SelectQueryBuilder in ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/1007755 (https://phabricator.wikimedia.org/T312454)
[06:04:16] (CR) MPGuy2824: Migrate usage of Database::select to SelectQueryBuilder in ORES (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1007755 (https://phabricator.wikimedia.org/T312454) (owner: MPGuy2824)
[06:38:45] (CR) Ilias Sarantopoulos: [C: +2] ores-legacy: fix mixed boolean and string field [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009470 (https://phabricator.wikimedia.org/T358953) (owner: Ilias Sarantopoulos)
[06:39:39] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009470 (https://phabricator.wikimedia.org/T358953) (owner: Ilias Sarantopoulos)
[07:51:30] Good morning folks :)
[09:37:37] (CR) Ilias Sarantopoulos: [C: +2] ores-legacy: fix mixed boolean and string field [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009470 (https://phabricator.wikimedia.org/T358953) (owner: Ilias Sarantopoulos)
[09:38:31] (Merged) jenkins-bot: ores-legacy: fix mixed boolean and string field [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009470 (https://phabricator.wikimedia.org/T358953) (owner: Ilias Sarantopoulos)
[11:06:25] (CR) Ladsgroup: [C: +2] "\o/" [extensions/ORES] - https://gerrit.wikimedia.org/r/1007755 (https://phabricator.wikimedia.org/T312454) (owner: MPGuy2824)
[11:23:20] (Merged) jenkins-bot: Migrate usage of Database::select to SelectQueryBuilder in ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/1007755 (https://phabricator.wikimedia.org/T312454) (owner: MPGuy2824)
[12:43:32] * isaranto lunch!
[13:27:01] hello folks!
[13:33:49] hey Luca!
[13:38:51] (PS1) Ilias Sarantopoulos: revertrisk: remove obsolete step from README [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009723
[13:40:51] (PS2) Ilias Sarantopoulos: revertrisk: remove obsolete step from README [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009723
[13:42:51] lol, wikibugs is lagging a bit. never really looked into how it works
[13:43:33] wow wikibugs is 20 years old!
[13:55:24] Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067#9615018 (akosiaris) >>! In T359067#9606960, @elukey wrote: > Definitely it is a weird use case. Pytorch decided to ship their version for AMD GPUs with all the...
[14:25:38] tried to strip symbols (creating tmp files) on ml-staging2001, for two big .so files
[14:25:45] few MBs saved, sigh
[14:25:48] --^ tbh I didn't think that we would have registry problems.
[14:26:27] I was thinking about latencies, GPUs and whatnot
[14:26:59] that's life I guess. some experience gained here ¯\_(ツ)_/¯
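For reference, a minimal sketch of the symbol-stripping experiment mentioned at 14:25:38 above; the library path is illustrative, not necessarily one of the files actually tried:

```bash
# Strip symbols into a temp copy first, so the original .so stays untouched
# until the before/after sizes have been compared.
so=/opt/rocm-5.4.0/lib/librocblas.so            # hypothetical target library
cp "$so" "/tmp/$(basename "$so").stripped"
strip --strip-unneeded "/tmp/$(basename "$so").stripped"
du -h "$so" "/tmp/$(basename "$so").stripped"   # compare sizes
```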
[14:27:24] I worried about it a little, but it was the last problem to think about; now it is biting us :D
[14:27:38] everything in the critical path is a liability: without the registry, no deployments
[14:30:40] I am going to deploy Dragonfly in staging
[14:30:49] if everything stops working it is my fault :D
[14:32:03] lemme deploy ores-legacy first
[14:32:16] too late :)
[14:32:23] gimme 5 mins, I'll roll back if needed
[14:32:34] Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067#9615247 (JMeybohm) >>! In T359067#9615018, @akosiaris wrote: >> Are Dragonfly's supernodes sharable between clusters? We are interested in adding Dragonfly...
[14:32:37] the main issue is during deployments
[14:33:36] never mind, we didn't really sync
[14:33:53] cool cool, I also deployed so we'll see
[14:34:27] ok, my deployment was smooth!
[14:34:57] yeah, but I was rolling out the new settings via puppet :D
[14:35:05] so not sure if the test was 100% reliable
[14:35:23] now Dragonfly sits between the kubelets and the docker registry
[14:40:23] sorry, I didn't think of it as a test for Dragonfly
[14:40:35] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569#9615272 (elukey) Ran a little test to see if stripping symbols from ROCm libraries could give us some space benefit: ` elukey@ml-staging2001:~$ du -hs /opt/rocm-5.4.0/lib/* | sort...
[14:40:36] nah no problem!
[14:40:55] klausman: https://phabricator.wikimedia.org/T359569#9615272 :(
[15:09:38] I just built a blubber image with torch 2.1.2-rocm5.5 and kserve 0.12 (the versions we need for the huggingface server) and the image is 5.5GB
[15:10:33] all in one layer?
[15:10:40] I mean most of it
[15:10:54] not sure how the size is so small
[15:11:22] I am trying to follow https://docs.amd.com/en/docs-5.0.0/how_to/pytorch_install/pytorch_install.html to build the wheel with fewer supported GPUs
[15:11:26] and compare
[15:11:32] me neither. I need to install some other stuff so I'll update the task later
[15:11:43] these are the packages https://phabricator.wikimedia.org/P58691
[15:12:08] is the rocm version installed though?
[15:12:15] and I have no idea what the nvidia one is; perhaps it has something to do with how pytorch handles GPUs
[15:12:36] do you see rocm libs inside /opt/lib/python/site-packages/torch ?
[15:12:45] I suspect this is the standard torch installed
[15:15:18] yep, false alarm. there is definitely something off
[15:16:08] I think I was just happy to see it :) . the bullseye version is 12.4GB and this was with bookworm, so not sure what happened there
[15:16:48] good that we found the inconsistency, otherwise I'd have flipped my laptop over the table and logged off for the weekend :D
[15:16:54] I'm thinking of exploring poetry for these builds, to make it more explicit where we fetch dependencies from (not now though)
[15:17:14] I saw that blubber supports poetry (at least I saw it in the changelog)
[15:26:01] could be good to have fewer surprises; for example, in the future we may be in trouble if from one build to the next we pull in nvidia stuff instead of the rocm libs
[15:27:09] Machine-Learning-Team: Add Dragonfly to the ML k8s clusters - https://phabricator.wikimedia.org/T359416#9615368 (elukey) Dragonfly deployed to staging, now we need to test it and see how it works :)
[15:31:48] elukey: aw, that's a bummer. Not nothing saved, but nowhere near what I'd hoped.
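A quick way to settle the "is the rocm version installed though?" question from 15:12:08: torch reports its HIP version on ROCm builds (standard torch behaviour, not specific to our images; the exact output strings are an assumption):

```bash
# A ROCm wheel carries a "+rocmX.Y" version suffix and a non-None HIP version;
# the standard CUDA/CPU wheel prints "None" for torch.version.hip.
python3 -c 'import torch; print(torch.__version__, torch.version.hip)'
# ROCm build: something like "2.1.2+rocm5.5 5.5.x"
# vanilla wheel: something like "2.1.2 None"
```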
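And if the poetry idea from 15:16:54 pans out, pinning where torch comes from could look roughly like this (a sketch assuming poetry >= 1.5; the source name is made up):

```bash
# Register the ROCm wheel index as an explicit source: torch can then only be
# resolved from there, never silently falling back to PyPI's CUDA build.
poetry source add --priority=explicit pytorch-rocm https://download.pytorch.org/whl/rocm5.5
poetry add --source pytorch-rocm "torch==2.1.2+rocm5.5"
```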
[15:35:58] recompiling torch is taking ages, but it seems easy enough from the upstream image
[15:36:10] we'll see with fewer GPUs supported
[15:36:59] I tried yesterday, but all my machines are on trixie, so it all came apart. And I wasn't sure whether e.g. build2001 was the right place for this kind of thing
[15:37:49] I am trying this one https://docs.amd.com/en/docs-5.0.0/how_to/pytorch_install/pytorch_install.html#option-3-install-pytorch-using-pytorch-rocm-base-docker-image
[15:38:27] That's the one I tried, I think, but it failed because something was missing, IIRC
[15:39:03] mmm, but how did trixie play a role?
[15:39:04] elukey: is build2001 the right place for such things? Or do we have a better host for heavy compiles?
[15:39:32] Oh wait, no, I tried the lernmaschine one.
[15:39:50] lernapparat*
[15:39:55] ah okok
[15:40:39] It also doesn't help with testing that rocm-dkms won't install on trixie, so I couldn't explore much on the actual machine (which does have a decent AMD (gaming) GPU)
[15:40:56] usually I try a container first for experiments; on build2001 we can't really install all that we need etc.
[15:41:16] I also didn't want to eat all the CPU on that VM
[15:50:17] if you do it for a long time it may bring some screaming from SREs :D
[15:51:34] the process to bring pytorch to ROCm is nice, they basically replace CUDA with HIP and rebuild
[15:51:50] kinda sad how CUDA is the standard and not open, sigh
[16:02:15] very interesting that most of the rocm libs are now in trixie
[16:02:35] I am checking the size of the libs, definitely way smaller than ours
[16:03:17] the 5.5 rocm series
[16:07:29] Machine-Learning-Team: Investigate if it is possible to reduce torch's package size - https://phabricator.wikimedia.org/T359569#9615581 (elukey) Debian Trixie (testing) offers ROCm 5.5 packages, and so far the size seems better than vanilla upstream ones: ` root@e055b7f3f246:/# du -hs /usr/lib/x86_64-linux-...
[16:07:47] added some info to --^
[16:12:26] Mh, I see errors with the amd.com install/build docs
[16:12:36] /opt/rocm-6.0.0/include/hip/hip_runtime.h:66:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
[16:12:39] 66 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
[16:12:57] (yes, I ran the tools/amd_build/build_amd.py script)
[16:13:20] The docs also mention `./.jenkins/pytorch/build.sh`, which I don't have
[16:14:46] Ah, I think it's ./.ci/pytorch/build.sh
[16:16:27] maybe let's split the work to be more "parallel", we are doing the same thing :)
[16:16:56] Well, I feel like I need to have the base thing working before I can try exploring tweaks to make it smaller
[16:17:41] I mean let's decide how to split the work on this task, if we want to do it together, to avoid overlapping too much
[16:18:11] I can let you do it and work on something else, but it feels like we are working on the same thing
[16:18:23] right. So we've established that stripping is going to help much.
[16:18:30] happy to do it, lemme know :)
[16:18:31] not*
[16:19:07] My next step would be to see how much savings we get from only selecting one GPU type (PYTORCH_ROCM_ARCH).
[16:19:55] that is what I am trying to do as well
[16:20:11] this is why I brought it up
[16:20:12] Beyond that, I'd want to know what the biggest chunks of a single-GPU image are, and have a close look at their names, to see if there are any obvious things we could ditch while not breaking PyTorch. The tricky bit there is imports that are unused but would still break things because the import fails
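Per the AMD docs linked above, the "one GPU type" build being discussed would look roughly like this (a sketch run in a pytorch source checkout inside the rocm/pytorch base image; gfx90a is an assumption, the right value depends on the actual card):

```bash
export PYTORCH_ROCM_ARCH=gfx90a        # build kernels for a single GPU arch only
python3 tools/amd_build/build_amd.py   # hipify: rewrite the CUDA sources to HIP
python3 setup.py bdist_wheel           # the wheel ends up in dist/
```

Restricting PYTORCH_ROCM_ARCH should shrink the compiled kernel blobs, which is where the hope for a smaller image comes from.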
[16:21:13] If you want to do that (plus other ideas I've missed), please go ahead. I'll bang my head against puppet and the deployment charts (for Cassie) then
[16:21:46] nope, this is not what I meant, it is just to avoid doing the same task/research/etc.
[16:22:08] I can work on other things if you want to take the lead, but we should sync in here to be more efficient
[16:23:00] lemme know if it makes sese
[16:23:02] *sense
[16:23:41] Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742#9615603 (kevinbazira) I noticed that in KI v6, pydantic data models were added to the `BaseRevision` class in the knowledge integrity schema. The `get_revision...
[16:24:09] I'm happy to let you do this (pytorch image reduction), I just looked at it because I was curious. If I happen to have ideas about things to try, we'll coordinate. Or am I missing something?
[16:26:11] that is good :) My only point is that the backlog should be split among us so we can work on different tasks in parallel, and sync if we want to split the work on some project for specific reasons
[16:26:29] feel free to keep building, just keep in mind to sync beforehand :)
[16:27:34] (Ilias is working on it too, but from the angle of HF and blubber etc., for example)
[16:28:22] Machine-Learning-Team: Add Dragonfly to the ML k8s clusters - https://phabricator.wikimedia.org/T359416#9615637 (elukey) a: elukey
[16:28:40] I think you probably have more/fresher knowledge about the current state / what we're trying to do / what doesn't work, so your time is spent more efficiently
[16:29:21] no problem from my side: if you like the task and want to learn these bits, I am more than happy to let you do it
[16:29:26] we are not in a big rush
[16:30:24] I want to have the Cassie caching prototype in prod sooner rather than later, and I think I have almost all of the "squishy" knowledge in my head, so I think my time is better spent on that.
[16:30:53] ooook, but please don't take what I wrote above as "you shouldn't work on it", that wasn't the message I wanted to convey
[16:31:03] It's not how I read it
[16:31:05] rather, let's sync often
[16:31:07] super :)
[16:31:21] (easy to misread on IRC so I add semantics just in case :D)
[16:31:31] I *do* have a "Squirrel!" problem with stuff like this :D
[16:32:06] I found out sth WONDERFUL which cost me several hours today
[16:32:26] can someone please try the following?
[16:32:36] go to https://download.pytorch.org/whl/rocm5.5
[16:32:52] and click on torch and let me know where that redirects you
[16:33:06] https://download.pytorch.org/whl/torch/
[16:33:12] same
[16:33:27] ok, then it is not me/my network or anything
[16:33:29] thanks!
[16:33:44] it should go to https://download.pytorch.org/whl/rocm5.5/torch/ which has the rocm packages
[16:34:10] I mean, on that (pt.org) page there are rocm packages
[16:34:16] I built an image yesterday and got the correct one, but the ones I built today, with whatever debian version, all got standard torch
[16:34:23] e.g. https://download.pytorch.org/whl/rocm5.7/torch-2.2.1%2Brocm5.7-cp39-cp39-linux_x86_64.whl
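A hedged way to debug a redirect like the one above without clicking around in a browser, just by looking at what the index actually serves (the exact output depends on how download.pytorch.org generates its listings):

```bash
# Which hrefs does the per-ROCm index page actually advertise for torch?
curl -s https://download.pytorch.org/whl/rocm5.5/ | grep -o 'href="[^"]*"' | grep torch
# And what does the torch subdirectory itself respond with (status, Location)?
curl -sI https://download.pytorch.org/whl/rocm5.5/torch/ | head -n 5
```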
[16:34:50] But also https://download.pytorch.org/whl/rocm5.5/torch-2.1.2%2Brocm5.5-cp39-cp39-linux_x86_64.whl#sha256=383ac7cc56df8184072e8c80ccd9d863e2651b62c095b2df608c3297007323e2
[16:35:28] yeah, it has all of them
[16:36:53] the 5.6 link was working properly until 3 minutes ago. my head is going to explode
[16:37:20] What do you mean "worked properly"? How does it fail now?
[16:37:24] anyway, now that I found it I'll figure it out. I wanted someone to check and share my Friday evening frustration
[16:37:48] I mean that going to https://download.pytorch.org/whl/rocm5.6 and clicking on torch would take me to https://download.pytorch.org/whl/rocm5.6/torch
[16:38:03] now it goes to https://download.pytorch.org/whl/torch/
[16:38:54] on https://download.pytorch.org/whl/rocm5.6/ clicking on torch takes me to https://download.pytorch.org/whl/rocm5.6/torch/
[16:39:01] maybe only the 5.5 dir is b0rk?
[16:39:35] _or,_ someone at pytorch org was doing late-Friday maintenance and you were unlucky :)
[16:39:39] dunno. my 5.6 doesn't work at the moment
[16:39:47] 5.4.2 works though
[16:40:17] I have a suspicion that this is a cluster of download servers, and one of them is out of sync
[16:40:26] this is why I ended up with a small docker image though. At least I'm happy to have found out
[17:04:21] the 5.6 one got fixed for me now, as I was preparing an issue on the pytorch GitHub
[17:04:29] the 5.5 one still doesn't work though
[17:11:44] Machine-Learning-Team: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9615755 (isarantopoulos) Some of the links from the pytorch repositories seem to be wrong today (at least this is when I noticed it). The links under https://download.pytorch.org/whl/rocm5.5 shou...
[17:13:06] going afk folks, have a nice weekend!
[17:14:52] \o enjoy your weekend
[17:14:59] \o
[17:25:01] two brain farts
[17:25:30] 1. never added the --extra-index-url in front of a pip install and was wondering why the docker image build was failing
[17:25:55] 2. rm -rf /* instead of rm -rf ./*
[17:30:18] (PS1) Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986)
[17:30:46] Well, I'm parking this WIP here and going afk folks, have a nice weekend!
[17:44:40] Ouch, some of my stuff got deleted.
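For the record, brain fart #1 above is about an invocation like the following (a sketch; versions match the ones discussed earlier). Pinning the "+rocm5.5" local version makes pip fail loudly rather than quietly resolving the CUDA build from PyPI:

```bash
# --extra-index-url keeps PyPI available for the other dependencies, while the
# +rocm5.5 suffix can only be satisfied by the pytorch ROCm index.
pip install --extra-index-url https://download.pytorch.org/whl/rocm5.5 \
    "torch==2.1.2+rocm5.5"
```

Using `--index-url` instead would drop PyPI entirely, which avoids the ambiguity but then every other dependency has to exist on the pytorch index too.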