[03:44:48] (PS5) MPGuy2824: Migrate usage of Database::select to SelectQueryBuilder in ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/1007755 (https://phabricator.wikimedia.org/T312454)
[09:01:44] Good morning!
[10:09:21] I'm looking into why CI didn't run for readability model in inf services repo
[10:18:52] found it! we renamed the directory without changing the configuration in the config repo
[10:19:29] I mean integration/config
[10:35:24] Fixed it here -> https://gerrit.wikimedia.org/r/c/integration/config/+/1009218
[11:08:55] Machine-Learning-Team: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9605540 (isarantopoulos)
[13:04:28] Machine-Learning-Team, ORES: Inconsistent data type for articlequality score predictions on ptwiki - https://phabricator.wikimedia.org/T358953#9606178 (isarantopoulos) I found that this is caused because of the mixed schema of the responses returned by ORES. The `prediction` field is either a boolean, a...
[13:09:57] * isaranto lunch!
[14:04:50] Good morning all
[14:05:03] o/
[14:06:20] \o
[14:09:55] Machine-Learning-Team, Patch-For-Review: Test revertrisk-multilingual with GPU - https://phabricator.wikimedia.org/T356045#9606697 (elukey) To keep archives happy - we are discussing about the macro problem in T359067 TL;DR: when we pip install torch for ROCm, a ton of .so libraries are shipped with the...
[14:23:30] Protip, when your doc says "it's just a quick test", that may mean that the test is quick, but you get to sit in the waiting room for over an hour waiting for the _results_ :-/
[14:26:12] Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067#9606960 (elukey) Hi! Thanks a lot for the response, replying inline :) >>! In T359067#9602091, @akosiaris wrote: > Hi, > > So this is a difficult one to tackle...
[14:26:52] klausman: when I go to my GP I wait an avg of 40/50 mins, even if I have an appointment :D
[14:27:07] I started to get books with me
[14:27:39] yeah, complaining at a high level. If I show up 5m before the appointment, I am always called up within 10m of that
[14:27:53] very nice :D
[14:28:12] isaranto: ignorant me asking - what do you use to install torch + rocm via pip?
[14:28:21] I tried pip3 install -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.0.2/ torch but I see nvidia stuff installed
[14:29:06] klausman: not sure if you saw https://phabricator.wikimedia.org/T269684, another thing that we'll need to do soon-ish
[14:29:14] (seems a move to containerd)
[14:29:49] Yes, it came up during the k8s SIG meetings.
[14:31:21] AIUI, the replacement (containerd) is a mostly drop-in replacement. But we all know how that usually goes.
[14:32:42] And of course the migration path (do we need to reimage?) is going to be interesting.
[14:33:48] elukey: `pip/pip3 install --extra-index-url https://download.pytorch.org/whl/rocm5.4.2 torch`
[14:35:25] klausman: I saw people talking about it in #serviceops IIRC, they are thinking what strategy to use (coupling it with an upgrade or not)
[14:35:34] wikikube have now ~130 nodes :D
[14:37:35] isaranto: thanks! I have a bookworm container and pip install like you suggested brings in, afaics, nvidia stuff and non-rocm torch
[14:37:39] not sure why
[14:38:42] can you give an example? iirc there are some nvidia stuff in it (or named after cuda).
[14:39:14] not sure why but cuda has become like the universal word for GPU
[14:39:38] so I see pip emitting
[14:39:39] Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/rocm5.4.2
[14:39:48] and I suspect it picks up the former
[14:40:02] so it downloads pypi's metadata, and installs "regular" torch
[14:40:28] pkgs like
[14:40:29] Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in ./lib/python3.11/site-packages (from torch) (11.0.2.54)
[14:40:32] Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in ./lib/python3.11/site-packages (from torch) (10.3.2.106)
[14:40:58] pip 24.0
[14:42:08] ah ok I need to use --index-url
[14:42:14] now it seems to work :(
[14:42:24] sorry that's what I was going to say
[14:42:50] * elukey being dumb sorry
[14:42:55] but if you have more packages e.g. in a requirements txt then they won't be found (if I'm not mistaken)
[14:43:14] thinking out loud - in theory we expose the GPU to containers via the amd k8s plugin
[14:43:16] good thing is that in poetry that is solved. for example you can specify repository per package
[14:43:27] but not the drivers etc..
[14:43:49] so there shouldn't be anything preventing us from say using the most recent torch + rocm version
[14:43:55] (like rocm 6.x etc..)
[14:44:15] does it make sense?
[14:44:22] yep I think so
[14:45:02] I was meaning to try that for the huggingface image cause it needs torch 2.1.2 which doesn't have a wheel for rocm5.4.2
[14:46:26] okok, not great for consistency with "bare metal" but more flexible
[14:48:50] I added some thoughts about the docker image layer issue in https://phabricator.wikimedia.org/T359067#9606960
[14:55:10] ok with index-url it eventually fails since it doesn't find setuptools
[14:55:33] now I am very confused
[14:56:02] Machine-Learning-Team, Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123#9607328 (kevinbazira) @Seddon and @Isaac, the article-descriptions inference service is now live in Li...
[16:14:41] elukey: sth I don't like from pip documentation `There is no ordering in the locations that are searched.` :)
[16:15:34] Machine-Learning-Team: Add Dragonfly to the ML k8s clusters - https://phabricator.wikimedia.org/T359416 (elukey)
[16:15:48] isaranto: yes everything seems to mock me multiple times today
[16:15:49] * isaranto bbiab
[16:40:06] I got caught up with the ores-legacy app today, haven't worked on the docker image for HF
[16:40:30] but in the way that poetry works, I don't know if we could break it into multiple layers
[16:45:40] yeah I don't think we can
[16:45:49] it will be part of a single RUN
[16:47:52] I also found https://github.com/moby/buildkit/issues/1214
[16:48:03] to understand the diff between buildx and DOCKER_BUILDKIT=1
[16:48:15] it is also counter intuitive to what dependency management actually does
[16:49:13] blubber uses buildkit right?
[16:49:47] in theory yes
[16:50:18] thanks for sharing the above!
[16:50:22] but for some weird reason, without buildx poetry doesn't install the right torch package (the rocm one) when building rr-ml
[17:00:22] ok I am not totally off track
[17:00:23] https://integration.wikimedia.org/ci/job/inference-services-pipeline-revertrisk-multilingual/332/consoleFull
[17:00:33] I don't see the torch rocm variant in here
[17:00:55] that matches with my local build with DOCKER_BUILDKIT=1
[17:01:06] that is what CI uses afaics
[17:01:41] I see this `downloading torch-2.0.1-cp39-cp39-manylinux1_x86_64.whl.metadata`
[17:02:19] but Aiko was able to get the right one via docker buildx
[17:02:24] locally I mean
[17:06:22] I suggest let's go with what CI says
[17:07:09] what is the patch you're trying?
[17:08:04] Machine-Learning-Team, Patch-For-Review: Test revertrisk-multilingual with GPU - https://phabricator.wikimedia.org/T356045#9607986 (elukey) @achou something interesting! I checked the CI output of https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1006909: https://integration...
[17:08:12] I added all the infos --^
[17:08:34] it is interesting since we may not use buildx in the future if there are differences
[17:09:44] going afk folks, see you tomorrow!
[17:10:28] ciao Luca!
[17:10:35] Machine-Learning-Team, Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123#9607992 (Isaac) This is really wonderful news! Thanks @kevinbazira for slogging through this with us a...
[17:34:13] Machine-Learning-Team, ORES, Technical-Debt: Replace usage of wfGetDB() in ORES before the 1.42 cut so it can be hard-deprecated - https://phabricator.wikimedia.org/T357654#9608151 (matmarex) Open→Resolved Resolved in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ORES/+/1003397, I don...
[17:35:07] Machine-Learning-Team, Patch-For-Review: Test revertrisk-multilingual with GPU - https://phabricator.wikimedia.org/T356045#9608156 (isarantopoulos) As I see torch is being downloaded from pypi. Although I don't know exactly why this happens but it seems that the extra index (source in terms of pyproject....
[17:38:32] (PS1) Ilias Sarantopoulos: revertrisk-multilingual: add extra index for torch rocm [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009312 (https://phabricator.wikimedia.org/T356045)
[17:42:28] I found sth regarding the image --^
[17:42:34] going afk folks o/
[17:59:45] \o
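
Note on the torch wheel check discussed above: the CI log line `downloading torch-2.0.1-cp39-cp39-manylinux1_x86_64.whl.metadata` was read as a sign that the plain PyPI build was picked instead of the ROCm one. A minimal sketch of that check in Python follows. It assumes that ROCm builds served from download.pytorch.org carry a local version label such as `+rocm5.4.2` in the wheel filename; that convention, the helper names, and the second filename in the usage lines are assumptions for illustration and are not taken from the log.

```python
import re
from typing import Optional

# Wheel filenames have the shape
#   {name}-{version}[-{build}]-{python tag}-{abi tag}-{platform tag}.whl
# A local version label (the part after "+") lives inside the version field.
# pip may also log the ".whl.metadata" sidecar file, so that suffix is accepted.
_WHEEL_RE = re.compile(r"^(?P<name>[^-]+)-(?P<version>[^-]+)-.+\.whl(?:\.metadata)?$")


def local_version_label(wheel_filename: str) -> Optional[str]:
    """Return the local version label of a wheel filename, or None if absent."""
    match = _WHEEL_RE.match(wheel_filename)
    if match is None:
        raise ValueError(f"not a wheel filename: {wheel_filename!r}")
    _, sep, local = match.group("version").partition("+")
    return local if sep else None


def looks_like_rocm_build(wheel_filename: str) -> bool:
    """Heuristic only: assumes ROCm wheels are labelled '+rocm<version>'."""
    label = local_version_label(wheel_filename)
    return label is not None and label.startswith("rocm")


# Filename quoted in the log: no local label, so not flagged as ROCm.
print(looks_like_rocm_build("torch-2.0.1-cp39-cp39-manylinux1_x86_64.whl.metadata"))
# Hypothetical ROCm-style filename, used here only to exercise the helper.
print(looks_like_rocm_build("torch-2.0.1+rocm5.4.2-cp39-cp39-linux_x86_64.whl"))
```

Something like this could be run over a build log to flag the wrong variant early, but it only inspects filenames; it says nothing about why pip or poetry chose one index over the other.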