[07:16:37] (PS1) Kevin Bazira: Makefile: add support for langid [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1003032 (https://phabricator.wikimedia.org/T357382)
[07:25:29] (CR) Kevin Bazira: "To make reviewing easier, here are the commands I used to test the langid model-server build:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1003032 (https://phabricator.wikimedia.org/T357382) (owner: Kevin Bazira)
[08:16:33] Good morning *
[10:12:03] (PS6) AikoChou: revertrisk: use GPU for revertrisk-multilingual [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995214 (https://phabricator.wikimedia.org/T356045)
[10:15:50] (CR) AikoChou: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995214 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[10:22:45] (Merged) jenkins-bot: revertrisk: use GPU for revertrisk-multilingual [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/995214 (https://phabricator.wikimedia.org/T356045) (owner: AikoChou)
[10:26:44] Morning!
[10:39:15] o/ Tobias!
[10:41:41] Hey Ilias :)
[11:01:29] post-merge failed 0.0
[11:02:51] waiting for it, retrying..
[11:07:22] Machine-Learning-Team: Support running revertrisk-multilingual model-server via Makefile - https://phabricator.wikimedia.org/T356501 (achou) Open→Resolved
[11:13:36] it seems that it stopped while pushing the wikidata image
[11:35:07] mmm, but I saw the multilingual-publish failed https://integration.wikimedia.org/ci/job/inference-services-pipeline-revertrisk-multilingual-publish/66/execution/node/59/log/
[11:37:08] I'm going to build it manually
[11:38:35] mm, seems that's not possible
[11:42:13] (PS1) Ladsgroup: Migrate away from wfGetDB() [extensions/ORES] - https://gerrit.wikimedia.org/r/1003397 (https://phabricator.wikimedia.org/T330641)
[11:43:08] I started one - let's see https://integration.wikimedia.org/ci/job/inference-services-pipeline-revertrisk-multilingual-publish/
[11:44:02] (PS2) Ilias Sarantopoulos: Makefile: add support for langid [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1003032 (https://phabricator.wikimedia.org/T357382) (owner: Kevin Bazira)
[11:44:15] (CR) Ilias Sarantopoulos: [C: +1] Makefile: add support for langid [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1003032 (https://phabricator.wikimedia.org/T357382) (owner: Kevin Bazira)
[11:44:17] ah thanks!
[11:44:36] kevinbazira: I just rebased the above patch. LGTM! nice job
[11:45:40] isaranto: great! thanks for the review :)
[11:45:58] (CR) Kevin Bazira: [C: +2] Makefile: add support for langid [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1003032 (https://phabricator.wikimedia.org/T357382) (owner: Kevin Bazira)
[11:47:06] (Merged) jenkins-bot: Makefile: add support for langid [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1003032 (https://phabricator.wikimedia.org/T357382) (owner: Kevin Bazira)
[11:47:16] (CR) CI reject: [V: -1] Migrate away from wfGetDB() [extensions/ORES] - https://gerrit.wikimedia.org/r/1003397 (https://phabricator.wikimedia.org/T330641) (owner: Ladsgroup)
[11:48:58] * isaranto going for lunch!
[11:56:15] (PS2) Ladsgroup: Migrate away from wfGetDB() [extensions/ORES] - https://gerrit.wikimedia.org/r/1003397 (https://phabricator.wikimedia.org/T330641)
[11:58:35] * klausman lunch
[12:10:41] the image seems to take too long to build
[12:28:24] oh my, these gpu images take a ton of time to build
[12:28:34] Any idea why?
[12:28:46] I'm building one locally to validate the new requirements for llms
[12:29:05] I mean, the whole ROCm/Torch stack is big, so I'd expect it to take a bit, but I wonder if something else is going on.
[12:30:35] mainly because of rocm, can't think of any other reason. the python package by itself is 1.5GB, so downloading it takes time (let alone installing)
[12:31:11] I'm curious though if we can just use a smaller one since we already have the required drivers etc.
[12:31:43] Maybe that should be a layer on the WMF Docker Registry, so it can be fetched in one go (and no install scripts need to run)
[12:32:53] you mean building a custom base image and starting from there?
[12:34:20] yeah, basically.
[12:34:37] At least with the components that will rarely change
[12:35:29] I agree, that's a good idea. for the moment we use the same torch + rocm versions, so it is doable (2.0.1 and 5.4.2 respectively)
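For illustration, the shared base image discussed above could look roughly like the sketch below. This is only a sketch: the registry path and base image name are assumptions, not a real published image, and only the torch/ROCm pins (2.0.1 / 5.4.2) come from the discussion.

```dockerfile
# Hypothetical shared GPU base image -- the FROM line is illustrative;
# only the torch/ROCm version pins come from the discussion above.
FROM docker-registry.wikimedia.org/python3-build-bullseye:latest

# Pin the versions currently shared across the GPU model-servers.
ARG TORCH_VERSION=2.0.1
ARG ROCM_VERSION=5.4.2

# Install the ~1.5 GB torch+ROCm wheel once, in its own layer. Service
# images then build FROM this image and add only their own requirements,
# so the big layer is pulled from the registry in one go instead of being
# downloaded and installed on every build.
RUN pip install --no-cache-dir \
    "torch==${TORCH_VERSION}" \
    --index-url "https://download.pytorch.org/whl/rocm${ROCM_VERSION}"
```

Since that layer only changes on a torch or ROCm bump, per-service image builds would stop re-downloading the whole stack.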
[12:56:28] I tried to build one locally but it failed as well
[12:57:47] can I help? what was the issue when building locally?
[12:57:48] https://phabricator.wikimedia.org/P56759 weird, there's some nvidia stuff
[12:57:54] I just saw my build failed
[12:59:42] maybe we also need to change pyproject.toml in knowledge integrity? https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/v0.3.0/pyproject.toml?ref_type=tags#L21
[13:00:13] hmm, the sha mismatch. I've encountered this in the past a few times but I can't recall what I was doing
[13:01:15] ah, it is using 1.13... We'll need to find a way to facilitate both versions (cpu and gpu). You're right though, we should do something in pyproject.toml. lemme check
[13:23:22] I see in https://download.pytorch.org/whl/rocm5.4.2/torch/ that there is no package available for our architecture (x86_64) and pytorch 1.13.1 with rocm...
[13:24:14] we need either an x86_64 version or a "manylinux" version, if I am not mistaken. klausman: am I right?
[13:24:56] No, aarch64 is ARM64
[13:25:29] The topmost ones, e.g., torch-2.0.1+rocm5.4.2-cp39-cp39-linux_x86_64.whl, are the right arch
[13:26:26] cp39 stands for CPython 3.9 (CPython being the standard Python implementation, written in C)
[13:27:09] In that list, I don't see a 1.13.1 for Linux x86_64
[13:27:19] ack! I mentioned linux_x86_64, not aarch64
[13:27:37] my question is: do the manylinux wheels also work?
[13:27:38] I don't see a "manylinux" for x86_64
[13:28:55] ah, just read: only if they are for x86_64, e.g. manylinux2014_x86_64
[13:29:30] but yes, in this case we don't have an option for rocm 5.4.2 and torch 1.13.1
[13:30:13] aiko: we could try bumping torch to 2.0.1 in the ki repo. wdyt?
[13:37:37] I find the absence of a whl for linux_x86_64 of 1.13.1 a bit odd
[13:38:57] There _is_ one for ROCm 5.4
[13:39:01] https://download.pytorch.org/whl/rocm5.5/torch-2.1.2%2Brocm5.5-cp311-cp311-linux_x86_64.whl
[13:39:06] er, 5.5
[13:39:33] Also for 5.3
[14:30:34] hey all
[14:30:42] Hey Chris
[14:33:26] Hola!
[14:35:31] isaranto: yeah we could try that! if there is no option for rocm 5.4.2 and torch 1.13.1
[14:35:45] hi Chris
[14:42:36] I will also take a look at getting a newer rocm into WMF's apt repository. I think the newest might be a bit too much (because it has other deps), but we'll see
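For reference, the wheel check and the per-variant install discussed above could be scripted roughly as below. A sketch only: the grep filter is illustrative, and it assumes the build images run CPython 3.9 on linux_x86_64, as in the wheel names quoted in the discussion.

```bash
# List the torch wheels the ROCm 5.4.2 index publishes and keep only the
# ones matching our interpreter/platform tags (cp39, linux_x86_64).
curl -s https://download.pytorch.org/whl/rocm5.4.2/torch/ \
  | grep -o 'torch-[^"]*cp39-cp39-linux_x86_64\.whl' \
  | sort -u

# GPU image: install torch explicitly from the ROCm wheel index.
pip install "torch==2.0.1" --index-url https://download.pytorch.org/whl/rocm5.4.2

# CPU image: same pin from the CPU-only index, keeping both variants in sync.
pip install "torch==2.0.1" --index-url https://download.pytorch.org/whl/cpu
```

Relaxing the torch constraint in knowledge_integrity's pyproject.toml to allow 2.0.1, as proposed above, would then let both install paths resolve the same version.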
[17:04:14] (PS1) Ilias Sarantopoulos: llm: enable quantization with AutoGPTQ [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1003488 (https://phabricator.wikimedia.org/T354870)
[17:20:18] logging off folks o/
[17:20:33] have a nice evening/rest of your day
[18:57:21] Machine-Learning-Team, Goal: Goal: Implement caching for revertrisk-multilingual - https://phabricator.wikimedia.org/T353333 (calbon) Possibly change to LA-revertrisk
[19:20:22] night ilias!