[08:50:04] Machine-Learning-Team, Patch-For-Review: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons - https://phabricator.wikimedia.org/T363449#9766391 (kevinbazira) Hi @elukey, following the recent switch from `api-ro` to `mw-api-int-ro` in T362316. If we wante...
[09:40:51] Machine-Learning-Team: Have problem with migrating to LiftWing from ores - https://phabricator.wikimedia.org/T364089 (AgnesAbah) NEW
[09:46:33] Morning!
[10:45:01] * klausman lunch
[10:56:30] Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518#9766872 (kostajh)
[10:56:34] Machine-Learning-Team: Have problem with migrating to LiftWing from ores - https://phabricator.wikimedia.org/T364089#9766871 (kostajh)
[11:06:01] Machine-Learning-Team: Have problem with migrating to LiftWing from ores - https://phabricator.wikimedia.org/T364089#9766919 (kostajh) The script referenced in the Google Sheets App uses the MediaWiki Action API. You'll want to use the `revisions` property in the query module, and request the `oresscores` pr...
[12:33:00] Machine-Learning-Team, Patch-For-Review: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons - https://phabricator.wikimedia.org/T363449#9767337 (elukey) Hi Kevin! You have two options: * You use the new "transparent proxy" config and you call directly `...
[13:35:14] o/
[13:36:22] Machine-Learning-Team, MW-on-K8s, serviceops, SRE, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9767646 (elukey) Status: Lift Wing codfw has been migrated successfully, we are going to do eqiad on Monday 6th.
[13:40:00] heya Luca
[13:42:28] Good morning all
[14:26:41] Machine-Learning-Team: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191#9767900 (elukey) ` elukey@stat1010:~$ dpkg -S rocm-smi rocm-smi-lib: /opt/rocm-5.4.0/bin/rocm-smi elukey@stat1010:~$ apt-cache show rocm-smi-lib | grep Depends Depends: python3, rocm-co...
[14:31:12] elukey: I've been meaning to look into getting the relevant data out of the kernel/device without the need for (too much) AMD code. Not sure yet how feasible it is, but I think the -core dep is mostly because Debian follows AMD's structuring of code/packages.
[14:32:00] https://github.com/ROCm/amdsmi This is the main library
[14:36:28] klausman: you pointed me in the right direction, https://packages.debian.org/bookworm/librocm-smi64-1 looks way better than the package that we currently use (which is from AMD itself, made for ubuntu)
[14:36:52] so maybe we could use the package from Debian directly, even if it is only available from bookworm onwards
[14:38:36] oh, yeah, that might be a good approach. After all, we don't really need all of the functionality the SMI library offers (tweaking power levels etc)
[14:39:13] there is also https://packages.debian.org/bookworm/rocm-smi that is a drop-in replacement, and depends only on py3
[14:39:31] Adding the info to the task, maybe we can drop the packages when we upgrade to Bookworm
[14:39:37] Ack!
[14:40:13] Machine-Learning-Team: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191#9767957 (elukey) https://packages.debian.org/bookworm/rocm-smi https://packages.debian.org/source/bookworm/rocm-smi-lib The above are probably a good drop-in replacement, but they are av...
[14:40:59] klausman: as a test I am thinking of doing the following on ml-staging2001 - disable puppet, remove all the rocm packages, reboot and then test running a pod on the GPU
[14:41:02] would it be ok?
[14:41:12] Yeah, that sounds good.
[14:42:42] Machine-Learning-Team: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191#9767965 (elukey) After a chat with Tobias, we are going to test this: * disable puppet on ml-staging2001 * remove all ROCm packages * reboot * test running a pod requiring a GPU and make...
[15:29:48] klausman: if you have a moment https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1026952
[16:01:25] on it
[16:02:21] LGTM!
[16:03:15] danke
[16:04:08] building the new image :)
[16:06:19] Fingers crossed!
[16:17:32] Lift-Wing, Machine-Learning-Team, Patch-For-Review: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9768454 (elukey) This should be the diff between libdrm bullseye (2.4.104) and bookworm (2.4.114) versions: https://salsa.debian.org/xorg-team/lib/libdrm/-/compare/libd...
[16:37:54] aiko: congrats \o/ well deserved :)
[16:40:24] :)
[16:53:26] thank you luca <3 <3 <3
[16:55:44] going afk for today folks! Have a nice rest of the day and weekend!
[16:56:05] the new pytorch image is still being published, hopefully it will be ready soon :)
[16:57:24] o/ have a nice weekend! :)
[16:57:26] \o
[16:57:39] aiko: well done and congratulations from me, too
[17:01:12] thank you Tobias! \o/
[17:01:41] I got a lot of help from the team <3
[18:10:55] (CR) AikoChou: "Hi Kevin, thanks for working on this!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023542 (https://phabricator.wikimedia.org/T363449) (owner: Kevin Bazira)
[19:05:35] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9768972 (elukey) ` == Step 2: publishing == Successfully published image docker-registry.discovery.wmnet/amd-pytorch21:2.1.2rocm5.7-1 `
[19:05:48] the new image is up and running :)
[19:05:57] (pytorch 2.1 + rocm 5.7)
[19:12:52] (CR) AikoChou: "Is your "users" param in locust.conf set to 2? https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-servic" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1025805 (https://phabricator.wikimedia.org/T361881) (owner: Ilias Sarantopoulos)
[19:23:01] elukey: nice!
[19:46:06] (CR) AikoChou: utils: slow function execution wrapper (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1024425 (https://phabricator.wikimedia.org/T362663) (owner: Ilias Sarantopoulos)
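
Note on the 11:06:01 reply in T364089: that message is cut off in this log, so the following is only a rough sketch of what such a MediaWiki Action API request might look like. It assumes the truncated text refers to the `oresscores` value of `rvprop` for `prop=revisions`; the helper name, the wiki host and the revision ID are placeholders and do not come from the log.

```python
from urllib.parse import urlencode


def build_oresscores_url(host: str, rev_id: int) -> str:
    """Hypothetical helper: build (but do not send) an Action API query URL."""
    params = {
        "action": "query",           # the query module mentioned in the reply
        "prop": "revisions",         # the `revisions` property
        "revids": rev_id,            # placeholder revision ID
        "rvprop": "ids|oresscores",  # assumption: scores are requested via rvprop
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


# Placeholder host and revision ID, for illustration only.
url = build_oresscores_url("en.wikipedia.org", 12345)
print(url)
```

The sketch only builds the URL, so it can be inspected before any client code (such as the Google Sheets script mentioned in the task) is changed to call it.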