[05:43:26] Good morning o/
[06:48:25] kserve 0.12.1 is out!
[06:51:42] Machine-Learning-Team: Update revertrisk to kserve 0.12.1 - https://phabricator.wikimedia.org/T363127 (isarantopoulos) NEW
[06:53:31] (PS1) Ilias Sarantopoulos: revertrisk: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023318 (https://phabricator.wikimedia.org/T363127)
[06:54:19] (PS2) Ilias Sarantopoulos: revertrisk: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023318 (https://phabricator.wikimedia.org/T363127)
[06:55:34] Machine-Learning-Team: Update revertrisk multilingual to kserve 0.12.1 - https://phabricator.wikimedia.org/T363129 (isarantopoulos) NEW
[06:55:46] Machine-Learning-Team: Update revertrisk wikidata to kserve 0.12.1 - https://phabricator.wikimedia.org/T363130 (isarantopoulos) NEW
[06:57:19] (PS3) Ilias Sarantopoulos: revertrisk: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023318 (https://phabricator.wikimedia.org/T363127)
[07:02:50] (PS1) Ilias Sarantopoulos: revertrisk-multilingual: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023319 (https://phabricator.wikimedia.org/T363129)
[07:02:50] (PS1) Ilias Sarantopoulos: revertrisk-wikidata: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023320 (https://phabricator.wikimedia.org/T363130)
[08:11:33] Morning!
[08:20:03] o/ Tobias!
[08:20:13] (CR) AikoChou: [C:+1] revertrisk-multilingual: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023319 (https://phabricator.wikimedia.org/T363129) (owner: Ilias Sarantopoulos)
[08:20:31] (CR) AikoChou: [C:+1] revertrisk: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023318 (https://phabricator.wikimedia.org/T363127) (owner: Ilias Sarantopoulos)
[08:21:28] (CR) AikoChou: [C:+1] "Thankssss!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023320 (https://phabricator.wikimedia.org/T363130) (owner: Ilias Sarantopoulos)
[08:22:18] good morning o/
[08:24:36] \o
[08:25:48] Hope you had a relaxing long weekend :)
[09:52:36] (Merged) jenkins-bot: revertrisk: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023318 (https://phabricator.wikimedia.org/T363127) (owner: Ilias Sarantopoulos)
[10:04:47] klausman: shall I create a new base image with pytorch using bullseye (or trixie) or is there some other way to do it (I mean without creating a patch in production-images etc)?
[10:12:53] New base image is probably easiest, yea
[10:24:59] * klausman lunch (and an errand)
[10:25:12] will do that then. I'll create it as a newer version instead of a totally new image
[11:52:58] isaranto: o/ are u working on revertrisk upgrade deployment? If not yet, I can work on it. I also need to update the image for batch model
[11:54:56] ok, feel free to do it. I just merged the 1 patch for revertrisk-language-agnostic. I'm going to merge the other 2 as well. ok?
[11:55:30] ok no problem!
[11:55:40] ack
[11:55:51] (PS2) Ilias Sarantopoulos: revertrisk-multilingual: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023319 (https://phabricator.wikimedia.org/T363129)
[12:02:34] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9735203 (isarantopoulos) Debian bookworm has a different version of the `libdrm-amdgpu1` package as we can see in the [[ https://packages.debian.org/search?searchon=names&keywords=libdrm-amd...
[12:03:56] made an attempt for a new image. lemme know if this is ok. https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1023414
[12:04:00] * isaranto afk lunch
[12:47:34] (CR) Ilias Sarantopoulos: [C:+2] revertrisk-multilingual: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023319 (https://phabricator.wikimedia.org/T363129) (owner: Ilias Sarantopoulos)
[12:50:23] hi folks!
[12:50:43] sorry I am starting a little later, some issues with the baby-management :D
[12:53:41] (Merged) jenkins-bot: revertrisk-multilingual: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023319 (https://phabricator.wikimedia.org/T363129) (owner: Ilias Sarantopoulos)
[12:56:44] hey Luca \o
[12:57:06] (PS2) Ilias Sarantopoulos: revertrisk-wikidata: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023320 (https://phabricator.wikimedia.org/T363130)
[12:58:16] o/
[12:58:57] isaranto: re https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1023414 - one thing that just came to mind is that this is the first time that we don't explicitly install anything from ROCm upstream on the image, we just rely on Pytorch's bundled .so libraries
[12:59:36] hey Luca!
[12:59:38] in theory a relatively modern linux kernel should have support for the GPU, but maybe the ROCm packages do some extra bits that we are missing?
[13:02:31] what did we previously install that we don't install now? I have the llm blubber image in mind which is more or less the same thing
[13:02:45] but I agree that we may be missing something with bookworm
[13:03:24] we don't install any deb pkg on those images
[13:03:37] the ROCm ones I mean
[13:03:57] in theory they are not needed, since most of those libraries are packaged in Pytorch
[13:04:25] and we don't install it since it would be another +10G
[13:04:36] ack
[13:05:06] you're right got it!
[13:09:24] Morning all
[13:09:52] o/
[13:11:03] isaranto: mmm the packages in theory shouldn't be needed
[13:11:21] morning Chris!
[13:11:30] my bet is on the rocm drivers
[13:12:47] I am wondering if we could simply create a pod with pytorch 2.2 on ml-staging, and run a simple entrypoint like Python trying to import torch and checking the GPU
[13:13:51] we could even think about adding a new image to production images, with a pytorch specific check
[13:17:34] u mean to test torch 2.2 using the amdpytorch image?
[13:18:31] I think we need to test that one and then another one with bookworm + pytorch + rocm5.4.2 to see if that would work. The latter would be to figure out if everything is ok when using bookworm
[13:18:58] tbh I'd expect that everything is fine with bookworm. Just trying to narrow down the issue to the drivers
[13:19:49] sure but at the moment we don't have an image to run to test if a GPU works etc.. The same will happen with the first MI210 that arrives
[13:20:09] we have an amd-gpu-test image IIRC, but it uses tensorflow and drivers etc..
[13:20:37] we could rework it to say use one of the pytorch base images, and run a simple python script that does XYZ
[13:20:48] so in the future we'll just need to run it to check how things work
[13:20:59] we could do the same manually creating a Pod instance in theory, should be fine as well
[13:38:27] I'm going to take a look and report back
[14:07:57] Machine-Learning-Team, ORES, Patch-For-Review: Create basic alerts for isvcs to catch outages - https://phabricator.wikimedia.org/T362661#9735741 (klausman)
[14:18:36] Machine-Learning-Team, Goal: 2024 Q4 Goal: Revert Risk models are supported by caching in production - https://phabricator.wikimedia.org/T362672#9735803 (calbon) Update: - Merged puppet machinery to allow network policies to be generated for assorted cluster. So we can automatically generated the networ...
[14:25:24] Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670#9735836 (calbon) - GPU order for the first GPU 2x chassis is close to complete. There are some supply issues with the chassis, so the question i...
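The GPU check floated between 13:12:47 and 13:20:59 (a pod whose entrypoint imports torch and checks the GPU) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the script the team ended up using: the function names `check_gpu` and `main` and the report keys are hypothetical. It relies on the fact that ROCm builds of PyTorch expose AMD GPUs through the `torch.cuda` namespace and set `torch.version.hip`.

```python
import importlib
import json


def check_gpu():
    """Return a small report on whether PyTorch can see and use a GPU.

    Hypothetical helper: the name and the report keys are illustrative only.
    """
    report = {"torch_imported": False, "gpu_available": False, "detail": ""}
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError) as exc:
        # OSError covers a bundled shared library that fails to load.
        report["detail"] = f"import failed: {exc}"
        return report

    report["torch_imported"] = True
    report["torch_version"] = str(torch.__version__)
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    report["hip_version"] = getattr(torch.version, "hip", None)

    # ROCm builds of PyTorch report AMD GPUs through the torch.cuda API.
    if not torch.cuda.is_available():
        report["detail"] = "torch imported but no GPU is visible"
        return report

    try:
        report["device_count"] = torch.cuda.device_count()
        report["device_name"] = torch.cuda.get_device_name(0)
        # Run a tiny computation on the device to confirm that it is usable,
        # not just listed.
        x = torch.ones((2, 2), device="cuda")
        report["matmul_sum"] = float((x @ x).sum().item())
        report["gpu_available"] = True
    except Exception as exc:
        report["detail"] = f"GPU visible but unusable: {exc}"
    return report


def main():
    """Print the report as JSON and return a shell-style exit code."""
    report = check_gpu()
    print(json.dumps(report, indent=2))
    return 0 if report["gpu_available"] else 1
```

Used as a container entrypoint, the file would end with `raise SystemExit(main())`, so the pod exits non-zero when the GPU is missing or unusable and the JSON report shows up in the pod logs.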
[14:41:39] Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services - https://phabricator.wikimedia.org/T362674#9735872 (klausman)
[14:41:44] Machine-Learning-Team, ORES, Patch-For-Review: Create basic alerts for isvcs to catch outages - https://phabricator.wikimedia.org/T362661#9735873 (klausman)
[14:46:15] Machine-Learning-Team, Patch-For-Review: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines - https://phabricator.wikimedia.org/T360428#9735880 (klausman)
[15:13:29] Machine-Learning-Team, Research: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9735994 (achou) > Thanks, that is what I am proposing as well. @achou, how feasible do you think this is from your side? It would...
[15:20:37] Machine-Learning-Team, Research: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9736068 (kostajh) >>! In T356102#9735994, @achou wrote: >> Thanks, that is what I am proposing as well. @achou, how feasible do y...
[15:31:03] As a heads up: I've just applied the two changes for Cassandra network policies in staging. It shouldn't affect any running services, but if something breaks, ping me
[15:43:32] isaranto: o/ https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1023460
[15:53:51] LGTM!
[15:53:57] (CR) Ilias Sarantopoulos: [C:+2] revertrisk-wikidata: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023320 (https://phabricator.wikimedia.org/T363130) (owner: Ilias Sarantopoulos)
[15:57:03] (Merged) jenkins-bot: revertrisk-wikidata: upgrade to 0.12.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023320 (https://phabricator.wikimedia.org/T363130) (owner: Ilias Sarantopoulos)
[16:08:25] Lift-Wing, Machine-Learning-Team, Patch-For-Review: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9736330 (isarantopoulos) At the moment we have tried/used things in the following matrix. Success/fail refers to whether the GPU has been successfully been used with py...
[16:09:29] I added the options to explore with the GPU according to our earlier discussion. Let me know if I missed something either here or on the task
[16:09:43] going afk folks, have a nice evening/rest of day
[16:10:04] \o
[16:15:04] bye Ilias! thanks for the summary
[16:39:19] heading out now as well o/
[16:48:25] Lift-Wing, Machine-Learning-Team, Patch-For-Review: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9736578 (elukey) Quick clarification - there are currently two places where we use ROCm-specific libs: 1) On every stat node and k8s node with a GPU, we deploy the Deb...
[16:49:42] \o
[16:51:16] Machine-Learning-Team: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191 (elukey) NEW
[16:51:35] isaranto: opened --^ to explain the ideas discussed during the meeting, lemme know tomorrow your thoughts
[16:57:18] Machine-Learning-Team: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191#9736667 (elukey) The only issue that I see from puppet is that `prometheus::node_amd_rocm` uses rocm smi to get info about what GPU to monitor.
[17:38:50] heading out o/
[17:47:57] night elukey
[18:25:28] Machine-Learning-Team: Getting unsupported lang error for some wiki for revertrisk-language-agnostic calls - https://phabricator.wikimedia.org/T363203 (prabhat) NEW
[18:28:20] Machine-Learning-Team: Unsupported lang error for some wiki for revertrisk-language-agnostic calls - https://phabricator.wikimedia.org/T363203#9737056 (prabhat)
[20:26:56] Machine-Learning-Team, Language-Team, Epic: Migrate Content Translation Recommendation API to Lift Wing - https://phabricator.wikimedia.org/T308164#9737644 (Isaac) hey all (not sure who exactly to tag but maybe I'll start with @kevinbazira just because I know you did a lot of good work on this) -- I'...