[06:42:45] Good morning!
[07:55:33] o/ morning!
[07:56:29] Guten tag Aiko o/ [en: Good day, Aiko]
[07:56:38] :D
[07:58:31] wie geht's [en: how's it going]
[08:13:20] (CR) Ilias Sarantopoulos: "The image is 3GB bigger (largest layer is 14GB). gzipped image is 2.79GB so we are still ok!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031020 (https://phabricator.wikimedia.org/T362984) (owner: Ilias Sarantopoulos)
[08:21:42] Morning everyone!
[08:21:58] Morning Tobias!
[08:22:07] isaranto: es geht gut, soweit :) [en: doing fine so far]
[08:23:06] haben sie die GPT-4o video gesehen? [en: have you seen the GPT-4o video?]
[08:24:01] Nein, ich hatte gestern keine Zeit [en: No, I didn't have time yesterday]
[08:24:15] Ist es sehenswert? [en: Is it worth watching?]
[08:25:23] (I'm starting to need google translate :P)
[08:26:36] :D
[08:26:41] So is it worth watching?
[08:27:12] yes, at least the demo is!
[08:27:35] Alrighty, I'll set some time aside today :)
[08:29:13] this is an example -> https://www.youtube.com/watch?v=mzdvw_euKlk
[08:30:42] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317#9793440 (hashar)
[08:32:38] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317#9793457 (hashar) >>! In T338317#9623973, @dancy wrote: > We could configure buildkit gc rules for the Docker daemon: http...
[08:37:13] isaranto: that is powerful, but still very uncanny-valley :)
[08:38:21] definitely!
sparks numerous conversations and debates though
[08:40:10] (PS3) Ilias Sarantopoulos: llm: bump torch and rocm 6.0 versions (2.3.0-rocm6.0) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031020 (https://phabricator.wikimedia.org/T362984)
[08:45:08] whenever anyone has some time I'd like a review on the above
[08:45:16] giving this another swing
[08:51:58] (PS1) AikoChou: outlink: move test_transformer to unit test directory [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031393
[08:53:00] on it
[08:53:37] (CR) Klausman: [C:+1] llm: bump torch and rocm 6.0 versions (2.3.0-rocm6.0) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031020 (https://phabricator.wikimedia.org/T362984) (owner: Ilias Sarantopoulos)
[09:01:46] danke! [en: thanks!]
[09:01:57] * isaranto afk early lunch + errand
[09:02:06] (CR) Ilias Sarantopoulos: [C:+2] llm: bump torch and rocm 6.0 versions (2.3.0-rocm6.0) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031020 (https://phabricator.wikimedia.org/T362984) (owner: Ilias Sarantopoulos)
[09:02:51] (Merged) jenkins-bot: llm: bump torch and rocm 6.0 versions (2.3.0-rocm6.0) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031020 (https://phabricator.wikimedia.org/T362984) (owner: Ilias Sarantopoulos)
[10:25:45] I'm testing the new torch/rocm version by editing the isvc on ml-staging (without making a patch for deployment-charts I mean). If it works I'll create a patch ofc
[10:37:50] ack!
[10:59:35] * isaranto sighs
[10:59:37] no luck!
[11:01:51] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9793979 (isarantopoulos) Tested llm image for nllb-200 with pytorch 2.3.0 and rocm 6.0 and got the same errors: ` amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1) amdgpu_device_init...
[11:09:19] * klausman lunch and doc appt
[11:46:53] Machine-Learning-Team: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9794121 (achou) Thanks for sharing the use case! > Potentially called on all edit attempts by not-yet-logged-in users. One thing to note is tha...
[11:57:56] Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9794170 (kevinbazira) During the meeting between the Structured Content team and ML team, it was concluded that passing image objects is preferable to passing image...
[12:10:39] isaranto: aiko: thank you for attending the meeting with the Structured Content team. One of the next steps from the meeting was to align on the format. I have shared a summary of our discussion and added some questions so that we can align on the format: https://phabricator.wikimedia.org/T363506#9794170
[12:13:41] o/ kevinbazira thanks for the summary! In the meeting we concluded that format is something we can figure out later. For now I'd suggest we can work on an example that accepts a base64 encoded image
[12:15:14] I asked for the format so that it's on record in a ticket that the Structured Content team will share with us the format
[12:16:10] sure sure I'll look into using base64 encoded image
[12:18:54] yes you're right but they said they need to figure out how to get the image first etc
[12:21:23] Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9794241 (isarantopoulos) We concluded that we will figure out the format after the team figures out the spike (accessing the image and sending a thumbnail to Lift W...
[12:29:26] o/ hello folks
[12:29:35] isaranto: so not even rocm 6.x works right?
[12:30:19] o/ elukey . Nope!
no luck in the rocm 6.0 department
[12:31:36] okok so I believe it may not be ROCm then, or we are the first ones experiencing the issue, but it seems weird
[13:09:38] heyo Luca!
[13:54:44] Machine-Learning-Team: Airflow training pipeline - https://phabricator.wikimedia.org/T363554#9794724 (achou)
[14:06:41] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144#9794812 (Trizek-WMF)
[14:09:31] Machine-Learning-Team, Patch-For-Review: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines - https://phabricator.wikimedia.org/T360428#9794839 (klausman) All the machinery is now in place to make connections to Cassandra from isvcs on staging (in the experimental NS): ` m...
[14:12:10] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144#9794843 (Trizek-WMF) p:Medium→High
[14:18:34] Machine-Learning-Team, Goal: 2024 Q4: Users can "pip install liftwing" and access 20% of models - https://phabricator.wikimedia.org/T359140#9794875 (isarantopoulos) - The package is now available in test pypi and can be installed like this: ` pip install -i https://test.pypi.org/simple/ liftwing ` - re...
[14:20:09] Machine-Learning-Team, Goal: 2024 Q4 Goal: Revert Risk models are supported by caching in production - https://phabricator.wikimedia.org/T362672#9794882 (klausman) Update: - Connections from isvc namespaces on staging to the Cassandra machines now work, including TLS certs and SNI - Next step: have an a...
[14:21:53] Machine-Learning-Team, Structured-Data-Backlog: [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard - https://phabricator.wikimedia.org/T364551#9794890 (mfossati)
[14:23:09] Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9794910 (mfossati) >>! In T363506#9794241, @isarantopoulos wrote: > We concluded that we will figure out the format after the team figures out the spike (accessing...
[14:32:59] Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9794952 (isarantopoulos)
[14:41:03] Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9794969 (isarantopoulos) a:kevinbazira
[14:44:43] Machine-Learning-Team: Patch Location headers of HTTP redirects coming from the MW API in Lift Wing services - https://phabricator.wikimedia.org/T363725#9794982 (isarantopoulos)
[15:07:19] Machine-Learning-Team, Structured-Data-Backlog: Pass the maximum number of uploads to the logo detection service - https://phabricator.wikimedia.org/T363505#9795133 (kevinbazira) Hi @mfossati, the ML team believes it would be more appropriate for the LiftWing API to set a maximum limit rather than receiv...
[15:07:49] I've shared with Marco what we discussed in the meeting --^
[15:16:15] Machine-Learning-Team, Structured-Data-Backlog (Current Work): [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard - https://phabricator.wikimedia.org/T364551#9795165 (AUgolnikova-WMF)
[15:35:50] kevinbazira: great!
[15:35:55] thank you
[15:37:38] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9795271 (elukey) Janis from ServiceOps suggested that maybe seccomp or apparmor are playing a role in this. ` jayme@ml-staging2001:~$ aa-exec -p docker-default -- /usr/bin/python3 -c "imp...
[16:08:37] Machine-Learning-Team: Patch Location headers of HTTP redirects coming from the MW API in Lift Wing services - https://phabricator.wikimedia.org/T363725#9795482 (isarantopoulos) As discussed in the team meeting this task will be restricted to providing a solution for the `revertrisk-language-agnostic` that c...
[16:11:55] Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670#9795497 (isarantopoulos) Update: As part of the task {T362984} we have also experimented with different versions of pytorch (2.2.1, 2.3.0) and...
[16:18:53] logging off for the day folks, cu tomorrow :)
[16:19:03] o/
[16:19:40] \o
[16:32:47] sooo I tried to run manually a docker container on the ml-staging2001 node
[16:32:52] sudo docker run --rm -it --device /dev/dri/renderD128 --security-opt=no-new-privileges --entrypoint /bin/bash 8df6203550c2
[16:32:59] and os.access works..
[16:33:18] Hmmmm.
[16:33:32] So the question is what happens differently with the kserve container.
[16:34:01] the seccomp profile is added automatically by docker
[16:37:09] tried also with --security-opt apparmor=docker-default but nope
[16:38:52] (PS1) AikoChou: revertrisk: add logic to accept revision data as input [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031512
[16:40:57] (PS2) AikoChou: revertrisk: add logic to accept revision data as input [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031512
[16:49:13] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 14 (Growth Team)), User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144#9796173 (DMburugu)
[16:54:45] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9796219 (elukey) Following advice from Janis, I tried on ml-staging2001: ` sudo docker run --cap-drop all --rm -it --device /dev/dri/renderD128 --security-opt no-new-privileges --entrypo...
[16:55:16] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9796236 (elukey) Even better: ` sudo docker run --rm -it --device /dev/dri/renderD128 --security-opt=no-new-privileges --cap-drop ALL --entrypoint /usr/bin/python3 8df6203550c2 -c "import o...
[16:56:38] going afk, brain fried again :D
[16:56:41] have a nice rest of the day folks
[16:57:19] \o have a nice evening
[17:06:08] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9796275 (JMeybohm) Two more data points that don't help at all: ` jayme@ml-staging2001:~$ sudo docker exec -it --user 0 k8s_kserve-container_nllb-200-gpu-predictor-00007-deployment-678689d65...
[18:11:18] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9796586 (JMeybohm) You got me @elukey :-p For reasons I did not try to understand yet, the mknod cgroup permission is the culprit. Without it, the access() call fails: ` jayme@ml-staging200...