[07:13:55] hello folks!
[07:24:11] Mornin'
[07:28:28] Good morning!
[08:28:57] klausman: Janis gave me a super nice idea for the ROCm docker image size, namely just mount /opt/rocm-5.4.0 as read only mountpoint to the pod
[08:29:20] so far I see it working, just need to fix some errors
[08:29:24] it seems the most promising way
[08:29:41] ah, so keep the driver stuff on the baremetal machine and mount it into pods?
[08:32:29] yes exactly
[08:57:27] It's still a _lot_ of software to lug around, but agreed, storing it on the host instead of bloating N docker images is quite preferable
[09:06:14] we have to store it on the host anyway, so best to avoid copies etc..
[09:12:00] yep. plus, it's likely easier to figure out trimming bits there
[09:18:10] Machine-Learning-Team, serviceops: docker-pkg fails to upload big Docker images to the registry - https://phabricator.wikimedia.org/T335177 (elukey) @akosiaris thanks for the in depth answer! I figured it was something nginx-related but I didn't think to check the max upload size (TIL for the next time)....
[09:22:25] Machine-Learning-Team, Patch-For-Review: Review and test the AMD GPU kubernetes plugin - https://phabricator.wikimedia.org/T333009 (elukey) I opened T335177 since the amd-gpu-test image size got up to 14G (uncompressed), and nginx on the docker registry nodes has a limit of 2G (compressed) so it eventual...
[09:25:24] https://www.slideshare.net/theofpa/kubecon-2023-eu-kserve-the-state-and-future-of-cloudnative-model-serving
[09:25:52] slide 16 mentions what isaranto suggested (use multiple models in the same pod to share gpus)
[09:27:02] https://github.com/NVIDIA/FasterTransformer
[09:29:49] It's hilarious that for benchmarking that, you still need `bc` :D
[09:31:17] nice resource!
[09:31:34] I think with recent advancements model mesh will become more popular/stable
[09:31:41] I hope that they will also share the video
[09:31:49] but again nvidia is way ahead
[09:31:50] sigh
[09:46:13] klausman: do you have time today/tomorrow to check the last two codfw network maintenance tasks and send meeting invites for the team? (so we are aware etc..)
[09:52:44] Will do. I'm out this afternoon, but I should get to it before lunch
[09:53:58] super thanks!
[09:54:59] Only row C and D, right?
[09:56:17] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[09:56:40] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[09:57:57] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (klausman)
[09:58:32] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (klausman)
[09:58:52] (CR) Elukey: events: add code to generate predicted_classification events (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/907923 (https://phabricator.wikimedia.org/T331401) (owner: AikoChou)
[09:58:58] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[09:59:08] klausman: correct yes, Arzhel sent an email some days ago
[10:02:37] and done
[10:04:53] looks good thanks!
[10:28:07] * elukey lunch!
[14:56:33] Machine-Learning-Team, Add-Link, Growth-Team, Thai-Sites, User-notice: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142 (kevinbazira)
[14:57:30] Machine-Learning-Team, Add-Link, Growth-Team, Thai-Sites, User-notice: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142 (kevinbazira) @kostajh, we published datasets for all 17/19 models that passed the evaluation in this round.
[15:07:29] Machine-Learning-Team, MediaWiki-extensions-ORES: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (isarantopoulos) One thing that needs to be taken care of is the following: ORES models have some calculated thresholds which correspond to specific statisti...
[15:09:48] I updated the task about the ORES thresholds with what we discussed earlier --^
[15:10:43] +1
[15:37:46] Machine-Learning-Team, serviceops: docker-pkg fails to upload big Docker images to the registry - https://phabricator.wikimedia.org/T335177 (elukey) Open→Resolved a:elukey
[16:04:08] Just ran the Alex Net test on a DSE GPU \o/
[16:12:44] wehoo
[16:15:41] waaa \o/
[16:21:17] Machine-Learning-Team: Review and test the AMD GPU kubernetes plugin - https://phabricator.wikimedia.org/T333009 (elukey) Finally I was able to run the alexnet tensorflow test on a DSE GPU: ` TensorFlow: 2.11 Model: alexnet Dataset: imagenet (synthetic) Mode: training SingleSess: False Ba...
[16:21:27] (CR) Jforrester: [C: +2] Update moved class WikiMap [extensions/ORES] - https://gerrit.wikimedia.org/r/911818 (https://phabricator.wikimedia.org/T321681) (owner: Gerrit maintenance bot)
[16:27:17] (Merged) jenkins-bot: Update moved class WikiMap [extensions/ORES] - https://gerrit.wikimedia.org/r/911818 (https://phabricator.wikimedia.org/T321681) (owner: Gerrit maintenance bot)
[16:38:05] Machine-Learning-Team, Patch-For-Review: Review and test the AMD GPU kubernetes plugin - https://phabricator.wikimedia.org/T333009 (elukey) The task is done, we have successfully configured and run a job on a GPU on DSE! All the configs are also puppetized so we can apply the same to any Lift Wing node a...
[16:40:13] Machine-Learning-Team, Analytics-Radar, Data-Engineering-Icebox, Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (elukey) The last issue has been fixed in T333009: for k8s nodes we just allow `others` to read the devices. The new ROCm suite has been imported for...
[17:05:47] * elukey afk!
[19:28:49] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (BPirkle) The old image suggestions api ([[ https://gerrit.wikimedia.org/g/mediawiki/services/image-suggestion-api | mediawiki/servi...