[04:58:40] good morning folks
[05:37:32] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Airflow training pipeline for Tone check model - https://phabricator.wikimedia.org/T398970#11305185 (10kevinbazira) * Ran tone-check training job locally with model-ready training data to determine memory usage as there was an 8GB limit in wmf airflow that ca...
[06:56:22] good morning
[07:08:16] 06Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11305262 (10elukey) Updates: * The two hosts, `ml-serve101[2,3]`, now run on Debian Trixie and the GPUs seem recognized and usable. * They currently run amd-smi from ROCm 7.0.2, and we ar...
[07:19:16] hello!
[07:31:41] 06Machine-Learning-Team, 10Wikimedia-GitHub: Add Dawid Pogorzelski to WMF GitHub organization - https://phabricator.wikimedia.org/T407839#11305288 (10DPogorzelski-WMF) Invite accepted, 2FA has always been on :)
[07:39:46] mornin
[07:41:46] 06Machine-Learning-Team, 10Wikimedia-GitHub: Add Dawid Pogorzelski to WMF GitHub organization - https://phabricator.wikimedia.org/T407839#11305295 (10isarantopoulos) 05Open→03Resolved a:03isarantopoulos Thanks Sam! Resolving this then.
[08:01:54] 06Machine-Learning-Team, 06Data-Platform-SRE (2025.10.17 - 2025.11.07): Investigate Label functionality of AMD GPU device plugin on k8s - https://phabricator.wikimedia.org/T373806#11305323 (10Gehel)
[08:07:24] shoutout to elukey for --^ Great work, this will help a ton, especially with partitioning
[08:09:44] thanks!
[08:11:51] (03PS22) 10Ozge: outlink-topic-model: Introduce caching mechanism.
[machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1176448 (https://phabricator.wikimedia.org/T356256) (owner: 10Bartosz Wójtowicz)
[09:25:34] 06Machine-Learning-Team, 13Patch-For-Review: Revertrisk multilingual fails locally when run with docker compose - https://phabricator.wikimedia.org/T408068#11305547 (10BWojtowicz-WMF) Thank you for helping and sharing all the logs! I've pruned docker on my Mac machine, but I'm still having trouble reproducing...
[09:55:13] (03CR) 10Ozge: [C:03+1] "Looks awesome! +1 with some minor suggestions." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1176448 (https://phabricator.wikimedia.org/T356256) (owner: 10Bartosz Wójtowicz)
[09:58:24] 06Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11305618 (10elukey) Tried to add taints via https://gerrit.wikimedia.org/r/1198470 but IIRC this doesn't work after the kubelet has been registered to the k8s api. I executed the following man...
[09:58:54] dpogorzelski, klausman --^ as FYI, I manually tainted ml-serve1012 (running the new mi300x gpus)
[09:59:22] so in theory the node can now be uncordoned, and only pods with those tolerations will be scheduled
[09:59:32] that should be handy for the first tests
[10:01:13] isaranto: not sure what kind of plans you and the team have for the nodes, but we can definitely start testing them
[10:05:10] sounds awesome! We can definitely test them with models that are larger in size, but we do need a pytorch/vllm image to be able to do any kind of benchmarking that would make sense
[10:05:32] klausman: any news with https://phabricator.wikimedia.org/T394778?
[10:06:15] at this point we just need one image to be able to test with, and then we can see what the best process for updating and maintaining these images is going to be
[10:08:34] is there a way to test the image construction process locally?
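The taint/toleration setup described above usually looks something like this on the k8s side; a minimal sketch, where the taint key and value are invented for illustration and are not necessarily the ones applied to ml-serve1012:

```yaml
# Taint would be added roughly like (command sketch):
#   kubectl taint nodes ml-serve1012 example.com/gpu=mi300x:NoSchedule
# Pod spec fragment: only pods carrying a matching toleration can be
# scheduled onto the tainted (and uncordoned) node.
tolerations:
  - key: example.com/gpu   # key/value assumed for illustration
    operator: Equal
    value: mi300x
    effect: NoSchedule
```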
[10:09:03] would like to start fiddling with the pipeline and get accustomed to the steps
[10:13:48] dpogorzelski: there are two main ways to do it - 1) blubber, a yaml config added to a repository to build the related docker image. If you have checked out the inference-services repo, there are plenty of configs in there. The idea is to have a standard tool that takes care of the scaffolding/best-security-practices/etc.. for you
[10:14:55] 2) there is a repo called "production-images", which is more SRE-controlled, in which we have more "freeform" docker images. Increased flexibility etc.. for use cases where blubber may be limiting (control plane images, special controllers, etc..)
[10:15:10] https://gerrit.wikimedia.org/r/admin/repos/operations/docker-images/production-images,general
[10:15:37] the repo is built on the build2xxx hosts manually via a tool called docker-pkg, which we have built in-house (available via pypi)
[10:15:58] it is like packaging docker images with debian-like interfaces (changelogs, control, etc..)
[10:16:34] most of the ML images are built via blubber, but we have some common base images that we build via production-images/docker-pkg
[10:16:45] like the ones for torch/amd
[10:16:47] understood
[10:16:51] can i execute the build steps on my laptop as well?
[10:17:01] to have a quick local dev iteration process
[10:17:03] you can run docker-pkg yes
[10:17:24] I usually just create a venv and use it via pypi
[10:18:42] context about https://phabricator.wikimedia.org/T394778 - the ml team needs to build huge images and test them over gpus, something that the SRE build nodes cannot support. So we are thinking about having one of the ml-lab hosts become a ml-build node, with all the scaffolding for docker etc..
[10:19:24] how would i attempt to build an ML image locally for example?
[10:19:52] i imagine production-images has only base images to construct the ml images, or?
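For the blubber route mentioned above, the config is a small YAML file living in the service repo. The sketch below is written from memory of Blubber's v4 format; the base image name, package list, and entrypoint are all made up, so it should be checked against the real configs in inference-services rather than taken as a working example:

```yaml
# .pipeline/blubber.yaml -- hypothetical sketch, not a real LiftWing config
version: v4
base: docker-registry.wikimedia.org/bullseye   # base image name assumed
apt:
  packages: [build-essential]                  # package list assumed
variants:
  production:
    copies: [local]                            # copy checked-out source into the image
    entrypoint: [python3, model_server.py]     # entrypoint invented for illustration
```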
[10:20:05] yep yep
[10:20:17] docker-pkg -c config.yaml build images/ --select '*batman*'
[10:20:24] within the repo should work
[10:21:03] now to complete the picture, because there is another important bit - we use docker distribution 2.8 as our internal private registry stack, backed by openstack swift for the binary blobs
[10:21:50] it sadly supports only docker layer sizes of max 5GB compressed (we have a smaller limit set in nginx, around 4GB and something)
[10:22:39] now 5GB compressed is a lot, but with vllm for example we had a very hard time keeping it under the limits. Kevin did a great job in https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1146891
[10:22:52] but in the future there may be other issues
[10:23:26] the SRE team is experimenting with ceph/s3 as a backend for docker distribution, so we may get around the issue, but nothing is available atm
[10:25:26] kk
[10:27:15] https://www.irccloud.com/pastebin/JOkhoQfC/
[10:27:30] what am I doing wrong here ? :)
[10:27:50] (i'm on a mac btw)
[10:28:16] and your PWD is the production-images repo right?
[10:28:23] correct
[10:28:38] installed docker-pkg and setuptools in a venv
[10:29:35] any chance that you could use py312 or py311 to test? I am wondering if this was ever tested for 3.14
[10:29:50] sure, i'll grab some food and try in a moment
[10:29:58] yeah me too, ttyl!
[11:31:29] so with 3.12 it fails differently:
[11:31:29] File "/Users/dawid/.pyenv/versions/3.12.12/lib/python3.12/site-packages/urllib3/connectionpool.py", line 716, in urlopen
[11:31:29] httplib_response = self._make_request
[11:31:29] FileNotFoundError: [Errno 2] No such file or directory
[11:31:45] how is "batman" matched?
[11:32:06] and where is this request going?
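Re "how is 'batman' matched?": the `--select` argument appears to take a shell-style glob, so a bare name without wildcards only matches an image called exactly that. A minimal illustration of glob semantics using Python's `fnmatch` (the image names below are invented, and docker-pkg's actual matching may differ in details):

```python
from fnmatch import fnmatch

def select(pattern, names):
    """Return the names matching a shell-style glob pattern."""
    return [n for n in names if fnmatch(n, pattern)]

images = ["amd-vllm", "vllm-base", "python3-build"]
select("batman", images)    # [] -> nothing gets built
select("*vllm*", images)    # ['amd-vllm', 'vllm-base']
```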
:P
[11:32:34] aha wait
[11:32:41] that is a call to the local docker daemon
[11:32:43] i guess
[11:33:45] fixing
[11:35:36] as a side note, wonder if building would work on https://github.com/apple/container
[11:35:49] but i guess not out of the box
[11:38:58] ok looks much better now
[11:39:26] `== Build done! ==`
[11:39:34] so what did I just build?
[11:39:39] it was pretty fast
[11:51:53] so "batman" should pattern-match to the name of a valid docker image :D
[11:52:32] you can try to cherry-pick https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1146891 and use "vllm" instead
[11:53:29] could I just run docker build using that template?
[11:54:51] i know this wouldn't execute other steps but would that be sufficient to just build the container?
[11:56:20] aha probably not, i do see templating in there
[11:56:24] yeah exactly
[12:17:34] it seems that all templates under images/ are processed but nothing is actually built, based on the build log info, hmmm
[12:18:37] i did pick that change into a local branch of mine
[12:19:59] there is a log file that gets created, some info could be there
[12:20:04] aha matching a bit wider via *vllm* seems to yield better results
[12:20:31] https://www.irccloud.com/pastebin/7Ptwwau4/
[12:26:08] perfect yes, forgot the * sorry!
[12:29:01] so yeah there are probably some upgrades to do to support more recent python versions
[12:29:36] i will try again afterwards with 3.14, chances are there was a problem on my end
[12:31:02] not sure, it seemed to be a horror with argparse :D
[12:31:29] is it possible to print the build log to stdout while building?
[12:31:52] i guess i can just tail the log file :P
[12:42:28] for some reason, at least in my case, the build process seems to take quite some time to fetch all the deb pkgs
[12:43:10] how is it usually on build machines?
[12:57:15] are there arm based base images as well?
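As a footnote to the layer-size discussion earlier (max 5GB compressed per layer in docker distribution 2.8, with a lower nginx cap of roughly 4GB), a pre-push check could look something like the sketch below; the limits are taken from the conversation above, while the helper name and the example layer names are invented:

```python
GiB = 1024 ** 3
REGISTRY_LAYER_LIMIT = 5 * GiB  # docker distribution 2.8 cap (compressed), per the chat
NGINX_LIMIT = 4 * GiB           # approximate front-proxy cap, per the chat

def oversized_layers(layers, limit=NGINX_LIMIT):
    """layers: mapping of layer name -> compressed size in bytes.
    Returns the names of layers exceeding the given limit."""
    return sorted(name for name, size in layers.items() if size > limit)

# e.g. a 6 GiB ROCm layer would be rejected, a 300 MiB base layer is fine:
oversized_layers({"base-os": 300 * 1024 ** 2, "rocm-libs": 6 * GiB})  # ['rocm-libs']
```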
[13:00:18] nope, we have an experimental arm build node but the project hasn't started yet
[15:02:42] 06Machine-Learning-Team, 13Patch-For-Review: Export retrained Tone-check model to an S3 bucket - https://phabricator.wikimedia.org/T406217#11306562 (10gkyziridis) ===Update=== After some discussions with people from the DE team, I am pasting here some ideas and good practices which answer the above comments....
[15:19:45] 06Machine-Learning-Team, 06Product Safety and Integrity, 06Research, 10Temporary accounts: Implement support for temporary accounts in revertrisk models - https://phabricator.wikimedia.org/T376116#11306714 (10OKryva-WMF)
[15:20:09] 06Machine-Learning-Team, 10ORES, 06Product Safety and Integrity, 10Temporary accounts: RecentChanges with "Very Likely bad faith" ORES filter don't show Temporary accounts' edits - https://phabricator.wikimedia.org/T398066#11306718 (10OKryva-WMF)
[18:39:35] 06Machine-Learning-Team, 10Recommendation-API, 06serviceops-radar: Caching service request for recommendation api - https://phabricator.wikimedia.org/T381438#11307867 (10SBisson)
[18:39:58] 06Machine-Learning-Team, 06Data-Persistence, 10Data-Persistence-Design-Review, 06Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11307866 (10Eevans) I've created a draft merge-request here: https://gitlab.wiki...
[18:46:51] 06Machine-Learning-Team, 06LPL Hypothesis, 10Recommendation-API: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854#11307898 (10SBisson) 05Open→03Resolved a:03SBisson I don't think we can do anything about it at this point.