[05:01:34] 06Machine-Learning-Team, 10EditCheck, 10VisualEditor, 10Editing-team (Tracking), 07Epic: Expand language coverage for Tone Check - https://phabricator.wikimedia.org/T394448#11188004 (10ppelberg) [06:56:41] 06Machine-Learning-Team, 07Essential-Work: Incorporate notebook into Tone-Check data generation ml-pipeline - https://phabricator.wikimedia.org/T404722#11188124 (10kevinbazira) As shown in T404722#11185284, the job running without a dev-limit finally succeeded after >9.5hrs. Below are the results from the gene... [07:00:14] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review - https://phabricator.wikimedia.org/T392833#11188134 (10BWojtowicz-WMF) @Dbrant > When you say you'll "add" a page_id parameter, does this mean you'll keep the page_title paramete... [07:02:45] Good day o/ [07:13:01] good morning [07:13:17] hey folks! The SRE team decided to increase nginx's tmpfs size to 12GB (was 4GB) on the docker registry nodes, so there should be fewer problems with ML images from now on :) Please do not take this as a free pass to upload all sorts of huge images to the registry; we'll still need to review the layers every time and trim them as much as possible etc.. [07:13:36] \o/ [07:15:54] looking at the latest rocm/vllm images there isn't a big increase in the layer sizes, so we should be ok. https://hub.docker.com/layers/rocm/vllm/rocm6.4.1_vllm_0.10.0_20250812/images/sha256-4c277ad39af3a8c9feac9b30bf78d439c74d9b4728e788a419d3f1d0c30cacaa [07:16:08] thanks for letting us know Luca! [07:16:55] kevinbazira: --^ this means that we likely won't have to break big layers down into multiple ones unless absolutely needed, which makes porting the images much easier as well [07:17:21] good morning! 
[07:19:39] 06Machine-Learning-Team, 07Essential-Work: Incorporate notebook into Tone-Check data generation ml-pipeline - https://phabricator.wikimedia.org/T404722#11188170 (10isarantopoulos) @kevinbazira Alternatively we could stick with parquet which is more efficient in terms of storage, so we could adapt the data load... [07:21:05] isaranto: elukey: o/ that is very good news for the vLLM image :) [07:28:13] isaranto, kevinbazira - please note one thing: the registry nodes don't have a huge amount of bandwidth available, so super big layers will have a cost when ml-serve pulls them for a deployment (even if we cache with Dragonfly etc..). If somebody scap-deploys MediaWiki at the same time, Wikikube will have to pull other images too, adding strain on the registry. This may end up in a cascade of timeouts and sorrow, which is why I am warning about the need to keep inspecting layers and trimming them where needed [07:29:34] ack! we'll continue working in the same way then [07:30:59] yep, we'll definitely continue trimming the layers. [07:31:22] <3 [08:38:15] 06Machine-Learning-Team, 07Essential-Work: Incorporate notebook into Tone-Check data generation ml-pipeline - https://phabricator.wikimedia.org/T404722#11188376 (10achou) Hi! I want to add more information regarding the data used for training. I was checking the [[ https://drive.google.com/drive/folders/1J9pSF... [10:22:23] bartosz: Hey mate, did you use the makefile to test the model (`make articletopic-outlink`)? It seems to have an issue with the `fasttext` import. [10:22:30] https://www.irccloud.com/pastebin/3hXtTwFm/ [10:25:35] (03CR) 10Gkyziridis: outlink-topic-model: Merge transformer and predictor pods. 
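The layer-review routine discussed above (inspect an image's layers, find the big ones, trim where needed) can be sketched as a short script. The manifest JSON below is made up for illustration; in practice the same numbers come from the registry manifest (e.g. `docker manifest inspect` output) or from `docker history`:

```python
# Sketch: rank image layers by size from an OCI/Docker image manifest,
# to spot trimming candidates. The manifest below is an invented example.
import json

manifest = json.loads("""
{
  "schemaVersion": 2,
  "layers": [
    {"digest": "sha256:aaa...", "size": 52428800},
    {"digest": "sha256:bbb...", "size": 8589934592},
    {"digest": "sha256:ccc...", "size": 157286400}
  ]
}
""")

def layers_by_size(manifest: dict) -> list[tuple[str, float]]:
    """Return (digest, size in GiB) pairs, largest first."""
    layers = manifest.get("layers", [])
    return sorted(
        ((layer["digest"], layer["size"] / 2**30) for layer in layers),
        key=lambda pair: pair[1],
        reverse=True,
    )

for digest, gib in layers_by_size(manifest):
    print(f"{gib:8.2f} GiB  {digest}")
```

Anything that shows up near the top (here the hypothetical 8 GiB `sha256:bbb...` layer) is what the registry nodes end up serving on every cold pull, so those are the layers worth splitting or shrinking first.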
(032 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1187739 (https://phabricator.wikimedia.org/T404294) (owner: 10Bartosz Wójtowicz) [10:40:20] * isaranto afk - lunch [11:41:02] bartosz: my bad, there was an older `my_venv` folder so it did not install the requirements correctly [11:41:36] (03CR) 10Gkyziridis: outlink-topic-model: Merge transformer and predictor pods. (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1187739 (https://phabricator.wikimedia.org/T404294) (owner: 10Bartosz Wójtowicz) [11:44:00] 06Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11189039 (10klausman) >>! In T403697#11185767, @elukey wrote: > @klausman ml-serve1012 is up and running with 6.16 from backports, and nvtop seems to work without horrors in the dmesg. Also pl... [11:50:47] 06Machine-Learning-Team, 07Essential-Work, 13Patch-For-Review: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs - https://phabricator.wikimedia.org/T398600#11189077 (10klausman) a:03klausman [11:51:43] georgekyz: thanks a lot for the review and great catch with the missing async definition! I also ran into the same issue with `my_venv` existing and the new requirements not being installed :D [11:52:33] bartosz: Yeah, you need to delete it first, otherwise the makefile does not reinstall the requirements. [11:53:31] I am not sure about the comment on async, but I think we should make it async as well, following the rest of the models [11:54:53] Yeah, I think we're supposed to run `make clean` to make sure the previous setup is properly cleaned up [11:55:15] 👍 [11:55:25] 06Machine-Learning-Team, 07Essential-Work: Incorporate notebook into Tone-Check data generation ml-pipeline - https://phabricator.wikimedia.org/T404722#11189097 (10kevinbazira) >>! 
In T404722#11188170, @isarantopoulos wrote: > @kevinbazira Alternatively we could stick with parquet which is more efficient in te... [11:55:40] I am about to submit the new "changes in admin-ng" alert, so there should be some noise from that in the next 30-60 minutes. I'll take care of that. [11:55:56] georgekyz: And the lack of async is very interesting, I would have thought we would have somehow caught it, but it seems that KServe works just as well without the async 🤔 [11:56:40] (03CR) 10Bartosz Wójtowicz: outlink-topic-model: Merge transformer and predictor pods. (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1187739 (https://phabricator.wikimedia.org/T404294) (owner: 10Bartosz Wójtowicz) [11:56:48] bartosz: "KServe’s model interface allows sync postprocess even when preprocess/predict are async." [11:58:34] to be honest... since we do not `await` anything inside it, we could try to test it as synchronous. I am reading that async slightly increases overhead, so if we do not really need it we can keep it synchronous. We can experiment with that and then change it in the rest of the model servers [12:05:04] georgekyz: very interesting! can you share the source you're reading? We could proceed with the patch as is, since it's not changing any existing async/sync definitions, and add a task to experiment with changing postprocess to sync to see the potential speedups in other models [12:25:58] since postprocess only formats the response I think it is totally fine as is. the main benefit of async is when we're working with I/O-bound operations in a function (network, database, etc.) [13:10:50] 06Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11189344 (10elukey) >>! In T403697#11189039, @klausman wrote: >>>! In T403697#11185767, @elukey wrote: >> @klausman ml-serve1012 is up and running with 6.16 from backports, and nvtop seems to... 
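The sync-vs-async postprocess point above can be illustrated without the real kserve package: an async-aware dispatcher only awaits a handler when the handler actually returns a coroutine, which is why a plain sync postprocess can sit next to async preprocess/predict. A minimal stand-in for the pattern (class and function names are invented, not KServe internals):

```python
# Sketch of a dispatcher that accepts both sync and async handlers.
# Not real KServe code; just the pattern discussed in the review.
import asyncio
import inspect

class ToyModel:
    async def preprocess(self, payload: dict) -> dict:
        # Pretend this awaits an upstream call (I/O-bound, so async pays off).
        await asyncio.sleep(0)
        return {"features": payload["text"].lower()}

    async def predict(self, inputs: dict) -> dict:
        await asyncio.sleep(0)
        return {"score": len(inputs["features"]) / 100}

    def postprocess(self, outputs: dict) -> dict:
        # Pure CPU-side response formatting: nothing to await, so a plain
        # sync method avoids the (small) coroutine overhead.
        return {"prediction": round(outputs["score"], 2)}

async def call_handler(handler, arg):
    """Await the handler's result only if it is actually awaitable."""
    result = handler(arg)
    if inspect.isawaitable(result):
        result = await result
    return result

async def serve(model: ToyModel, payload: dict) -> dict:
    out = await call_handler(model.preprocess, payload)
    out = await call_handler(model.predict, out)
    return await call_handler(model.postprocess, out)

result = asyncio.run(serve(ToyModel(), {"text": "Hello Lift Wing"}))
print(result)
```

This also matches the point made at [12:25:58]: marking `postprocess` as `async` would not make it faster, it would only add a coroutine allocation per request, since the body never awaits anything.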
[13:16:38] 06Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11189365 (10elukey) With the new settings, on ml-serve1012: ` elukey@ml-serve1012:~$ sudo /opt/rocm-6.4.3/bin/amd-smi set --memory-partition NPS4 ****** WARNING ******... [13:17:41] klausman: good news, with the new settings the GPU is partitionable and nothing explodes [13:18:03] nice! [13:18:10] ml-serve1012 has now 64 GPUS of ~24G of VRAM each [13:18:19] does nvtop work, too? [13:18:32] yep, it shows the 8 GPUs "only" [13:18:38] that I think makes sense [13:19:08] Yeah, I guess it's hw-focused and the partitioning might even be visible thorugh the APIs it uses [13:19:16] might +not [13:20:06] IIUC though when you reboot the partitioning is not preserved [13:20:15] that is a big bummer [13:20:53] also I am wondering if other configs are available, to have say 32 GPUs with 48G of vram etc.. [13:21:21] as fro the reboot, at worst we could make a one-shot systemd service that we deploy to specific machines [13:21:24] anyway, I suspect that using 6.16 from side on trixie was a bit extreme [13:25:12] The bleeding edge. And here I am running my GPU at home on forky :D [13:28:05] the "bad" news is that we cannot use bookworm's k8s 1.23 packages for ml-serve1012/1013 [14:03:41] 06Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11189639 (10elukey) Next steps: 1) IIUC the GPU can work either in SPX mode (single partition for all cores and memory) or in NPS4/CPX mode (8 partitions of GPU compute e memory, 24GB of VRAM... [14:15:07] 06Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11189678 (10elukey) >>! In T403697#11189639, @elukey wrote: > Next steps: > > 1) IIUC the GPU can work either in SPX mode (single partition for all cores and memory) or in NPS4/CPX mode (8 pa... [15:19:31] this is great news! 
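The one-shot systemd service floated at [13:21:21] for re-applying the partition mode after a reboot could look roughly like this. Only the amd-smi invocation is taken from the log; the unit name, ordering, and install target are assumptions to be checked against how these hosts are puppetized:

```ini
# /etc/systemd/system/amd-gpu-partition.service (hypothetical unit name)
# Re-applies the NPS4 memory-partition mode at boot, since amd-smi
# partitioning is not preserved across reboots.
[Unit]
Description=Set AMD MI300X memory partitioning to NPS4
# Run before the GPU device plugin / kubelet pick up the devices.
Before=kubelet.service

[Service]
Type=oneshot
ExecStart=/opt/rocm-6.4.3/bin/amd-smi set --memory-partition NPS4
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

`Type=oneshot` with `RemainAfterExit=yes` makes the unit count as "active" after the single ExecStart succeeds, so ordering dependencies like the assumed `Before=kubelet.service` behave sensibly.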
[15:42:15] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review - https://phabricator.wikimedia.org/T392833#11190327 (10Dbrant) >>! In T392833#11188134, @BWojtowicz-WMF wrote: > could we agree on using the `page_id` parameter for the requests do... [16:06:50] 06Machine-Learning-Team, 06Infrastructure-Foundations, 06serviceops: Migrate the ownership of Docker images in production-images repo to mailing lists - https://phabricator.wikimedia.org/T373526#11190507 (10elukey) 05Open→03Resolved [16:54:40] 06Machine-Learning-Team, 06Data-Persistence, 10Data-Persistence-Design-Review, 06Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11190742 (10achou) Here's my proposed schema: `sql CREATE TABLE table ( wiki... [17:11:15] 06Machine-Learning-Team, 06Data-Persistence, 10Data-Persistence-Design-Review, 06Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11190788 (10Ottomata) Nice! Very quick thoughts: - `wiki` - I prefer wiki_id.... [19:52:20] 10Lift-Wing, 06Machine-Learning-Team, 06Wikimedia Enterprise, 10Wikimedia Enterprise - Content Integrity: *INCOMPLETE* Request to host on Lift Wing - https://phabricator.wikimedia.org/T404911 (10FNavas-foundation) 03NEW [20:20:52] 10Lift-Wing, 06Machine-Learning-Team, 06Wikimedia Enterprise, 10Wikimedia Enterprise - Content Integrity: *INCOMPLETE* Request to host on Lift Wing - https://phabricator.wikimedia.org/T404911#11191526 (10FNavas-foundation) [20:23:21] 10Lift-Wing, 06Machine-Learning-Team, 06Wikimedia Enterprise, 10Wikimedia Enterprise - Content Integrity: *INCOMPLETE* Request to host on Lift Wing - https://phabricator.wikimedia.org/T404911#11191537 (10FNavas-foundation)