[06:47:38] hello!
[07:06:15] Hello, good morning
[07:13:02] good morning!
[07:42:50] hey folks
[07:51:24] georgekyz: o/ I am looking at docker + ores extension at the moment (finally :D)
[07:51:53] I will update the guide once I have it working
[07:58:48] hey folks!
[07:59:22] During the next deployments you may notice some changes to the python-webapp chart related to mesh etc.. (mostly non kserve/isvc services, like ores-legacy etc..)
[07:59:29] feel free to deploy
[07:59:48] (changes tested etc.. we are rolling them out slowly)
[08:00:10] \o ty! we will deploy in staging to test it so we'll let you know if we face any issues
[08:02:07] yeah but no hurry
[08:02:19] Machine-Learning-Team, Language and Product Localization: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10825154 (KartikMistry) @kevinbazira @elukey Thanks a lot for help and all work on this!
[08:19:52] isaranto: Thanks Ilias for your time on this, let me know for any updates or maybe to catch up in a call.
[08:20:30] 🙌
[08:20:43] Regarding edit-check, I think there is an issue with the version of torch and the rocm drivers. Here is an update: https://phabricator.wikimedia.org/T393154#10823214
[08:21:47] interesting!
[09:08:14] georgekyz: nice finding! not sure if you saw in the chat during the team meeting yesterday, it seems peacock model was trained using torch 2.2.0+rocm5.6 (as shown in the notebook: https://gitlab.wikimedia.org/repos/research/llm_evaluation/-/blob/ait/eval-datasets/notebooks/baseline-exp/binary_classification_lm.ipynb)
[09:09:55] so there is something different between torch 2.5+rocm and previous versions
[09:20:35] shall we deploy it with an image based on torch 2.3.0 then?
Just to see if these issues go away and then we can decide if we need the gpu in prod or not
[09:28:19] aiko: isaranto : Using the `amd-pytorch23:2.3.0rocm6.0-3-20250511` image I got the correct results but also this warning: `/home/venv/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py:440: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:505.)
[09:28:19] attn_output = torch.nn.functional.scaled_dot_product_attention(`
[09:28:38] https://www.irccloud.com/pastebin/ptvpUOVn/
[09:31:45] I think the best idea is to create an image using `torch == 2.4.1+rocm6.1` which is the one that I tested initially on ml-lab1: https://phabricator.wikimedia.org/T393154#10786444 .
[09:31:45] This combination of torch and rocm is the only one that we do not have as an image in the docker registry, but we have a ready environment in ml-lab1 using the path: "/srv/pytorch-rocm/venv/lib/python3.11/site-packages/"
[09:40:46] tbh that warning doesn't give much info. so not great not terrible .. :)
[09:41:49] using torch 2.4.1 makes sense since we know it works properly. An alternative and more future proof option would also be to retrain the model using torch 2.5.1 or newer
[09:43:44] just deployed ores-legacy to staging with the changes in the python-webapp chart that Luca mentioned earlier -- all good, going to do the same with prod
[09:44:10] Niiiice thnx Ilias!
[09:44:51] Alright should we build the edit-check based on `amd-pytorch23:2.3.0rocm6.0-3-20250511` and ship it for staging/prod ?
[09:45:47] Or create an image for `torch == 2.4.1+rocm6.1` and then use that one? (the former is faster, the latter needs more steps)
[09:46:09] georgekyz: +1 for trying amd-pytorch23:2.3.0rocm6.0-3-20250511. It seems like the quicker option to validate if the bug goes away and we can reevaluate next steps
[09:46:25] alright perfect
[09:46:38] need a ticket for that one?
Or just update the current ticket ?
[09:47:28] I'd use the same one -> https://phabricator.wikimedia.org/T393154
[09:48:28] (PS1) Bartosz Wójtowicz: inference-services: Test if CI works for outlink. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1146533
[09:54:33] (Abandoned) Bartosz Wójtowicz: inference-services: Test if CI works for outlink. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1146533 (owner: Bartosz Wójtowicz)
[09:56:33] (PS1) Gkyziridis: edit-check: Use older based image torch2.3 + rocm6.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1146536 (https://phabricator.wikimedia.org/T393154)
[10:01:32] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10825486 (OKarakaya-WMF) # Tasks for Investigation - Current plan (will be updated as we figure out more or have more items/results) - Pick a subs...
[10:01:55] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10825488 (OKarakaya-WMF) # Current Limitations (will be updated as we figure out more.) - As I understand we can’t recommend articles to link whic...
[10:06:05] Machine-Learning-Team, Documentation: [Fix]: Documentation for ORES and MediaWiki Docker - https://phabricator.wikimedia.org/T393876#10825508 (isarantopoulos) @gkyziridis in the setup that you share above it seems that the database name is configured as my_wiki and then this is passed in the url configur...
[10:06:24] Folks patch review for edit-check ready: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1146536 .
Review when you have time please
[10:06:54] (CR) AikoChou: [C:+1] edit-check: Use older based image torch2.3 + rocm6.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1146536 (https://phabricator.wikimedia.org/T393154) (owner: Gkyziridis)
[10:07:51] (CR) Ilias Sarantopoulos: [C:+1] "LGTM,thanks!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1146536 (https://phabricator.wikimedia.org/T393154) (owner: Gkyziridis)
[10:09:20] isaranto: Thnx so much for updating the Documentation for ORES! Much appreciated!
[10:09:31] (CR) Gkyziridis: [C:+2] edit-check: Use older based image torch2.3 + rocm6.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1146536 (https://phabricator.wikimedia.org/T393154) (owner: Gkyziridis)
[10:14:38] Happy to help !
[10:52:26] aiko: Please have a look at this in order to deploy it today and test it on staging: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1146548
[10:59:15] on it!
[11:01:14] aiko: thank you soooooooo much
[11:28:50] (PS6) Bartosz Wójtowicz: inference-services: Upgrade pycommit setup. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145888 (https://phabricator.wikimedia.org/T393865)
[11:54:50] Machine-Learning-Team, Editing-team (Tracking): Peacock detection model GPU deployment returns inconsistent results - https://phabricator.wikimedia.org/T393154#10825943 (gkyziridis) == Edit-check Deployed on experimental Staging Using an older version of `pytorch` and `rocm` : [[ https://docker-registry...
[11:55:36] Edit-check consistent results finally: https://phabricator.wikimedia.org/T393154
[11:55:58] \o/ awesome, great work team!
[11:56:58] 🥳
[12:00:18] https://meet.google.com/jfe-bojh-enw?authuser=0
[12:01:13] me and georgekyz are jumping in here to check the ores extension docker config
[12:01:21] *docs not config
[13:20:31] 🎉
[13:32:48] (CR) Nik Gkountas: [C:+2] Popular/search recommander: use domain code in lllang parameter (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508) (owner: Sbisson)
[13:34:18] (CR) CI reject: [V:-1] Popular/search recommander: use domain code in lllang parameter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508) (owner: Sbisson)
[13:38:04] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10826267 (DMburugu)
[13:38:16] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10826270 (DMburugu)
[13:40:21] Machine-Learning-Team, Patch-For-Review: Simplify pre-commit hooks within inference-services repository. - https://phabricator.wikimedia.org/T393865#10826277 (BWojtowicz-WMF) This task focuses on simplifying our pre-commit setup within inference-services repo. The plan is to: 1. Remove `isort`, `black` a...
[13:42:34] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10826281 (kevinbazira) ROCm-enabled PyTorch dependencies like `hipblaslt` (~10GB) and `rocblas` (~3.5GB) are the primary contributors to the largest layer in the wmf-debian...
[13:43:25] elukey: isaranto: o/ I split the `venv` layer into smaller chunks and copied them separately into the runtime image.
the largest layer is now ~2.61GB (compressed): https://phabricator.wikimedia.org/T385173#10826281
[13:43:25] whenever you get a minute, please let me know whether we'll be able to proceed with these sizes on the wikimedia docker registry. thanks!
[13:48:43] oh yes now we can!
[13:51:47] Machine-Learning-Team, Data-Engineering, Essential-Work: Make the revert risk predictions datasets available for analysis - https://phabricator.wikimedia.org/T388453#10826318 (JAllemandou) This has been done by @fkaelin. The dataset is maintained by a research-team pipeline, it should only be radar f...
[13:58:11] (CR) Sbisson: "recheck" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508) (owner: Sbisson)
[14:01:25] sry be there in 2'
[14:10:18] kevin this is great! \o/
[14:10:56] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team (Kanban), Patch-For-Review: PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails - https://phabricator.wikimedia.org/T375280#10826538 (Kgraessle)
[14:11:13] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team (Kanban), Patch-For-Review: PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails - https://phabricator.wikimedia.org/T375280#10826540 (Kgraessle) a:Kgraessle
[14:11:20] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team (Kanban), Patch-For-Review: PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails - https://phabricator.wikimedia.org/T375280#10826541 (Kgraessle) p:Triage→High
[14:13:54] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10826555 (jsn.sherman) >>!
In T391103#10806815, @Ladsgroup wrote: > I made a comment on the patch. I'll...
[14:19:25] kevinbazira, isaranto - ideally the last step will be to add it to the production-images repo, as we did for the pytorch one, and hopefully we'll also be able to build it on build2001 (without requiring ml-lablXXXX)
[14:19:29] what do you think?
[14:20:44] okok I'll proceed with that
[14:22:12] kevinbazira: not sure if you ever used docker-pkg, but it is the tool that we use to build images in production-images
[14:22:25] it is available in pip so easy to use
[14:23:13] thanks! I'll dig into docker-pkg and let you know in case of any issues :)
[14:23:42] say that your image name contains 'vllm', an invocation could be `docker-pkg -c config.yaml build images/ --select '*vllm*'` (from the production-images repo)
[14:25:47] for now, I'll proceed to merge the MR: https://gitlab.wikimedia.org/repos/machine-learning/wmf-debian-vllm/-/merge_requests/2
[14:25:48] if it's ok with you and Ilias
[14:55:12] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10826849 (klausman)
[14:55:57] yes no strong objections, I have no idea what the repo is about, it doesn't seem to trigger a blubber rebuild so ok for me
[14:59:24] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10826857 (jsn.sherman)
[15:02:16] yes and yes!
(merge the MR and then add the image to production images)
[15:42:00] (CR) CI reject: [V:-1] PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails [extensions/ORES] - https://gerrit.wikimedia.org/r/1074472 (https://phabricator.wikimedia.org/T375280) (owner: Jsn.sherman)
[15:45:52] (PS4) Kgraessle: PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails [extensions/ORES] - https://gerrit.wikimedia.org/r/1074472 (https://phabricator.wikimedia.org/T375280) (owner: Jsn.sherman)
[15:51:33] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[23] - https://phabricator.wikimedia.org/T393948#10827139 (RobH)
[15:53:15] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10827152 (RobH)
[15:53:38] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10827161 (RobH)
[15:58:26] (CR) CI reject: [V:-1] PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails [extensions/ORES] - https://gerrit.wikimedia.org/r/1074472 (https://phabricator.wikimedia.org/T375280) (owner: Jsn.sherman)
[16:59:10] (PS5) Kgraessle: PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails [extensions/ORES] - https://gerrit.wikimedia.org/r/1074472 (https://phabricator.wikimedia.org/T375280) (owner: Jsn.sherman)
[16:59:30] (CR) Kgraessle: PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails (3 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/1074472 (https://phabricator.wikimedia.org/T375280) (owner: Jsn.sherman)
[17:16:04] Machine-Learning-Team, MediaWiki-extensions-ORES,
Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10827530 (A_smart_kitten)
[17:18:01] (CR) Nik Gkountas: [V:+2 C:+2] Popular/search recommander: use domain code in lllang parameter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508) (owner: Sbisson)
[17:19:33] (Merged) jenkins-bot: Popular/search recommander: use domain code in lllang parameter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508) (owner: Sbisson)
[18:32:51] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team (Kanban), Patch-For-Review: PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails - https://phabricator.wikimedia.org/T375280#10827897 (Kgraessle) @isarantopoulos and @gkyz...
[18:57:14] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban): Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to - https://phabricator.wikimedia.org/T392148#10827957 (Kgraessle)
[18:58:08] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban): Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to - https://phabricator.wikimedia.org/T392148#10827958 (Kgraessle) >>! In T392148#10822223, @gkyzir...
[18:59:59] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to - https://phabricator.wikimedia.org/T392148#10827961 (Kgraessle)
[19:20:36] Machine-Learning-Team, MediaWiki-extensions-ORES, DBA, MediaWiki-Recent-changes, and 2 others: DBA Review of Tables that ORES Extension will create - https://phabricator.wikimedia.org/T391103#10827997 (Kgraessle)
[19:55:09] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikimedia-Extension-setup, and 2 others: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171#10828105 (Kgraessle) It looks like we're missing the following i18n messages for idwiki: - ores-rcfilte...
[20:00:26] Machine-Learning-Team, Moderator-Tools-Team (Kanban): Insure all ORES i18n messages are available for idwiki - https://phabricator.wikimedia.org/T394455 (Kgraessle) NEW