[06:30:48] good morning
[06:36:02] good morning!
[06:59:25] good morning folks!
[07:02:48] Machine-Learning-Team, Essential-Work: Upgrade remaining model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400144#11156140 (isarantopoulos) Open→Resolved a:isarantopoulos
[07:12:41] Lift-Wing, Machine-Learning-Team: [articletopic-outlink] fetch data from mwapi using revid instead of article title - https://phabricator.wikimedia.org/T371021#11156173 (isarantopoulos) Pointing here another use case that would benefit if the model can be queried using revision id {T403029}
[07:21:14] Machine-Learning-Team, Goal, Patch-For-Review: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11156208 (OKarakaya-WMF) Benchmark completed (except for enwiki): taking micro_precision >= 0.75 micro_recall >= 0.2 as th...
[07:35:40] Machine-Learning-Team, Goal, Patch-For-Review: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11156277 (isarantopoulos) Great results! 🎉 @OKarakaya-WMF From the models/wikis already in production, are there any that...
[08:40:19] Machine-Learning-Team, Goal, Patch-For-Review: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11156485 (OKarakaya-WMF) I've picked the best scores and compared v1 (results from current prod) vs v2 (results from the ne...
[09:09:33] hey folks! The new amd device plugin is ready to be tested in staging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1185865
[09:09:44] it should support the MI300X gpus
[09:47:56] o/ elukey this sounds great!
[09:48:40] klausman: given that we don't have any MI300s in staging what would a testing plan for the partitioning look like?
[09:49:49] I think the best plan would be to do as much testing as we can before putting the machines in production, and if we ever need to re-test, we could cordon one. Access is of course a bit of a question since only SRE would have all the req'd permissions
[09:51:29] ack
[10:00:38] I think that we could grant access to ml-admins
[10:00:59] (for ml-serve1012/ml-serve1013)
[10:01:14] and it makes sense that folks may be able to run amd-smi if needed
[10:01:21] so we could use sudoers for that..
[10:02:37] I put some tests for amd-smi in https://phabricator.wikimedia.org/T403697#11147141. Very very quick ones, but I am wondering if the version of amd-smi on Trixie is up-to-date enough
[10:04:03] so there are multiple questions to work on: 1) what target os should we use? 2) What is the version of amd-smi that works for our use case? (namely, I expect to be able to do simple partitioning etc..) 3) How many partitions can we create, and what is the best for us?
[10:04:16] 3) may also end up being a variety of configs
[10:05:47] +1 on granting access to ml-team-admins
[10:06:33] iirc we should be able to create 8 partitions (24GB each) -- at least according to the GPU specs
[11:32:21] (PS1) Nik Gkountas: Add lead section size to article recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889
[11:32:27] (CR) CI reject: [V:-1] Add lead section size to article recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889 (owner: Nik Gkountas)
[11:41:05] (PS2) Nik Gkountas: Add lead section size to article recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889
[11:41:48] (CR) CI reject: [V:-1] Add lead section size to article recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889 (owner: Nik Gkountas)
[12:51:44] FIRING: LiftWingServiceErrorRate: ...
[12:51:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=hewiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:55:22] Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11157321 (gkyziridis) I synced with @BWojtowicz-WMF on the current status of this investigation. I am pasting the latest findings from last week. - **Tests:** - The latest tests configured...
[12:56:44] RESOLVED: LiftWingServiceErrorRate: ...
[12:56:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=hewiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:10:35] (PS1) Nik Gkountas: section recommendations: filter out appendix sections from missing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185968
[15:13:25] (PS2) Nik Gkountas: section recommendations: filter out appendix sections from missing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185968
[15:15:11] (PS3) Nik Gkountas: Add lead section size to article recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889
[15:20:45] Hello, I've created an MR in airflow-dags. We add more wikis to staging release dag. These wikis have passed the release threshold. Can you take a look when you have time? @kevinbazira https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1654
[15:21:34] I've just upgraded the amd k8s plugin on ml-staging2001 and ml-staging2003, is there a specific pod that requires a GPU that I can test?
[15:23:11] (PS3) Nik Gkountas: section recommendations: filter out appendix sections from missing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185968 (https://phabricator.wikimedia.org/T403976)
[15:25:48] ozge_: ack... looking
[15:59:45] I don't see any ml-staging deployment requiring a GPU, is it right?
[16:16:22] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs - https://phabricator.wikimedia.org/T398600#11158665 (elukey) Deployed on ml-staging, everything looks good. Next steps: - Check on staging that scheduling pods with a GPU works as expec...
[16:19:11] thank you @kevinbazira . The pipeline has finished.
[16:25:40] great. +1
[16:26:54] 🙌
[17:26:48] Machine-Learning-Team, Data-Persistence, Growth-Team, Revise-Tone-Structured-Task, and 2 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11159027 (Ottomata) Quick note about event based solutions (we discussed this in...
[17:28:35] Machine-Learning-Team, Data-Persistence, Growth-Team, Revise-Tone-Structured-Task, and 2 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11159042 (Ottomata) @achou FYI, I am prioritizing my time working on {T403660} an...
[17:30:14] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11159049 (Ottomata) @achou, @KStoller-WMF: Quick question: what...
[19:58:11] (CR) Sbisson: [C:-1] Add lead section size to article recommendations (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889 (owner: Nik Gkountas)
[19:58:40] (CR) Sbisson: [C:+2] Remove format from ecsformatter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185867 (https://phabricator.wikimedia.org/T400562) (owner: Abijeet Patro)
[19:59:22] (Merged) jenkins-bot: Remove format from ecsformatter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185867 (https://phabricator.wikimedia.org/T400562) (owner: Abijeet Patro)