[06:40:28] good morning folks
[06:48:00] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Host an OpenVINO model in LiftWing - https://phabricator.wikimedia.org/T395012#11161126 (santhosh) The model repository setup is easy with the new versions. OVMS can pull models from huggingface repos and setup all configurations. I wrote a shellsc...
[06:48:57] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Host an OpenVINO model in LiftWing - https://phabricator.wikimedia.org/T395012#11161134 (santhosh) > https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1153988 This patch to prepare a WMF production image is outdated agai...
[06:49:12] good morning
[06:57:03] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Host an OpenVINO model in LiftWing - https://phabricator.wikimedia.org/T395012#11161152 (santhosh) Since LiftWing hosting plan is pending and under consideration by #machine-learning-team , I setup https://ovms.wmcloud.org/ with upstream docker imag...
[07:18:00] hello!
[07:42:32] hey folks morning
[07:43:04] re-asking the question - is there a pod in staging that uses a gpu? I'd like to kill it to see if the new version of the amd device plugin works
[07:43:12] (the daemon that exposes the GPU to the kubelet)
[07:46:04] morning! elukey: I remember currently no pod using a gpu
[07:46:10] hey! iirc no there is not, but we would like to add one now
[07:46:48] we decided to use a GPU for edit-check until we conclude the CPU deployment investigation
[07:47:10] bartosz: ^^
[07:47:15] I'm opening a patch
[07:47:26] sgtm!
[07:48:04] super thanks :)
[07:48:39] elukey: would a new deployment work or do you want to see sth else?
I'll open the patch in a bit (I'm finishing sth)
[07:50:01] isaranto: it should work, both ml-staging2001 and 2003 have the new plugin, and in the logs I see that a GPU is recognized
[07:50:24] there are also new logs about finding memory/computer partitions (failing of course in staging), that look promising
[07:52:18] (need to go to the dentist, back in an hour)
[07:56:06] ok, we'll have it ready!
[07:56:30] good luck with the dentist -- hope it is nothing serious <3
[08:07:27] Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11161317 (isarantopoulos) After syncing with the Editing team on the latency SLO and we decided to enable a GPU until we investigate the issue. I'm going to open a patch to deploy and test on stag...
[08:07:53] (CR) Nik Gkountas: Add lead section size to article recommendations (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889 (owner: Nik Gkountas)
[08:08:13] (PS4) Nik Gkountas: Add lead section size to article recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889
[08:08:55] here it is! Can someone review plz? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1186431
[08:12:01] +1!
[08:14:30] +1 from me as well!
[08:21:41] Machine-Learning-Team, Patch-For-Review: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11161388 (isarantopoulos) > The latest tests configured to be close to reality. Texts between 1k-2k characters num_words = random.randint(40, 56) @gkyziridis I still see `n...
[08:21:42] Thank you very much!
[08:39:22] deployed!
[09:01:37] (CR) Ilias Sarantopoulos: [C:+2] edit-check: Update locust tests.
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1183684 (https://phabricator.wikimedia.org/T400460) (owner: Gkyziridis)
[09:13:05] Machine-Learning-Team, Growth-Team, Revise-Tone-Structured-Task, Goal, OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11161603 (achou) Data collection is complete and [[ https://docs.google.com/sp...
[09:23:10] Machine-Learning-Team, Patch-For-Review: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11161635 (isarantopoulos) I enabled the GPU in staging and used the following load test config to have texts between 1k-2k characters: ` def get_random_input_params(): nu...
[09:23:59] Machine-Learning-Team, Growth-Team, Revise-Tone-Structured-Task, Goal, OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11161638 (achou) >>! In T401968#11161603, @achou wrote: > - For cswiki, wh...
[09:24:40] If you folks agree I can proceed and deploy the GPU change in prod https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1186447
[09:24:59] I shared the load test results in the task https://phabricator.wikimedia.org/T403378#11161635
[09:39:10] isaranto: Thank you! So we're averaging 120ms response time on long inputs for the GPU-enabled edit-check?
[09:39:44] root@deploy1003:~# kubectl exec edit-check-predictor-00010-deployment-8699cf6dc7-5xz98 -n edit-check -- ls /dev/dri
[09:39:44] card1
[09:39:44] renderD128
[09:39:49] perfect :)
[09:40:11] so the edit check pod in staging got scheduled on ml-staging2001 and the GPU was correctly exported
[09:42:32] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs - https://phabricator.wikimedia.org/T398600#11161707 (elukey) The ML team deployed edit check in staging requiring a GPU, it got scheduled and I checked this: ` root@deploy1003:~# kubect...
[09:42:44] updated --^
[09:43:03] so I think we can roll it out everywhere, it should support the MI300X gpus too
[09:43:31] bartosz: yes. Avg is 127 and median is also 120ms
[10:13:59] lovely Luca!
[10:20:48] bartosz: if you are ok with the patch I'll proceed to deploy to prod https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1186447
[10:20:50] cc:georgekyz
[10:21:27] georgekyz: I have pushed an MR that refactors the tone-check retraining job logic (image and script) for DAG integration.
[10:21:27] please review it whenever you get a minute: https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/merge_requests/44
[10:21:27] thanks!
[10:25:37] isaranto: Sounds good to me, +1 🙌
[10:26:20] Thank you!
[10:41:00] deployed to codfw, proceeding to deploy to eqiad...
[10:50:04] done! will keep an eye over the next hours and report back
[11:03:27] isaranto: is it ok if I reboot ml-lab1001 or 1002?
[11:04:24] totally fine by me, I don't know if anyone else is running anything on it at the moment.
[11:04:33] if you reboot it we'll find out :P
[11:05:44] kevinbazira: I approved the MR left a small comment. Thnx for working on that
[11:07:27] thanks for the review!
[11:08:31] I will test the DAG once the merged image has been built and pushed to the registry
[11:10:48] Machine-Learning-Team: Article topic cache backfilling using article_topic hive table - https://phabricator.wikimedia.org/T403254#11162087 (isarantopoulos) @Ottomata iirc you mentioned that you've already have an ETL setup that could do this. Is this the case? If yes, could you provide more info or maybe a l...
[11:26:14] kevinbazira: 👍
[11:40:31] (PS1) Gkyziridis: edit-check: Update locust tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1186482 (https://phabricator.wikimedia.org/T403378)
[11:41:21] Machine-Learning-Team, Patch-For-Review: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11162206 (gkyziridis) Tests on experimental staging CPU. Patch is ready ` users = 60 spawn-rate = 1 run-time = 200s num_words = random.randint(40, 56) wait_time = between(0...
[12:11:31] hello, I have two MRs for prod release and simplifying the staging release. More desc in the MRs. https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/merge_requests/43 https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1656 Can you take a look when you have time?
@kevinbazira
[12:11:48] ack...looking
[12:26:52] (CR) Sbisson: [C:+2] section recommendations: filter out appendix sections from missing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185968 (https://phabricator.wikimedia.org/T403976) (owner: Nik Gkountas)
[12:27:34] (Merged) jenkins-bot: section recommendations: filter out appendix sections from missing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185968 (https://phabricator.wikimedia.org/T403976) (owner: Nik Gkountas)
[12:31:52] (CR) Sbisson: [C:+2] Add lead section size to article recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889 (owner: Nik Gkountas)
[12:32:28] (Merged) jenkins-bot: Add lead section size to article recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889 (owner: Nik Gkountas)
[12:45:02] Machine-Learning-Team, Goal, Patch-For-Review: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11162484 (OKarakaya-WMF)
[14:28:46] Machine-Learning-Team, Goal, Patch-For-Review: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11163077 (OKarakaya-WMF) csv in the previous comment is also available [here](https://docs.google.com/spreadsheets/d/1gwneJ...
[14:47:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[14:47:49] Deployment revertrisk-language-agnostic-predictor-default-00030-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[14:47:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-language-agnostic-predictor-default-00030-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:53:39] Machine-Learning-Team: Kserve batcher doesn't seem to be properly configured for edit-check - https://phabricator.wikimedia.org/T403423#11163188 (isarantopoulos) Open→Resolved
[14:53:57] Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11163190 (isarantopoulos) Open→Resolved
[14:55:26] Machine-Learning-Team, Patch-For-Review: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11163198 (isarantopoulos) a:gkyziridis→BWojtowicz-WMF
[14:57:02] Machine-Learning-Team, Editing-team (Tracking): Incorporate Tone-check Retraining Notebook in ml-pipelines - https://phabricator.wikimedia.org/T401007#11163203 (isarantopoulos) Open→Resolved
[14:57:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[14:57:49] Deployment revertrisk-language-agnostic-predictor-default-00030-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[14:57:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-language-agnostic-predictor-default-00030-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:01:28] ^ it seems we had 2 out of 5 replicas unavailable for ~30mins, will be looking if there was an underlying error causing them to crash
[16:30:14] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11163883 (achou) Last week I met with @Michael to discuss the g...
[16:37:56] "Thanks to quantization that shrank the size of the MoE layers – reducing their precision to about 4.25 bits per parameter – the gpt‑oss‑120B can run efficiently on a single 80 GB GPU"
[16:38:17] I was reading this and I wondered if nowadays a 80G GPU is considered something "easy" to run/buy
[16:38:30] like a 1TB SSD :D
[16:52:10] Machine-Learning-Team: Article topic cache backfilling using article_topic hive table - https://phabricator.wikimedia.org/T403254#11163974 (Ottomata) Hmm, I think this? There is prior art in airflow-dags where people use the Spark Cassandra connector along with stored (and templated) SQL queries to insert d...
[17:48:06] (PS1) Nik Gkountas: add filtering based on lead section size for article suggestions [research/recommendation-api] - https://gerrit.wikimedia.org/r/1186558 (https://phabricator.wikimedia.org/T403730)
[19:38:45] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11164941 (Ottomata) Great stuff thank you Aiko!
> the ML team...
[22:55:38] (PS2) Sbisson: add filtering based on lead section size for article suggestions [research/recommendation-api] - https://gerrit.wikimedia.org/r/1186558 (https://phabricator.wikimedia.org/T403730) (owner: Nik Gkountas)
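
Re the 16:37:56 quote about gpt-oss-120B fitting on a single 80 GB GPU: a rough back-of-the-envelope check, as a sketch only. It assumes a round 120e9 parameters all stored at 4.25 bits (the quote says only the MoE layers were quantized, so this is a simplification), uses decimal gigabytes, and ignores activations and KV cache. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter uses the same number of bits, which is a
    simplification of the mixed-precision setup described in the quote.
    """
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # Assumed round figure for the parameter count; not taken from a model card.
    params = 120e9
    quantized = weight_memory_gb(params, 4.25)
    sixteen_bit = weight_memory_gb(params, 16)
    print(f"weights at 4.25 bits/param: {quantized:.2f} GB")
    print(f"weights at 16 bits/param:   {sixteen_bit:.0f} GB")
    print("weights alone fit under 80 GB:", quantized < 80)
```

Under these assumptions the weights come to roughly 64 GB at 4.25 bits versus 240 GB at 16 bits, which is why the quantized model can fit on one 80 GB card with some headroom left for runtime state.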