[06:40:28] <georgekyz>	 good morning folks
[06:48:00] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Patch-For-Review: Host an OpenVINO model in LiftWing - https://phabricator.wikimedia.org/T395012#11161126 (santhosh) The model repository setup is easy with the new versions.  OVMS can pull models from huggingface repos and setup all configurations. I wrote a shellsc...
[06:48:57] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Patch-For-Review: Host an OpenVINO model in LiftWing - https://phabricator.wikimedia.org/T395012#11161134 (santhosh) > https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1153988  This patch to prepare a WMF production image is outdated agai...
[06:49:12] <ozge_>	 good morning
[06:57:03] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Patch-For-Review: Host an OpenVINO model in LiftWing - https://phabricator.wikimedia.org/T395012#11161152 (santhosh) Since LiftWing hosting plan is pending and under consideration by #machine-learning-team , I setup https://ovms.wmcloud.org/ with upstream docker imag...
[07:18:00] <isaranto>	 hello!
[07:42:32] <elukey>	 hey folks morning
[07:43:04] <elukey>	 re-asking the question - is there a pod in staging that uses a gpu? I'd like to kill it to see if the new version of the amd device plugin works
[07:43:12] <elukey>	 (the daemon that exposes the GPU to the kubelet)
[07:46:04] <aiko>	 morning! elukey: as far as I remember, no pod is currently using a GPU
[07:46:10] <isaranto>	 hey! iirc no there is not, but we would like to add one now
[07:46:48] <isaranto>	 we decided to use a GPU for edit-check until we conclude the CPU deployment investigation
[07:47:10] <isaranto>	 bartosz: ^^ 
[07:47:15] <isaranto>	 I'm opening a patch
[07:47:26] <aiko>	 sgtm!
[07:48:04] <elukey>	 super thanks :)
[07:48:39] <isaranto>	 elukey: would a new deployment work or do you want to see something else? I'll open the patch in a bit (I'm finishing something)
[07:50:01] <elukey>	 isaranto: it should work, both ml-staging2001 and 2003 have the new plugin, and in the logs I see that a GPU is recognized
[07:50:24] <elukey>	 there are also new logs about finding memory/compute partitions (failing of course in staging), which look promising
[07:52:18] <elukey>	 (need to go to the dentist, back in an hour)
[07:56:06] <isaranto>	 ok, we'll have it ready!
[07:56:30] <isaranto>	 good luck with the dentist -- hope it is nothing serious <3
[08:07:27] <wikibugs>	 Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11161317 (isarantopoulos) After syncing with the Editing team on the latency SLO, we decided to enable a GPU until we investigate the issue. I'm going to open a patch to deploy and test on stag...
[08:07:53] <wikibugs>	 (CR) Nik Gkountas: Add lead section size to article recommendations (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889 (owner: Nik Gkountas)
[08:08:13] <wikibugs>	 (PS4) Nik Gkountas: Add lead section size to article recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889
[08:08:55] <isaranto>	 here it is! Can someone review please? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1186431
[08:12:01] <aiko>	 +1!
[08:14:30] <bartosz>	 +1 from me as well! 
[08:21:41] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11161388 (isarantopoulos)  > The latest tests configured to be close to reality. Texts between 1k-2k characters num_words = random.randint(40, 56)  @gkyziridis I still see `n...
[08:21:42] <isaranto>	 Thank you very much!
[08:39:22] <isaranto>	 deployed!
[09:01:37] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] edit-check: Update locust tests. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1183684 (https://phabricator.wikimedia.org/T400460) (owner: Gkyziridis)
[09:13:05] <wikibugs>	 Machine-Learning-Team, Growth-Team, Revise-Tone-Structured-Task, Goal, OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11161603 (achou) Data collection is complete and [[ https://docs.google.com/sp...
[09:23:10] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11161635 (isarantopoulos) I enabled the GPU in staging and used the following load test config to have texts between 1k-2k characters: ` def get_random_input_params():     nu...
[09:23:59] <wikibugs>	 Machine-Learning-Team, Growth-Team, Revise-Tone-Structured-Task, Goal, OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11161638 (achou) >>! In T401968#11161603, @achou wrote: >     - For cswiki, wh...
[09:24:40] <isaranto>	 If you folks agree I can proceed and deploy the GPU change in prod https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1186447
[09:24:59] <isaranto>	 I shared the load test results in the task https://phabricator.wikimedia.org/T403378#11161635
[09:39:10] <bartosz>	 isaranto: Thank you! So we're averaging 120ms response time on long inputs for the GPU-enabled edit-check? 
[09:39:44] <elukey>	 root@deploy1003:~# kubectl exec edit-check-predictor-00010-deployment-8699cf6dc7-5xz98 -n edit-check -- ls /dev/dri
[09:39:44] <elukey>	 card1
[09:39:44] <elukey>	 renderD128
[09:39:49] <elukey>	 perfect :)
[09:40:11] <elukey>	 so the edit check pod in staging got scheduled on ml-staging2001 and the GPU was correctly exported
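A minimal sketch of how the manual check above could be scripted. It only parses the text that `ls /dev/dri` prints inside the pod. The helper names are hypothetical (not part of any existing tooling), and treating "at least one `cardN` plus one `renderDN` entry" as proof that the GPU is exposed is an assumption made for illustration.

```python
import re

# Hypothetical helpers: classify the entries printed by `ls /dev/dri`.
CARD_NODE = re.compile(r"^card\d+$")
RENDER_NODE = re.compile(r"^renderD\d+$")


def gpu_device_nodes(ls_output: str) -> dict[str, list[str]]:
    """Split the listing into DRM card nodes and render nodes."""
    entries = ls_output.split()
    return {
        "cards": [e for e in entries if CARD_NODE.match(e)],
        "render_nodes": [e for e in entries if RENDER_NODE.match(e)],
    }


def gpu_is_exposed(ls_output: str) -> bool:
    """Assumption: a GPU counts as exposed when both node types are present."""
    nodes = gpu_device_nodes(ls_output)
    return bool(nodes["cards"]) and bool(nodes["render_nodes"])


if __name__ == "__main__":
    # Same listing as in the kubectl exec output pasted above.
    sample = "card1\nrenderD128\n"
    print(gpu_device_nodes(sample))
    print(gpu_is_exposed(sample))
```

The input would be whatever the `kubectl exec ... -- ls /dev/dri` command returns; an empty listing yields `False`.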
[09:42:32] <wikibugs>	 Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs - https://phabricator.wikimedia.org/T398600#11161707 (elukey) The ML team deployed edit check in staging requiring a GPU, it got scheduled and I checked this:  ` root@deploy1003:~# kubect...
[09:42:44] <elukey>	 updated --^
[09:43:03] <elukey>	 so I think we can roll it out everywhere, it should support the MI300X gpus too
[09:43:31] <isaranto>	 bartosz: yes. Avg is 127ms and the median is 120ms
[10:13:59] <isaranto>	 lovely Luca!
[10:20:48] <isaranto>	 bartosz: if you are ok with the patch I'll proceed to deploy to prod https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1186447
[10:20:50] <isaranto>	 cc:georgekyz 
[10:21:27] <kevinbazira>	 georgekyz: I have pushed an MR that refactors the tone-check retraining job logic (image and script) for DAG integration.
[10:21:27] <kevinbazira>	 please review it whenever you get a minute: https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/merge_requests/44
[10:21:27] <kevinbazira>	 thanks!
[10:25:37] <bartosz>	 isaranto: Sounds good to me, +1 🙌 
[10:26:20] <isaranto>	 Thank you!
[10:41:00] <isaranto>	 deployed to codfw, proceeding to deploy to eqiad...
[10:50:04] <isaranto>	 done! will keep an eye over the next hours and report back
[11:03:27] <elukey>	 isaranto: is it ok if I reboot ml-lab1001 or 1002?
[11:04:24] <isaranto>	 totally fine by me, I don't know if anyone else is running anything on it at the moment.
[11:04:33] <isaranto>	 if you reboot it we'll find out :P
[11:05:44] <georgekyz>	 kevinbazira: I approved the MR and left a small comment. Thanks for working on that
[11:07:27] <kevinbazira>	 thanks for the review!
[11:08:31] <kevinbazira>	 I will test the DAG once the merged image has been built and pushed to the registry
[11:10:48] <wikibugs>	 Machine-Learning-Team: Article topic cache backfilling using article_topic hive table - https://phabricator.wikimedia.org/T403254#11162087 (isarantopoulos) @Ottomata iirc you mentioned that you already have an ETL setup that could do this. Is this the case? If yes, could you provide more info or maybe a l...
[11:26:14] <georgekyz>	 kevinbazira: 👍
[11:40:31] <wikibugs>	 (PS1) Gkyziridis: edit-check: Update locust tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1186482 (https://phabricator.wikimedia.org/T403378)
[11:41:21] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11162206 (gkyziridis) Tests on experimental staging CPU. Patch is ready  ` users = 60 spawn-rate = 1 run-time = 200s  num_words = random.randint(40, 56) wait_time = between(0...
[12:11:31] <ozge_>	 hello, I have two MRs for prod release and simplifying the staging release. More desc in the MRs. https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/merge_requests/43 https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1656  Can you take a look when you have time? @kevinbazira
[12:11:48] <kevinbazira>	 ack...looking
[12:26:52] <wikibugs>	 (CR) Sbisson: [C:+2] section recommendations: filter out appendix sections from missing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185968 (https://phabricator.wikimedia.org/T403976) (owner: Nik Gkountas)
[12:27:34] <wikibugs>	 (Merged) jenkins-bot: section recommendations: filter out appendix sections from missing [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185968 (https://phabricator.wikimedia.org/T403976) (owner: Nik Gkountas)
[12:31:52] <wikibugs>	 (CR) Sbisson: [C:+2] Add lead section size to article recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889 (owner: Nik Gkountas)
[12:32:28] <wikibugs>	 (Merged) jenkins-bot: Add lead section size to article recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1185889 (owner: Nik Gkountas)
[12:45:02] <wikibugs>	 Machine-Learning-Team, Goal, Patch-For-Review: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11162484 (OKarakaya-WMF)
[14:28:46] <wikibugs>	 Machine-Learning-Team, Goal, Patch-For-Review: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11163077 (OKarakaya-WMF) csv in the previous comment is also available [here](https://docs.google.com/spreadsheets/d/1gwneJ...
[14:47:49] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[14:47:49] <jinxer-wm>	 Deployment revertrisk-language-agnostic-predictor-default-00030-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[14:47:49] <jinxer-wm>	 https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-language-agnostic-predictor-default-00030-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:53:39] <wikibugs>	 Machine-Learning-Team: Kserve batcher doesn't seem to be properly configured for edit-check - https://phabricator.wikimedia.org/T403423#11163188 (isarantopoulos) Open→Resolved
[14:53:57] <wikibugs>	 Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11163190 (isarantopoulos) Open→Resolved
[14:55:26] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11163198 (isarantopoulos) a:gkyziridis→BWojtowicz-WMF
[14:57:02] <wikibugs>	 Machine-Learning-Team, Editing-team (Tracking): Incorporate Tone-check Retraining Notebook in ml-pipelines - https://phabricator.wikimedia.org/T401007#11163203 (isarantopoulos) Open→Resolved
[14:57:49] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[14:57:49] <jinxer-wm>	 Deployment revertrisk-language-agnostic-predictor-default-00030-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[14:57:49] <jinxer-wm>	 https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-language-agnostic-predictor-default-00030-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:01:28] <bartosz>	 ^ it seems we had 2 out of 5 replicas unavailable for ~30mins, I'll look into whether there was an underlying error causing them to crash
[16:30:14] <wikibugs>	 Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11163883 (achou) Last week I met with @Michael to discuss the g...
[16:37:56] <elukey>	 "Thanks to quantization that shrank the size of the MoE layers – reducing their precision to about 4.25 bits per parameter – the gpt‑oss‑120B can run efficiently on a single 80 GB GPU"
[16:38:17] <elukey>	 I was reading this and I wondered if nowadays an 80 GB GPU is considered something "easy" to run/buy
[16:38:30] <elukey>	 like a 1TB SSD :D
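A rough back-of-the-envelope check of the quoted claim, as a sketch only. It assumes a nominal 120e9 parameters and, as a simplification, applies 4.25 bits to every parameter, although the quote says only the MoE layers were quantized. It also ignores activations and the KV cache. Under those assumptions the weights come to roughly 64 GB, versus 240 GB at 16 bits per parameter.

```python
def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 120e9  # assumption: nominal parameter count taken from the model name
    for bits in (4.25, 16.0):
        # Simplification: every parameter is stored at the same precision.
        print(f"{bits:>5} bits/param -> {weight_footprint_gb(params, bits):.2f} GB")
```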
[16:52:10] <wikibugs>	 Machine-Learning-Team: Article topic cache backfilling using article_topic hive table - https://phabricator.wikimedia.org/T403254#11163974 (Ottomata) Hmm, I think this?  There is prior art in airflow-dags where people use the Spark Cassandra connector along with stored (and templated) SQL queries to insert d...
[17:48:06] <wikibugs>	 (PS1) Nik Gkountas: add filtering based on lead section size for article suggestions [research/recommendation-api] - https://gerrit.wikimedia.org/r/1186558 (https://phabricator.wikimedia.org/T403730)
[19:38:45] <wikibugs>	 Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11164941 (Ottomata) Great stuff thank you Aiko!  > the ML team...
[22:55:38] <wikibugs>	 (PS2) Sbisson: add filtering based on lead section size for article suggestions [research/recommendation-api] - https://gerrit.wikimedia.org/r/1186558 (https://phabricator.wikimedia.org/T403730) (owner: Nik Gkountas)