[00:26:55] (PS1) Sbisson: remove unused simple english domain map entries [research/recommendation-api] - https://gerrit.wikimedia.org/r/1145357
[00:32:15] (PS2) Sbisson: Remove unused simple english domain map entries [research/recommendation-api] - https://gerrit.wikimedia.org/r/1145357
[03:17:05] (CR) CI reject: [V:-1] build: Updating mediawiki/mediawiki-codesniffer to 47.0.0 [extensions/ORES] - https://gerrit.wikimedia.org/r/1145388 (owner: Libraryupgrader)
[06:49:26] Good morning
[07:14:55] hello!
[07:15:57] good morning!
[07:31:00] good morning folks
[07:36:12] (CR) Nikerabbit: Remove unused simple english domain map entries (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1145357 (owner: Sbisson)
[07:40:43] Mornin'
[07:57:44] Folks, there was an update to helm-lint and the bug was fixed this morning. I am merging this: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1144521?checksRunsSelected=test,test-prio
[07:58:26] yes, go ahead!
[08:02:46] dunkje
[09:48:15] Update on Edit-check:
[09:48:15] I deployed the same instance to experimental production in two versions (cpu/gpu). The results are still inconsistent. Updates are on the phab ticket: https://phabricator.wikimedia.org/T393154#10804000
[10:20:06] Machine-Learning-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10821023 (OKarakaya-WMF) Sharing some notes: # Model/Training - What are common and un-common things among the languages? - Problem: Training is manual. - Sol...
[10:26:11] Machine-Learning-Team, Language and Product Localization: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10821031 (kevinbazira) a:kevinbazira
[10:26:24] Machine-Learning-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10821033 (OKarakaya-WMF) Sharing some options that we consider: # Options - Option1: Reduce the number of models - Continue on the model experiments on both x...
[10:27:52] Machine-Learning-Team, Language and Product Localization: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10821039 (kevinbazira) Hi @KartikMistry, we have uploaded all MinT models to a swift bucket: ` $ ls /home/kartik/models indictrans2 madlad400 nllb opusmt sha512sums.txt...
[10:30:17] hi kart_: o/ MinT models have been uploaded to Swift and the public wmf model repo: https://phabricator.wikimedia.org/T391958#10821031
[10:30:17] georgekyz: bartosz: ozge_: I've shared the notes from our model uploading session here: https://phabricator.wikimedia.org/P76139
[10:30:17] in case you'd like to reference them in the future
[10:32:46] Thank you Kevin. I’ve shared some findings about add-a-link in the links above. I’ll go through them in the meeting today.
[10:33:19] Machine-Learning-Team, Language and Product Localization: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10821084 (elukey) @kevinbazira there is something odd with the SHA512 in https://analytics.wikimedia.org/published/wmf-ml-models/mint/20250514081434, I see only `sha512sum...
[10:33:22] sure sure
[10:39:59] Machine-Learning-Team, Language and Product Localization: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10821107 (kevinbazira) @elukey the `sha512sums.txt` is based on `/home/kartik/models/sha512sums.txt`. I did check the SHA of the files and they match with T391958#10794576...
[10:50:40] Machine-Learning-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10821149 (Michael) >>! In T393474#10803756, @OKarakaya-WMF wrote: > Hey, @Michael. We have recently started investigation about the add-a-link model. Please feel f...
[10:55:09] (CR) Nik Gkountas: [C:+1] "Patch is good. Leaving +1 to address Niklas's comment" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1145357 (owner: Sbisson)
[10:58:45] Machine-Learning-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10821183 (Michael) >>! In T393474#10821033, @OKarakaya-WMF wrote: > Sharing some options that we consider: > > # Options > > - Option1: Reduce the number of model...
[11:10:37] kevinbazira: Thanks. I will check and update the task as well.
[11:12:27] (CR) Nik Gkountas: [C:-1] "The patch is fine. Just minor improvements suggested" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508) (owner: Sbisson)
[11:47:36] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10821370 (Michael)
[12:18:36] Machine-Learning-Team, Language and Product Localization: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10821504 (elukey) @kevinbazira the sha512 files are automatically created by the script that uploads the model binaries, it is stored in the puppet's repo: `./modules/prof...
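Editorial sketch for the checksum discussion above: one way to re-check uploaded model files against a `sha512sums.txt` manifest. It assumes the manifest uses the `sha512sum` output format (`<hex digest>  <relative path>` per line), which the log does not confirm; `sha512_of` and `verify_manifest` are hypothetical helper names, and this is not the upload script from the puppet repo.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path) -> dict[str, bool]:
    """Check every entry of a sha512sum-style manifest (assumed format).

    Paths are resolved relative to the manifest's directory. A missing
    file is reported as a mismatch rather than raising.
    """
    results: dict[str, bool] = {}
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # sha512sum marks binary mode with '*'
        target = manifest.parent / name
        results[name] = target.is_file() and sha512_of(target) == expected.lower()
    return results
```

Calling `verify_manifest(Path("models/sha512sums.txt"))` (path hypothetical) returns a name-to-bool mapping, so a mismatch or a missing file shows up per entry instead of stopping at the first failure.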
[12:36:19] Machine-Learning-Team: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301 (elukey) NEW
[12:36:39] Machine-Learning-Team, Language and Product Localization: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10821567 (elukey) Created T394301 to track the work of creating a new script :)
[12:40:06] (PS1) Bartosz Wójtowicz: inference-services: Upgrade pycommit setup. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145888 (https://phabricator.wikimedia.org/T393865)
[12:44:48] Machine-Learning-Team, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10821585 (OKarakaya-WMF) Thank you @Michael , I've updated the options with your comments. About Elastic search: I want to take a look into the d...
[12:46:41] (CR) CI reject: [V:-1] inference-services: Upgrade pycommit setup. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1145888 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[13:12:59] Machine-Learning-Team, Language and Product Localization: Create a new S3 bucket for MinT - https://phabricator.wikimedia.org/T391958#10821715 (kevinbazira) @elukey thank you for creating the task to create the new script. I've added `.sha512` files for all MinT models in the public repo.
[13:48:07] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10821914 (kevinbazira) @elukey, @isarantopoulos: in T385173#10816452 the biggest layer size of the wmf-debian-vllm image is ~23.5GB (uncompressed). I uploaded this image to...
[14:06:42] elukey: isaranto: o/ I shared the biggest layer sizes of the wmf-debian-vllm image in: https://phabricator.wikimedia.org/T385173#10821914
[14:06:42] whenever you get a minute, please let me know whether we'll be able to proceed with these sizes on the wikimedia docker registry. thanks!
[14:08:25] kevinbazira: o/ 4GB is the max limit for a compressed layer
[14:08:26] o/ sorry in meetings. thanks for sharing that! unfortunately the limit on the registry side is 4GB :(
[14:09:38] :'(
[14:09:57] what if we break down the pip install torch layer into 2 steps/layers? first install torch with --no-deps and then install deps
[14:10:18] that said this is a hack that might be expensive to maintain
[14:11:50] the heaviest layer is `venv` in the runtime variant (final image): https://phabricator.wikimedia.org/P76040$101
[14:39:17] plz disregard my suggestion, it didn't make sense. probably then breaking down the copy command into multiple ones would solve this issue
[15:38:41] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10822741 (isarantopoulos) The compressed layers need to be less than 4GB so this will not work. Looking at the largest layer which is the copy of the venv directory that i...
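Editorial sketch for the layer-size discussion above: one way to plan how a large directory such as the venv could be split across several copy steps so that no single group exceeds a size budget. It uses first-fit-decreasing bin packing over per-entry sizes supplied by the caller. The function name `plan_layers` and the package names in the usage note are hypothetical, and the registry limit applies to compressed layers, so any sizes fed in would need to be estimates of compressed size (or a conservative budget); the team's actual approach is not shown in the log.

```python
def plan_layers(sizes: dict[str, int], limit: int) -> list[list[str]]:
    """Group entries so each group's total size stays within `limit`.

    `sizes` maps an entry name (for example a top-level directory in
    site-packages) to its size in any consistent unit. Entries are placed
    largest first into the first group with enough room (first-fit
    decreasing). An entry bigger than `limit` cannot be placed and has to
    be split further by the caller, so it raises ValueError.
    """
    oversized = sorted(name for name, size in sizes.items() if size > limit)
    if oversized:
        raise ValueError(f"entries larger than the limit: {oversized}")

    layers: list[list[str]] = []
    remaining: list[int] = []  # free space left in each group
    # Sort by size descending, then by name for a deterministic plan.
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        for index, free in enumerate(remaining):
            if size <= free:
                layers[index].append(name)
                remaining[index] -= size
                break
        else:
            layers.append([name])
            remaining.append(limit - size)
    return layers
```

For example, `plan_layers({"torch": 3, "triton": 2, "vllm": 2, "misc": 1}, limit=4)` yields two groups, `["torch", "misc"]` and `["triton", "vllm"]`; each group could then map to its own copy instruction in the Dockerfile.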
[15:41:14] (PS5) Sbisson: Popular/search recommander: use domain code in lllang parameter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508)
[15:41:26] (CR) Sbisson: Popular/search recommander: use domain code in lllang parameter (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508) (owner: Sbisson)
[15:42:48] (CR) CI reject: [V:-1] Popular/search recommander: use domain code in lllang parameter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508) (owner: Sbisson)
[15:47:01] (PS6) Sbisson: Popular/search recommander: use domain code in lllang parameter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508)
[15:49:01] (PS3) Sbisson: Remove unused simple english domain map entries [research/recommendation-api] - https://gerrit.wikimedia.org/r/1145357
[16:15:57] * isaranto afk!
[16:34:10] (CR) Eamedina: [C:+2] Remove unused simple english domain map entries [research/recommendation-api] - https://gerrit.wikimedia.org/r/1145357 (owner: Sbisson)
[16:34:50] (Merged) jenkins-bot: Remove unused simple english domain map entries [research/recommendation-api] - https://gerrit.wikimedia.org/r/1145357 (owner: Sbisson)
[16:40:14] (PS7) Sbisson: Popular/search recommander: use domain code in lllang parameter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1143605 (https://phabricator.wikimedia.org/T306508)
[16:51:16] Machine-Learning-Team, Editing-team (Tracking): Peacock detection model GPU deployment returns inconsistent results - https://phabricator.wikimedia.org/T393154#10823214 (gkyziridis) == Edit-Check Docker on ML-Lab2 Building container with older version of pytorch and rocm. Steps for reproduce: # Crea...
[20:20:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[20:20:54] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[20:20:54] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[21:55:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[21:55:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[21:55:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[21:57:43] FIRING: LiftWingServiceErrorRate: ...
[21:57:43] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=srwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[22:27:43] RESOLVED: LiftWingServiceErrorRate: ...
[22:27:43] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=srwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate