[05:56:31] Machine-Learning-Team, Goal, OKR-Work: Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning - https://phabricator.wikimedia.org/T409863#11712050 (Sucheta-Salgaonkar-WMF) @OKarakaya-WMF could you please add a weekly summary for me to pull into Asana?
[06:15:15] Machine-Learning-Team, OKR-Work, Patch-For-Review: Deploy gpt-oss-safeguard-20b on LiftWing - https://phabricator.wikimedia.org/T418350#11712057 (kevinbazira) >>! In T418350#11710631, @Alaexis wrote: >> Deploy the model-server in the LiftWing production namespace to provide an internal production...
[06:35:18] Machine-Learning-Team: Compare performance of KServe huggingfaceserver with HuggingFace vs vLLM backend - https://phabricator.wikimedia.org/T395019#11712061 (kevinbazira) In T418976#11705174, we migrated the embeddings isvc inference backend from HuggingFace Transformers to vLLM. The locust load test results...
[06:40:04] Machine-Learning-Team: Compare performance of KServe huggingfaceserver with HuggingFace vs vLLM backend - https://phabricator.wikimedia.org/T395019#11712063 (kevinbazira) Open→Resolved
[09:35:20] Machine-Learning-Team, Patch-For-Review: kserve helm status is broken across ml clusters - https://phabricator.wikimedia.org/T419040#11712494 (elukey) Deployed on ml-serve-eqiad: ` root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# helm3 -n kserve history kserve REVISION UPDATED...
[09:52:38] Machine-Learning-Team, Patch-For-Review: kserve helm status is broken across ml clusters - https://phabricator.wikimedia.org/T419040#11712659 (elukey) Open→Resolved a:elukey ` root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# kube-env admin ml-serve-codfw root@deploy2002:/srv/de...
[10:34:36] Machine-Learning-Team, OKR-Work: Deploy gpt-oss-safeguard-20b on LiftWing - https://phabricator.wikimedia.org/T418350#11712870 (kevinbazira) We have deployed the gpt-oss-safeguard-20b LLM in the prod experimental namespace to access the MI300x GPU. The isvc starts in the pod on eqiad as shown below: {P8...
[10:55:02] Machine-Learning-Team, Product Safety and Integrity: Deploy CoPE-A on LiftWing - https://phabricator.wikimedia.org/T418832#11712923 (BWojtowicz-WMF) The CoPE-A-9B model [[ https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1251272/ | is now deployed ]] on LiftWing. One can query the endpoi...
[11:49:18] Machine-Learning-Team, Product Safety and Integrity: Deploy CoPE-A on LiftWing - https://phabricator.wikimedia.org/T418832#11713162 (BWojtowicz-WMF) After deployment, CoPE-A-9B model server was successfully processing small requests of less than 500 input tokens. However, when testing the model server w...
[11:58:05] Machine-Learning-Team, Goal, OKR-Work: Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning - https://phabricator.wikimedia.org/T409863#11713183 (OKarakaya-WMF) Reporting format Progress update on the hypothesis for the week, including if something has shipped: - We have upda...
[17:03:26] Machine-Learning-Team, GrowthExperiments-NewcomerTasks, Revise-Tone-Structured-Task, Growth-Team (FY2025-26 Q3 Sprint 5), OKR-Work: Ensure Test Wikipedia has Revise tone tasks - https://phabricator.wikimedia.org/T416904#11714711 (Michael)
[17:13:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[18:18:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[22:30:38] (PS5) Thiemo Kreuz (WMDE): build: Updating composer dependencies [extensions/ORES] - https://gerrit.wikimedia.org/r/1246497 (owner: Libraryupgrader)
[22:34:45] (CR) CI reject: [V:-1] build: Updating composer dependencies [extensions/ORES] - https://gerrit.wikimedia.org/r/1246497 (owner: Libraryupgrader)
[22:35:39] (CR) Umherirrender: "recheck" [extensions/ORES] - https://gerrit.wikimedia.org/r/1246497 (owner: Libraryupgrader)
[22:49:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[22:54:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate