[06:41:27] Machine-Learning-Team, OKR-Work: Add 'iommu=pt' kernel parameter on MI300x nodes for direct GPU-to-GPU communication (PCIe P2P) - https://phabricator.wikimedia.org/T421461 (kevinbazira) NEW
[07:59:21] Machine-Learning-Team, Patch-For-Review: Experiment with new kserve version on stagin - https://phabricator.wikimedia.org/T419722#11756999 (elukey) @DPogorzelski-WMF as you prefer, but the difference between istio/kserve and knative is not really huge and I would personally upgrade first rather than keep...
[08:37:58] Machine-Learning-Team, Prod-Kubernetes, ServiceOps new, Essential-Work, and 2 others: Update knative-serving+net-istio to v1.12.x on ML clusters - https://phabricator.wikimedia.org/T380723#11757049 (Gehel)
[08:38:00] Machine-Learning-Team, Prod-Kubernetes, ServiceOps new, Essential-Work, Kubernetes: Update kserve to v0.15.2* on ML clusters - https://phabricator.wikimedia.org/T380722#11757050 (Gehel)
[08:51:17] Machine-Learning-Team, Patch-For-Review: Experiment with new kserve version on ml-staging-codfw - https://phabricator.wikimedia.org/T419722#11757113 (elukey)
[09:48:46] Machine-Learning-Team, OKR-Work: Add 'iommu=pt' kernel parameter on MI300x nodes for direct GPU-to-GPU communication (PCIe P2P) - https://phabricator.wikimedia.org/T421461#11757284 (klausman) Adding @MoritzMuehlenhoff for security aspects.
[10:51:00] artificial-intelligence, MediaWiki-Page-editing: Automatic edit summary generation based on analyzing the change made - https://phabricator.wikimedia.org/T14411#11757719 (Nemoralis)
[12:01:38] Machine-Learning-Team, OKR-Work: Load test current state of the Article Topic service - https://phabricator.wikimedia.org/T420931#11757971 (isarantopoulos) Open→Resolved
[12:37:45] (CR) Bartosz Wójtowicz: "Ohh this will be super useful to enable!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1254845 (https://phabricator.wikimedia.org/T332602) (owner: Ilias Sarantopoulos)
[12:39:00] (CR) Bartosz Wójtowicz: "Making this unresolved" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1254845 (https://phabricator.wikimedia.org/T332602) (owner: Ilias Sarantopoulos)
[12:54:26] artificial-intelligence, MediaWiki-Page-editing: Automatic edit summary generation based on analyzing the change made - https://phabricator.wikimedia.org/T14411#11758124 (Bugreporter2) Also note that a reasonable alternative here is {T15937} as that would mean bots could provide the summaries when they a...
[13:22:42] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review - https://phabricator.wikimedia.org/T392833#11758236 (BWojtowicz-WMF) **Weekly Update** 1. We performed load test on the current production service to explore laten...
[14:38:17] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Build and Publish ROCm-Compatible Python Packages - https://phabricator.wikimedia.org/T381859#11758537 (isarantopoulos) Open→Declined We don't need this anymore since the teams has gone ahead by building and pushing the amd vllm docker i...
[14:39:42] Machine-Learning-Team: Airflow training pipeline - https://phabricator.wikimedia.org/T363554#11758544 (isarantopoulos) Open→Resolved a:isarantopoulos This task has been tackled as part of the retraining pipeline for tone check in {T398970}
[14:42:28] Machine-Learning-Team, observability: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390#11758552 (isarantopoulos) Open→Resolved a:isarantopoulos Since a production rollout to Sloth has been decided in {T404171#11576562} I'm resolving this task.
[14:45:28] Machine-Learning-Team, Wikipedia-Android-App-Backlog: Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123#11758562 (isarantopoulos) Open→Resolved a:isarantopoulos This task was done and deployed on LiftWing.
[14:50:49] Machine-Learning-Team, Essential-Work: Investigate reference-need-predictor alert triggered by BrokenProcessPool error - https://phabricator.wikimedia.org/T399733#11758583 (isarantopoulos) Open→Resolved a:isarantopoulos This task has not received any activity for quite a while. I will mak...
[14:55:25] Lift-Wing, Machine-Learning-Team, Wikimedia Enterprise, Patch-For-Review: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#11758604 (isarantopoulos) Open→Resolved This task has not received any activity for quite a while. I will ma...
[14:56:50] Machine-Learning-Team, Essential-Work: Investigate reference-need-predictor alert triggered by reverse proxying request error - https://phabricator.wikimedia.org/T399936#11758611 (isarantopoulos) Open→Resolved a:isarantopoulos This task has not received any activity for quite a while. Sin...
[15:51:01] Hey folks! Can I get a volunteer from the ml- team who I can yank into a slack chat about a security issue? aiko preferred if you're online :)
[17:22:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:22:49] Deployment gpt-oss-safeguard-20b-predictor-00002-deployment in experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[17:22:54] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=experimental&var-deployment=gpt-oss-safeguard-20b-predictor-00002-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:27:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[17:27:49] Deployment gpt-oss-safeguard-20b-predictor-00002-deployment in experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[17:27:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=experimental&var-deployment=gpt-oss-safeguard-20b-predictor-00002-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:35:32] (PS2) Ilias Sarantopoulos: docker-compose: Enable Swagger UI docs for all KServe services. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1254845 (https://phabricator.wikimedia.org/T332602)
[18:21:46] andrewbogott: o/
[18:21:51] actually I'm sending on slack