[00:45:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[00:45:04] Deployment embeddings-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=embeddings-predictor-00006-deployment - ...
[00:45:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[02:24:14] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:45:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[04:45:04] Deployment embeddings-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=embeddings-predictor-00006-deployment - ...
[04:45:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:26:40] looking ...
[06:24:14] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:45:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[08:45:04] Deployment embeddings-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=embeddings-predictor-00006-deployment - ...
[08:45:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:14:26] hello from here as well!
[09:23:41] Machine-Learning-Team, Product Safety and Integrity: Deploy CoPE-A on LiftWing - https://phabricator.wikimedia.org/T418832#11680682 (BWojtowicz-WMF) **Update on quantization experiments** I attempted to quantize CoPE-A-9B to fit within the 16 GB VRAM available on our partitioned MI300X GPUs on LiftWing....
[10:16:57] isaranto: wow hello! Welcome back!
[10:24:14] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:08:30] Machine-Learning-Team, Prod-Kubernetes, Kubernetes: Upgrade ML clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414485#11681061 (MLechvien-WMF) Checking in on this as we approach the end of the quarter, @DPogorzelski-WMF do you have an overall ETA for this?
[11:25:59] o/ elukey , great to be back!
[11:33:30] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review - https://phabricator.wikimedia.org/T392833#11681186 (BWojtowicz-WMF) **Weekly Update** 1. The initial integration code adding Adapter for gRPC <-> HTTP communicati...
[11:35:00] Machine-Learning-Team: Edit Suggestions - Edit suggestion generation with loose edit types - https://phabricator.wikimedia.org/T418097#11681195 (OKarakaya-WMF)
[12:36:57] (PS1) Kgraessle: Expose the revert risk language agnostic prediction boolean via the RecentChanges API [extensions/ORES] - https://gerrit.wikimedia.org/r/1248799 (https://phabricator.wikimedia.org/T407552)
[12:45:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[12:45:04] Deployment embeddings-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=embeddings-predictor-00006-deployment - ...
[12:45:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:35:26] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Operational Excellence - LiftWing Platform Updates & Improvements - https://phabricator.wikimedia.org/T398948#11681662 (DPogorzelski-WMF)
[13:35:56] Machine-Learning-Team, Patch-For-Review: Fix revertrisk Pyrra SLO - https://phabricator.wikimedia.org/T419235#11681663 (Aklapper)
[14:20:13] Machine-Learning-Team: kserve helm status is broken across ml clusters - https://phabricator.wikimedia.org/T419040#11682160 (elukey) @DPogorzelski-WMF check https://github.com/kserve/kserve/pull/3890#discussion_r1734596750 So in theory removing caBundle entry from the CRD itself should fix the problem, but...
[14:24:14] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:04:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[15:04:49] Deployment embeddings-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=embeddings-predictor-00006-deployment - ...
[15:04:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:24:14] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[19:27:01] Machine-Learning-Team, CirrusSearch, Semantic Search, Discovery-Search (2026.02.02 - 2026.02.27): qwen3-embedding:predict returning 503 to all requests - https://phabricator.wikimedia.org/T419174#11683317 (EBernhardson) Open→Resolved
[20:22:53] Machine-Learning-Team, Research: AI/ML Model Request: **[Project title here]** - https://phabricator.wikimedia.org/T419287 (Sucheta-Salgaonkar-WMF) NEW
[20:25:15] Machine-Learning-Team, Research: AI/ML Model Request: Image auto-crop / focus point detection - https://phabricator.wikimedia.org/T419287#11683551 (Sucheta-Salgaonkar-WMF)
[20:27:43] Machine-Learning-Team, Research: AI/ML Model Request: Text-to-Speech - https://phabricator.wikimedia.org/T419288 (Sucheta-Salgaonkar-WMF) NEW
[20:28:15] Machine-Learning-Team, Research: AI/ML Model Request: Text-to-Speech - https://phabricator.wikimedia.org/T419288#11683565 (Sucheta-Salgaonkar-WMF) Shell ticket for @SherryYang-WMF to fill out
[20:28:37] Machine-Learning-Team, Research: AI/ML Model Request: Image auto-crop / focus point detection - https://phabricator.wikimedia.org/T419287#11683570 (Sucheta-Salgaonkar-WMF) Shell ticket for @SherryYang-WMF to fill out
[21:14:45] Lift-Wing, Tech-Docs-Team: Lift Wing API documentation standardization - https://phabricator.wikimedia.org/T406369#11683752 (TBurmeister) Status update: * Iterated on content, structure, and design of [[https://wikitech.wikimedia.org/wiki/User:TBurmeister_(WMF)/Sandbox/Machine_learning/API | Lift Wing AP...
[21:28:01] Lift-Wing, Tech-Docs-Team: Lift Wing API documentation standardization - https://phabricator.wikimedia.org/T406369#11683778 (TBurmeister)
[22:24:14] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent