[02:24:13] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[06:24:13] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:03:58] Machine-Learning-Team, Semantic Search, OKR-Work, Patch-For-Review: Migrate embeddings inference service from Transformers+FA2 to vLLM - https://phabricator.wikimedia.org/T418976#11676540 (kevinbazira) The new embeddings model-server that uses vLLM as the inference backend instead of transformers...
[08:54:31] Machine-Learning-Team, Semantic Search, OKR-Work: Migrate embeddings inference service from Transformers+FA2 to vLLM - https://phabricator.wikimedia.org/T418976#11676658 (OKarakaya-WMF) transformer test1 (light): ` (venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ MODEL=embedd...
[10:24:13] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[12:27:03] (PS1) Kevin Bazira: embeddings: install ROCm device libs to fix runtime compilation failure [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1248457 (https://phabricator.wikimedia.org/T418976)
[12:28:21] (CR) Ozge: [C:+2] embeddings: install ROCm device libs to fix runtime compilation failure [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1248457 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[12:28:55] (Merged) jenkins-bot: embeddings: install ROCm device libs to fix runtime compilation failure [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1248457 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[12:40:28] Machine-Learning-Team: kserve helm status is broken across ml clusters - https://phabricator.wikimedia.org/T419040#11677333 (DPogorzelski-WMF) I'll try a few fixes on the side on staging
[13:10:51] Machine-Learning-Team: kserve helm status is broken across ml clusters - https://phabricator.wikimedia.org/T419040#11677394 (DPogorzelski-WMF) what works once only: `kubectl delete crd inferenceservices.serving.kserve.io --cascade=true` `helmfile -e ml-staging-codfw sync` then the issue comes back. i will tr...
[13:34:03] (PS1) Kevin Bazira: embeddings: install ROCm libs to add missing dev headers required by AITER [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1248475 (https://phabricator.wikimedia.org/T418976)
[13:37:26] (CR) Ozge: [C:+2] embeddings: install ROCm libs to add missing dev headers required by AITER [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1248475 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[13:37:57] (Merged) jenkins-bot: embeddings: install ROCm libs to add missing dev headers required by AITER [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1248475 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[13:50:57] Machine-Learning-Team, Growth-Team, New-Engagement-Experiments, Research: [RFC] Personalized article recommendations for Newcomer Tasks using content-based filtering - https://phabricator.wikimedia.org/T418051#11677531 (Aditya_Pola)
[14:08:58] Machine-Learning-Team: kserve helm status is broken across ml clusters - https://phabricator.wikimedia.org/T419040#11677589 (DPogorzelski-WMF) Seems that doesn't matter how you handle it the result is the same. needs more investigation on the cert-manager side
[14:24:13] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[16:44:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[16:44:49] Deployment embeddings-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=embeddings-predictor-00006-deployment - ...
[16:44:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:48:25] Lift-Wing, Machine-Learning-Team, ORES, Documentation: Update ORES deprecation pages and de-duplicate content about moving from ORES to Lift Wing - https://phabricator.wikimedia.org/T419148 (TBurmeister) NEW
[18:24:13] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:45:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[20:45:04] Deployment embeddings-predictor-00006-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=embeddings-predictor-00006-deployment - ...
[20:45:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[21:58:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=llm&var-backend=knative-serving.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[22:00:05] Machine-Learning-Team, CirrusSearch, Semantic Search, Discovery-Search (2026.02.02 - 2026.02.27): qwen3-embedding:predict returning 503 to all requests - https://phabricator.wikimedia.org/T419174 (EBernhardson) NEW
[22:03:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=llm&var-backend=knative-serving.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[22:05:45] Machine-Learning-Team, CirrusSearch, Semantic Search, Discovery-Search (2026.02.02 - 2026.02.27): qwen3-embedding:predict returning 503 to all requests - https://phabricator.wikimedia.org/T419174#11679546 (EBernhardson) Potentially related: T418976. Not certainly, but that ticket involved change...
[22:24:14] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent