[00:09:50] 06Machine-Learning-Team, 10ORES, 10AntiSpoof, 10BetaFeatures, and 3 others: Drop extensions from closed wikis where the database tables are unused - https://phabricator.wikimedia.org/T420052#11721355 (10Esanders) DiscussionTools also modified the appearance of pages, potentially making archives less readab...
[01:49:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[01:54:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:30:39] (03PS1) 10Kevin Bazira: policy-violation: remove fuse_rope_kvcache config from gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254719 (https://phabricator.wikimedia.org/T418350)
[06:31:58] 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Deploy gpt-oss-safeguard-20b on LiftWing - https://phabricator.wikimedia.org/T418350#11721700 (10kevinbazira)
[07:55:41] 07artificial-intelligence, 10Citoid: Citoid block needs information (supposedly Anubis, but single case to fix as blueprint) - https://phabricator.wikimedia.org/T420397#11721784 (10Mvolz) Our IP range is: 208.80.152.0/22 for IPv4 and 2620:0:860::/46 for IPv6 The exact user-agents are: Mozilla/5.0 (Macintosh;...
[07:57:28] 07artificial-intelligence, 10Citoid: Request to be added to Anubis good bot list - https://phabricator.wikimedia.org/T420397#11721785 (10Mvolz)
[08:12:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[08:12:49] Deployment gpt-oss-safeguard-20b-predictor-00007-deployment in experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[08:12:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=experimental&var-deployment=gpt-oss-safeguard-20b-predictor-00007-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:52:35] 06Machine-Learning-Team: Investigate how to enable the swagger UI for InferenceService resources - https://phabricator.wikimedia.org/T332602#11721859 (10isarantopoulos)
[09:01:12] (03PS1) 10Ilias Sarantopoulos: edit-check: Add support for KServe v2 inference protocol. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254844 (https://phabricator.wikimedia.org/T332602)
[09:01:24] (03PS1) 10Ilias Sarantopoulos: docker-compose: Enable Swagger UI docs for all KServe services. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254845 (https://phabricator.wikimedia.org/T332602)
[09:02:36] (03CR) 10CI reject: [V:04-1] docker-compose: Enable Swagger UI docs for all KServe services. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254845 (https://phabricator.wikimedia.org/T332602) (owner: 10Ilias Sarantopoulos)
[09:19:38] (03CR) 10Ozge: [C:03+1] policy-violation: remove fuse_rope_kvcache config from gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254719 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[09:22:01] (03CR) 10Kevin Bazira: [C:03+2] policy-violation: remove fuse_rope_kvcache config from gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254719 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[09:22:55] (03Merged) 10jenkins-bot: policy-violation: remove fuse_rope_kvcache config from gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254719 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[09:43:36] (03CR) 10Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254845 (https://phabricator.wikimedia.org/T332602) (owner: 10Ilias Sarantopoulos)
[09:57:59] 06Machine-Learning-Team, 10Liberica, 10Prod-Kubernetes, 06Traffic, 07Kubernetes: Migrate ML k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420438 (10JMeybohm) 03NEW
[10:02:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[10:02:49] Deployment gpt-oss-safeguard-20b-predictor-00007-deployment in experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[10:02:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=experimental&var-deployment=gpt-oss-safeguard-20b-predictor-00007-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:13:26] dpogorzelski, klausman - just to avoid this falling through the cracks - https://phabricator.wikimedia.org/T400626#11696619
[11:13:51] Roger!
[11:14:12] btw, I cold-restarted ml-serve2001 and it's back in service. It reported no DIMM errors or anything
[11:15:17] super thanks :)
[11:23:29] 06Machine-Learning-Team: Edit Suggestions - Edit suggestion generation with pre-defined edit types - https://phabricator.wikimedia.org/T418102#11722435 (10achou) **Experiment Plan** 1. Local Experiments - Use a smaller model. - Run on a curated set of articles (sampled across each pa_class and main_topic...
[12:41:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:57:06] (03PS1) 10Kevin Bazira: policy-violation: add configurable max_num_batched_tokens flag to gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254914 (https://phabricator.wikimedia.org/T418350)
[13:06:16] (03CR) 10Ozge: [C:03+1] policy-violation: add configurable max_num_batched_tokens flag to gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254914 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[13:06:22] 06Machine-Learning-Team, 10Prod-Kubernetes, 06ServiceOps new, 07Kubernetes: Upgrade ML clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414485#11722907 (10DPogorzelski-WMF) This is done @MLechvien-WMF
[13:11:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:28:48] (03CR) 10Kevin Bazira: [C:03+2] policy-violation: add configurable max_num_batched_tokens flag to gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254914 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[13:29:29] (03Merged) 10jenkins-bot: policy-violation: add configurable max_num_batched_tokens flag to gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254914 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[15:35:57] (03PS1) 10Kevin Bazira: policy-violation: enable concurrent request handling in gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254958 (https://phabricator.wikimedia.org/T418350)
[15:58:58] (03CR) 10Ozge: [C:03+1] policy-violation: enable concurrent request handling in gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254958 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[16:01:02] (03CR) 10Kevin Bazira: [C:03+2] policy-violation: enable concurrent request handling in gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254958 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[16:01:34] (03Merged) 10jenkins-bot: policy-violation: enable concurrent request handling in gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254958 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[16:11:57] 06Machine-Learning-Team, 10ORES, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Drop ORES tables from wikis without ORES - https://phabricator.wikimedia.org/T420093#11723864 (10Ahoelzl)
[16:51:49] FIRING: [5x] KubernetesDeploymentUnavailableReplicas: Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:52:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=llm&var-backend=knative-serving.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:05:45] o/ https://inference.svc.eqiad.wmnet:30443 is returning 503s
[17:06:16] dcausse: o/ any specific URI? edit check?
[17:06:36] elukey: it's https://inference.svc.eqiad.wmnet:30443/v1/models/qwen3-embedding:predict
[17:08:54] ah yes zero pods scheduled
[17:09:44] 0/15 nodes are available: 1 Insufficient amd.com/gpu, 12 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/15 nodes are available: 1 No preemption victims found for incoming pod, 14 Preemption is not helpful for scheduling.
[17:10:15] so this one goes on ml-serve1012 or 1013 afaics, with mi300x
[17:10:36] that were rebooted
[17:10:41] and they lost their partitioning config
[17:10:43] klausman: --^
[17:11:21] yeah, I just saw the alert
[17:13:47] I can't find the docs on what needs to be done to partition the GPUs again, and root's history has nothing obvious
[17:14:57] it depends what is the config that you folks choose for the partitions
[17:16:02] it seems that the isvc just wants a gpu, not a specific size
[17:17:07] So even an unpartitioned one should suffice, no?
[17:18:06] ok the gpu plugin was probably started before everything was ready: amdgpu driver unavailable: stat /sys/module/amdgpu/drivers/: no such file or directory
[17:18:11] I restarted it, now it looks better
[17:18:23] ah, classic, a race condition
[17:18:36] did you restart it on both 12 and 13?
[17:18:49] yeah
[17:18:56] thank you!
[17:21:49] FIRING: [5x] KubernetesDeploymentUnavailableReplicas: Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:23:00] ^^^ this has just recovered
[17:24:50] sorry I wrote in #sre without realizing
[17:24:54] EABITTIRED
[17:25:12] anyway, I was chatting with Tobias about the following:
[17:25:34] I think mi300x hosts need to have a way (in puppet?) to restore their configuration before they can accept traffic
[17:25:34] maybe a daemon or something that goes before the kubelet
[17:25:39] cc: dpogorzelski --^
[17:26:14] also, the team needs to decide how to partition those gpus
[17:26:23] at the moment they are all using the whole memory
[17:26:30] for each pod, that is a bit of a waste :(
[17:26:36] anyway, ttl :)
[17:26:49] RESOLVED: [5x] KubernetesDeploymentUnavailableReplicas: Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:27:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=llm&var-backend=knative-serving.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[18:04:49] 06Machine-Learning-Team: MI300 machines need startup tweaks - https://phabricator.wikimedia.org/T420507 (10klausman) 03NEW
[19:43:54] 06Machine-Learning-Team, 05Goal, 07OKR-Work: Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning - https://phabricator.wikimedia.org/T409863#11725220 (10ppelberg)
[20:26:38] 06Machine-Learning-Team, 10ORES, 10AntiSpoof, 10BetaFeatures, and 3 others: Drop extensions from closed wikis where the database tables are unused - https://phabricator.wikimedia.org/T420052#11725359 (10Dreamy_Jazz)
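Editor's note on the 17:18–17:25 exchange: the failure mode was the GPU device plugin starting before the amdgpu driver had registered in sysfs ("stat /sys/module/amdgpu/drivers/: no such file or directory"), and the proposed remedy was "a daemon or something that goes before the kubelet" (tracked as T420507). A minimal sketch of such a readiness gate is below; the function name, polling interval, and timeout are illustrative assumptions, not the team's actual fix.

```shell
#!/bin/sh
# Hypothetical readiness gate for MI300X hosts: poll until a sysfs
# directory (e.g. /sys/module/amdgpu/drivers) appears, so whatever
# launches the GPU device plugin / kubelet waits for the driver
# instead of racing it at boot.
wait_for_dir() {
    dir="$1"
    tries="${2:-60}"   # number of 1-second polls before giving up
    i=0
    while [ "$i" -lt "$tries" ]; do
        [ -d "$dir" ] && return 0
        i=$((i + 1))
        sleep 1
    done
    echo "timed out waiting for $dir" >&2
    return 1
}

# A pre-kubelet script on an mi300x host might then do:
#   wait_for_dir /sys/module/amdgpu/drivers 300 || exit 1
```

One way to hook this in would be a systemd oneshot unit ordered Before=kubelet.service, so the kubelet (and the device plugin it manages) only starts once the driver is loaded; restoring the GPU partitioning config could be handled in the same pre-kubelet step.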