[00:09:50] 06Machine-Learning-Team, 10ORES, 10AntiSpoof, 10BetaFeatures, and 3 others: Drop extensions from closed wikis where the database tables are unused - https://phabricator.wikimedia.org/T420052#11721355 (10Esanders) DiscussionTools also modified the appearance of pages, potentially making archives less readab...
[01:49:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[01:54:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:30:39] (03PS1) 10Kevin Bazira: policy-violation: remove fuse_rope_kvcache config from gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254719 (https://phabricator.wikimedia.org/T418350)
[06:31:58] 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Deploy gpt-oss-safeguard-20b on LiftWing - https://phabricator.wikimedia.org/T418350#11721700 (10kevinbazira)
[07:55:41] 07artificial-intelligence, 10Citoid: Citoid block needs information (supposedly Anubis, but single case to fix as blueprint) - https://phabricator.wikimedia.org/T420397#11721784 (10Mvolz) Our IP range is: 208.80.152.0/22 for IPv4 and 2620:0:860::/46 for IPv6 The exact user-agents are: Mozilla/5.0 (Macintosh;...
[07:57:28] 07artificial-intelligence, 10Citoid: Request to be added to Anubis good bot list - https://phabricator.wikimedia.org/T420397#11721785 (10Mvolz)
[08:12:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[08:12:49] Deployment gpt-oss-safeguard-20b-predictor-00007-deployment in experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[08:12:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=experimental&var-deployment=gpt-oss-safeguard-20b-predictor-00007-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:52:35] 06Machine-Learning-Team: Investigate how to enable the swagger UI for InferenceService resources - https://phabricator.wikimedia.org/T332602#11721859 (10isarantopoulos)
[09:01:12] (03PS1) 10Ilias Sarantopoulos: edit-check: Add support for KServe v2 inference protocol. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254844 (https://phabricator.wikimedia.org/T332602)
[09:01:24] (03PS1) 10Ilias Sarantopoulos: docker-compose: Enable Swagger UI docs for all KServe services. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254845 (https://phabricator.wikimedia.org/T332602)
[09:02:36] (03CR) 10CI reject: [V:04-1] docker-compose: Enable Swagger UI docs for all KServe services. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254845 (https://phabricator.wikimedia.org/T332602) (owner: 10Ilias Sarantopoulos)
[09:19:38] (03CR) 10Ozge: [C:03+1] policy-violation: remove fuse_rope_kvcache config from gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254719 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[09:22:01] (03CR) 10Kevin Bazira: [C:03+2] policy-violation: remove fuse_rope_kvcache config from gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254719 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[09:22:55] (03Merged) 10jenkins-bot: policy-violation: remove fuse_rope_kvcache config from gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254719 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[09:43:36] (03CR) 10Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254845 (https://phabricator.wikimedia.org/T332602) (owner: 10Ilias Sarantopoulos)
[09:57:59] 06Machine-Learning-Team, 10Liberica, 10Prod-Kubernetes, 06Traffic, 07Kubernetes: Migrate ML k8s apiserver and services to IPIP - https://phabricator.wikimedia.org/T420438 (10JMeybohm) 03NEW
[10:02:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[10:02:49] Deployment gpt-oss-safeguard-20b-predictor-00007-deployment in experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[10:02:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=experimental&var-deployment=gpt-oss-safeguard-20b-predictor-00007-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:13:26] dpogorzelski, klausman - just to avoid this falling through the cracks - https://phabricator.wikimedia.org/T400626#11696619
[11:13:51] Roger!
[11:14:12] btw, I cold-restarted ml-serve2001 and it's back in service. It reported no DIMM errors or anything
[11:15:17] super thanks :)
[11:23:29] 06Machine-Learning-Team: Edit Suggestions - Edit suggestion generation with pre-defined edit types - https://phabricator.wikimedia.org/T418102#11722435 (10achou) **Experiment Plan** 1. Local Experiments - Use a smaller model. - Run on a curated set of articles (sampled across each pa_class and main_topic...
[12:41:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:57:06] (03PS1) 10Kevin Bazira: policy-violation: add configurable max_num_batched_tokens flag to gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254914 (https://phabricator.wikimedia.org/T418350)
[13:06:16] (03CR) 10Ozge: [C:03+1] policy-violation: add configurable max_num_batched_tokens flag to gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254914 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[13:06:22] 06Machine-Learning-Team, 10Prod-Kubernetes, 06ServiceOps new, 07Kubernetes: Upgrade ML clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414485#11722907 (10DPogorzelski-WMF) This is done @MLechvien-WMF
[13:11:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=edit-check&var-backend=edit-check-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:28:48] (03CR) 10Kevin Bazira: [C:03+2] policy-violation: add configurable max_num_batched_tokens flag to gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254914 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[13:29:29] (03Merged) 10jenkins-bot: policy-violation: add configurable max_num_batched_tokens flag to gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254914 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[15:35:57] (03PS1) 10Kevin Bazira: policy-violation: enable concurrent request handling in gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254958 (https://phabricator.wikimedia.org/T418350)
[15:58:58] (03CR) 10Ozge: [C:03+1] policy-violation: enable concurrent request handling in gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254958 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[16:01:02] (03CR) 10Kevin Bazira: [C:03+2] policy-violation: enable concurrent request handling in gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254958 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[16:01:34] (03Merged) 10jenkins-bot: policy-violation: enable concurrent request handling in gpt model-server [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1254958 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira)
[16:11:57] 06Machine-Learning-Team, 10ORES, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Drop ORES tables from wikis without ORES - https://phabricator.wikimedia.org/T420093#11723864 (10Ahoelzl)
[16:51:49] FIRING: [5x] KubernetesDeploymentUnavailableReplicas: Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:52:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=llm&var-backend=knative-serving.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:05:45] o/ https://inference.svc.eqiad.wmnet:30443 is returning 503s
[17:06:16] dcausse: o/ any specific URI? edit check?
[17:06:36] elukey: it's https://inference.svc.eqiad.wmnet:30443/v1/models/qwen3-embedding:predict
[17:08:54] ah yes zero pods scheduled
[17:09:44] 0/15 nodes are available: 1 Insufficient amd.com/gpu, 12 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/15 nodes are available: 1 No preemption victims found for incoming pod, 14 Preemption is not helpful for scheduling.
[17:10:15] so this one goes on ml-serve1012 or 1013 afaics, with mi300x
[17:10:36] that were rebooted
[17:10:41] and they lost their partitioning config
[17:10:43] klausman: --^
[17:11:21] yeah, I just saw the alert
[17:13:47] I can't find the docs on what needs to be done to partition the GPUs again, and root's history has nothing obvious
[17:14:57] it depends what is the config that you folks choose for the partitions
[17:16:02] it seems that the isvc just wants a gpu, not a specific size
[17:17:07] So even an unpartitioned one should suffice, no?
[17:18:06] ok the gpu plugin was probably started before everything was ready: amdgpu driver unavailable: stat /sys/module/amdgpu/drivers/: no such file or directory
[17:18:11] I restarted it, now it looks better
[17:18:23] ah, classic, a race condition
[17:18:36] did you restart it on both 12 and 13?
[17:18:49] yeah
[17:18:56] thank you!
[17:21:49] FIRING: [5x] KubernetesDeploymentUnavailableReplicas: Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:23:00] ^^^ this has just recovered
[17:24:50] sorry I wrote in #sre without realizing
[17:24:54] EABITTIRED
[17:25:12] anyway, I was chatting with Tobias about the following:
[17:25:34] I think mi300x hosts need to have a way (in puppet?) to restore their configuration before they can accept traffic
[17:25:34] maybe a daemon or something that goes before the kubelet
[17:25:39] cc: dpogorzelski --^
[17:26:14] also, the team needs to decide how to partition those gpus
[17:26:23] at the moment they are all using the whole memory
[17:26:30] for each pod, that is a bit of a waste :(
[17:26:36] anyway, ttl :)
[17:26:49] RESOLVED: [5x] KubernetesDeploymentUnavailableReplicas: Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:27:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=llm&var-backend=knative-serving.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[18:04:49] 06Machine-Learning-Team: MI300 machines need startup tweaks - https://phabricator.wikimedia.org/T420507 (10klausman) 03NEW
[19:43:54] 06Machine-Learning-Team, 05Goal, 07OKR-Work: Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning - https://phabricator.wikimedia.org/T409863#11725220 (10ppelberg)
[20:26:38] 06Machine-Learning-Team, 10ORES, 10AntiSpoof, 10BetaFeatures, and 3 others: Drop extensions from closed wikis where the database tables are unused - https://phabricator.wikimedia.org/T420052#11725359 (10Dreamy_Jazz)
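Editor's note on the 17:18–17:25 exchange: the failure mode was the GPU device plugin starting before the amdgpu driver had registered in sysfs ("stat /sys/module/amdgpu/drivers/: no such file or directory"), and the proposed remedy was "a daemon or something that goes before the kubelet" (tracked as T420507). A minimal sketch of such a readiness gate is below; the function name, polling interval, and timeout are illustrative assumptions, not the team's actual fix.

```shell
#!/bin/sh
# Hypothetical readiness gate for MI300X hosts: poll until a sysfs
# directory (e.g. /sys/module/amdgpu/drivers) appears, so whatever
# launches the GPU device plugin / kubelet waits for the driver
# instead of racing it at boot.
wait_for_dir() {
    dir="$1"
    tries="${2:-60}"   # number of 1-second polls before giving up
    i=0
    while [ "$i" -lt "$tries" ]; do
        [ -d "$dir" ] && return 0
        i=$((i + 1))
        sleep 1
    done
    echo "timed out waiting for $dir" >&2
    return 1
}

# A pre-kubelet script on an mi300x host might then do:
#   wait_for_dir /sys/module/amdgpu/drivers 300 || exit 1
```

One way to hook this in would be a systemd oneshot unit ordered Before=kubelet.service, so the kubelet (and the device plugin it manages) only starts once the driver is loaded; restoring the GPU partitioning config could be handled in the same pre-kubelet step.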