[07:11:43] FIRING: LiftWingServiceErrorRate: ...
[07:11:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:51:44] RESOLVED: LiftWingServiceErrorRate: ...
[08:51:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:30:59] artificial-intelligence, Citoid: Request to be added to Anubis good bot list - https://phabricator.wikimedia.org/T420397#11748180 (Elya) Update: the techies at the Hamburg Museum added exactly the stuff you commited upstream, and it works - thanks for the heads up! I hope it will be inherited by other l...
[11:05:42] artificial-intelligence, Citoid: Request to be added to Anubis good bot list - https://phabricator.wikimedia.org/T420397#11748271 (Mvolz) Open→Resolved
[11:05:55] artificial-intelligence, Citoid: Request to be added to Anubis good bot list - https://phabricator.wikimedia.org/T420397#11748272 (Mvolz) >>! In T420397#11748180, @Elya wrote: > Update: the techies at the Hamburg Museum added exactly the stuff you commited upstream, and it works - thanks for the hea...
[12:06:19] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban), and 2 others: Enable revert risk filters for first batch of wikis: < 1000 monthly edits - https://phabricator.wikimedia.org/T411485#11748509 (Kgraessle) Stalled→Open
[13:08:13] Machine-Learning-Team, ORES, Data-Engineering, Data-Engineering-Radar, and 2 others: Drop ORES tables from wikis without ORES - https://phabricator.wikimedia.org/T420093#11748765 (Marostegui) a:Marostegui We should rename these tables first before dropping.
[14:16:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[14:16:49] Deployment gpt-oss-safeguard-20b-predictor-00002-deployment in experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[14:16:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=experimental&var-deployment=gpt-oss-safeguard-20b-predictor-00002-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:06:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[16:06:49] Deployment gpt-oss-safeguard-20b-predictor-00002-deployment in experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[16:06:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=experimental&var-deployment=gpt-oss-safeguard-20b-predictor-00002-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:30:41] Machine-Learning-Team, OKR-Work, Patch-For-Review: Enable EmptyDir (/dev/shm) support for KServe InferenceServices to unblock NCCL-based tensor parallelism - https://phabricator.wikimedia.org/T421105#11749841 (kevinbazira) Thanks to @klausman, we: * [enabled kubernetes.podspec-volumes-emptydir in Kna...