[01:52:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[01:52:49] Deployment gpt-oss-safeguard-20b-predictor-00002-deployment in experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[01:52:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=experimental&var-deployment=gpt-oss-safeguard-20b-predictor-00002-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:54:10] Machine-Learning-Team, OKR-Work: Load test current state of the Article Topic service - https://phabricator.wikimedia.org/T420931#11742010 (BWojtowicz-WMF) I'm sharing load test numbers tested against production deployment on eqiad using internal endpoint. I've made sure the responses return valid predic...
[08:21:44] Machine-Learning-Team, Ceph, Infrastructure-Foundations, SRE-swift-storage: Move the Docker Registry's /ml prefix to S3/apus - https://phabricator.wikimedia.org/T420978#11742117 (MatthewVernon)
[08:53:21] Machine-Learning-Team: Add Slack notifications for Prometheus Alertmanager for ml-team - https://phabricator.wikimedia.org/T421040 (isarantopoulos) NEW
[09:24:41] Machine-Learning-Team, Patch-For-Review: Add Slack notifications for Prometheus Alertmanager for ml-team - https://phabricator.wikimedia.org/T421040#11742313 (isarantopoulos) a:isarantopoulos
[09:36:33] Machine-Learning-Team, Essential-Work: Unify and improve load testing strategy for inference services - https://phabricator.wikimedia.org/T416475#11742354 (BWojtowicz-WMF) When investigating T420931, I found that my custom async load test script achieves >300 RPS against the same service with 5 replicas,...
[10:10:06] Machine-Learning-Team, Ceph, Infrastructure-Foundations, SRE-swift-storage: Move the Docker Registry's /ml prefix to S3/apus - https://phabricator.wikimedia.org/T420978#11742508 (elukey) p:Triage→Medium
[11:33:55] Machine-Learning-Team, OKR-Work: Load test current state of the Article Topic service - https://phabricator.wikimedia.org/T420931#11742845 (isarantopoulos) @BWojtowicz-WMF thanks for running these tests! Although results look great the [[ https://grafana.wikimedia.org/goto/afgz9egh3puyod?orgId=1 | grafan...
[11:41:04] FIRING: TestAlert: - lol - no - https://alerts.wikimedia.org/?q=alertname%3DTestAlert
[12:01:04] RESOLVED: TestAlert: - lol - no - https://alerts.wikimedia.org/?q=alertname%3DTestAlert
[12:25:11] Machine-Learning-Team: Edit Suggestions - Edit suggestion generation with pre-defined edit types - https://phabricator.wikimedia.org/T418102#11743021 (achou) Key observations from model outputs ([[ https://gitlab.wikimedia.org/repos/machine-learning/exploratory-notebook/-/blob/defined-edit-types/edit_suggest...
[13:06:36] Machine-Learning-Team, OKR-Work: Load test current state of the Article Topic service - https://phabricator.wikimedia.org/T420931#11743213 (BWojtowicz-WMF) @isarantopoulos I see the regime with >10s p99 latencies, however it happened during the night and not during running those tests. It seems to me t...
[13:19:44] Machine-Learning-Team, OKR-Work: Load test current state of the Article Topic service - https://phabricator.wikimedia.org/T420931#11743381 (isarantopoulos) Great, thanks for clarifying! @Seddon are the numbers reported above T420931#11742010 good for you in case you integrate directly with LiftWing?
[13:24:16] (PS1) Ilias Sarantopoulos: revertrisk-wikidata: add predictions to events stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1259966 (https://phabricator.wikimedia.org/T420883)
[14:02:56] Machine-Learning-Team, OKR-Work: Enable EmptyDir (/dev/shm) support for KServe InferenceServices to unblock NCCL-based tensor parallelism - https://phabricator.wikimedia.org/T421105 (kevinbazira) NEW
[14:09:24] Machine-Learning-Team: Edit Suggestions - Edit suggestion generation with loose edit types - https://phabricator.wikimedia.org/T418097#11744148 (OKarakaya-WMF) a:OKarakaya-WMF
[14:12:35] Machine-Learning-Team: Edit Suggestions - Edit suggestion generation with loose edit types - https://phabricator.wikimedia.org/T418097#11744163 (OKarakaya-WMF) Edit suggestions are generated and findings are shared in the [scratchpad](https://docs.google.com/document/d/19tOyArAzCrSbLIiOKRJFwFWc9E9VYTvaQSfEry...
[14:26:59] Machine-Learning-Team: Edit Suggestions - Eval with LLM-as-a-judge - https://phabricator.wikimedia.org/T421118 (OKarakaya-WMF) NEW
[14:27:28] Machine-Learning-Team: Edit Suggestions - Eval with LLM-as-a-judge - https://phabricator.wikimedia.org/T421118#11744253 (OKarakaya-WMF)
[15:08:55] Machine-Learning-Team, Data-Engineering, Data-Engineering-Radar, Event-Platform: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892#11744529 (achou) @gkyziridis quick follow-up: what's the current status of this task...
[16:18:53] Machine-Learning-Team, GrowthExperiments-NewcomerTasks, Revise-Tone-Structured-Task, Growth-Team (FY2025-26 Q3 Sprint 6), OKR-Work: Ensure Test Wikipedia has Revise tone tasks - https://phabricator.wikimedia.org/T416904#11744996 (Michael)
[16:25:17] Machine-Learning-Team, OKR-Work: Load test current state of the Article Topic service - https://phabricator.wikimedia.org/T420931#11745072 (Ottomata) > It seems the overhead of the additional query for getting outlinks linked to a specific revision_id is significant, especially under load test scenario...
[17:14:56] (PS1) Triciaburmeister: ores-legacy: update doc links [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1260048 (https://phabricator.wikimedia.org/T406369)
[17:27:50] Lift-Wing, Tech-Docs-Team, Patch-For-Review: Lift Wing API documentation standardization - https://phabricator.wikimedia.org/T406369#11745587 (TBurmeister) == Summary of Lift Wing API doc changes == * Designed and implemented new information architecture and subpage structure to support task- and au...
[17:34:49] Lift-Wing, Tech-Docs-Team, Patch-For-Review: Lift Wing API documentation standardization - https://phabricator.wikimedia.org/T406369#11745622 (TBurmeister) In progress→Resolved
[20:19:37] Machine-Learning-Team, GrowthExperiments-NewcomerTasks, Revise-Tone-Structured-Task, Growth-Team (FY2025-26 Q3 Sprint 6), OKR-Work: Ensure Test Wikipedia has Revise tone tasks - https://phabricator.wikimedia.org/T416904#11746512 (Etonkovidova) Open→Resolved