[00:01:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[00:01:49] Deployment reference-need-predictor-00005-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00005-deployment - ...
[00:01:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[00:06:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[00:06:49] Deployment reference-need-predictor-00005-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00005-deployment - ...
[00:06:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:23:57] (PS2) Kevin Bazira: article-country: add support for wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125661 (https://phabricator.wikimedia.org/T385970)
[05:25:19] (CR) CI reject: [V:-1] article-country: add support for wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125661 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[05:28:14] (PS3) Kevin Bazira: article-country: add support for wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125661 (https://phabricator.wikimedia.org/T385970)
[05:32:20] (CR) Kevin Bazira: article-country: add support for wikilink-related predictions (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125661 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[08:17:32] hello wonderful ppl!
[08:23:06] o/
[08:23:21] I noticed https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1126070, wondering how batch works in this case
[08:23:48] \ο
[08:25:52] morning folks!
[08:26:02] batch in this case doesn't mean many requests at once. The model scores sentences, so batching is applied at the sentence level. This is why predict for this service is so intensive, as it scores many sentences. So bigger articles > higher latency
[08:32:58] ahh right, makes more sense, thanks!
[08:33:03] hope it alleviates the issue
[08:40:54] I hope too!
[08:56:02] Machine-Learning-Team, Wikilabels: Admin interface for WPX UI - https://phabricator.wikimedia.org/T120902#10627485 (Aklapper)
[09:10:10] (CR) AikoChou: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125661 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[09:12:51] (CR) Kevin Bazira: [C:+2] "thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125661 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[09:13:36] (Merged) jenkins-bot: article-country: add support for wikilink-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1125661 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[09:23:57] (PS8) Gkyziridis: inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100)
[09:25:57] (CR) Ilias Sarantopoulos: "Thanks for working on this! This is nice I added a couple of suggestions. Ping me if you need to discuss any of these further" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[09:59:11] Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10627694 (isarantopoulos) a:isarantopoulos
[10:30:10] (PS1) Ilias Sarantopoulos: reference-quality: update knowledge integrity [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126945 (https://phabricator.wikimedia.org/T387019)
[10:40:47] this change from knowledge integrity wasn't included in the latest image because the layer was cached https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/commit/5091fa220272f70359b0a547d5957f55448a1cfb
[10:41:10] whenever someone has time please review --^
[10:41:39] we should require tags to be included in KI to avoid this type of issue
[10:53:34] (CR) AikoChou: [C:+1] reference-quality: update knowledge integrity [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126945 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[10:59:44] FIRING: LiftWingServiceErrorRate: ...
[10:59:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-need-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:05:05] (CR) Ilias Sarantopoulos: [C:+2] reference-quality: update knowledge integrity [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126945 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[11:05:41] The above is because the alert expired
[11:05:50] (Merged) jenkins-bot: reference-quality: update knowledge integrity [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126945 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[11:15:56] Morning!
[11:20:25] Guten Tag o/
[11:36:59] apparently setting the wrong environment var name didn't help :P
[11:37:06] I was setting BATCH instead of BATCH_SIZE
[11:37:47] ah, classic. There is a German saying: "Kaum macht man es richtig, geht es schon" (Once you do it right, it suddenly works!)
[11:59:49] (PS9) Gkyziridis: inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100)
[12:00:31] (PS10) Gkyziridis: inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100)
[12:01:27] (PS11) Gkyziridis: inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100)
[12:23:46] (CR) Gkyziridis: inference-services: Develop loading peacock model logic. (9 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[12:28:45] (CR) Ilias Sarantopoulos: [C:+1] inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[13:56:16] sharing a virtual event happening today "AI in Production"! https://home.mlops.community/home/events/ai-in-production-2025
[14:29:00] Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10628769 (MunizaA) >>! In T387019#10624928, @isarantopoulos wrote: > Reference-need however which was the original problem still experiences high throttling. I understa...
[15:04:16] (CR) Gkyziridis: [C:+2] "Merging" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:07:20] (Merged) jenkins-bot: inference-services: Develop loading peacock model logic. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1126012 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:15:10] (PS1) Ilias Sarantopoulos: reference-quality: multiprocessing with process pool for preprocess/inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052
[15:15:21] (PS2) Ilias Sarantopoulos: reference-quality: multiprocessing with process pool for preprocess/inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052
[15:16:01] (CR) CI reject: [V:-1] reference-quality: multiprocessing with process pool for preprocess/inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052 (owner: Ilias Sarantopoulos)
[15:16:19] (PS3) Ilias Sarantopoulos: reference-quality: multiprocessing with process pool for preprocess/inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052
[15:17:01] (CR) CI reject: [V:-1] reference-quality: multiprocessing with process pool for preprocess/inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052 (owner: Ilias Sarantopoulos)
[15:17:19] (PS4) Ilias Sarantopoulos: reference-quality: multiprocessing with process pool for preprocess/inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052
[15:18:02] (CR) CI reject: [V:-1] reference-quality: multiprocessing with process pool for preprocess/inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052 (owner: Ilias Sarantopoulos)
[15:18:18] Machine-Learning-Team, collaboration-services, Discovery-Search (2025.03.01 - 2025.03.21), Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10629083 (Gehel) >>! In T379119#10619055, @Jelto wrote:...
[15:18:24] (PS5) Ilias Sarantopoulos: reference-quality: multiprocessing with process pool for preprocess/inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052
[15:18:53] thanks for sharing Aiko!
[15:19:04] (CR) CI reject: [V:-1] reference-quality: multiprocessing with process pool for preprocess/inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052 (owner: Ilias Sarantopoulos)
[15:19:47] (PS6) Ilias Sarantopoulos: reference-quality: multiprocessing with process pool for preprocess/inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052
[15:21:45] (PS7) Ilias Sarantopoulos: reference-quality: multiprocessing with process pool for inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052 (https://phabricator.wikimedia.org/T387019)
[15:28:10] (PS8) Ilias Sarantopoulos: reference-quality: multiprocessing with process pool for inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052 (https://phabricator.wikimedia.org/T387019)
[15:31:51] o/ whenever you get a minute, please review: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126963
[15:31:53] thanks!
[16:08:14] (PS9) Ilias Sarantopoulos: reference-quality: multiprocessing with process pool for inference [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1127052 (https://phabricator.wikimedia.org/T387019)
[17:21:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:21:49] Deployment reference-need-predictor-00005-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00005-deployment - ...
[17:21:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:43:28] going afk folks, more stuff tomorrow!
[18:37:40] Machine-Learning-Team, collaboration-services, Discovery-Search (2025.03.01 - 2025.03.21), Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10630050 (EBernhardson) The issue with cirrusdoc is tha...
[22:01:56] Machine-Learning-Team: Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10630933 (ppelberg)
[22:02:05] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10630935 (ppelberg)
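
Note on the batching discussion at 08:26 and the BATCH vs BATCH_SIZE mix-up at 11:36: the log does not show the service code, so the following is only a minimal sketch of the two ideas under stated assumptions. It assumes the batch size is read from the BATCH_SIZE environment variable with a fallback default (the default of 1 is an assumption), and that an article's sentences are scored in fixed-size chunks. The names read_batch_size, chunk, score_article, and score_batch are hypothetical and not taken from inference-services.

```python
import os
from typing import Callable, Iterator, List, Sequence


def read_batch_size(default: int = 1) -> int:
    # Only BATCH_SIZE is consulted. A differently named variable such as
    # BATCH is never read, so the default applies without any error.
    return int(os.environ.get("BATCH_SIZE", default))


def chunk(sentences: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    # Yield consecutive slices of at most `size` sentences.
    for start in range(0, len(sentences), size):
        yield sentences[start:start + size]


def score_article(
    sentences: Sequence[str],
    score_batch: Callable[[Sequence[str]], List[float]],
    batch_size: int,
) -> List[float]:
    # Batching happens per sentence within one request, not across requests:
    # a longer article means more batches and therefore more model calls.
    scores: List[float] = []
    for batch in chunk(sentences, batch_size):
        scores.extend(score_batch(batch))
    return scores
```

With this shape, exporting BATCH=16 changes nothing because only BATCH_SIZE is looked up, which would be consistent with the wrong variable name having had no effect. The number of model calls grows with the sentence count, matching the "bigger articles > higher latency" remark.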