[08:42:02] Machine-Learning-Team, Essential-Work, Patch-For-Review: MI300 machines need startup tweaks - https://phabricator.wikimedia.org/T420507#11803249 (DPogorzelski-WMF)
[08:42:33] Machine-Learning-Team, Essential-Work, Patch-For-Review: MI300 machines need startup tweaks - https://phabricator.wikimedia.org/T420507#11803250 (DPogorzelski-WMF) Open→In progress
[09:14:44] FIRING: LiftWingServiceErrorRate: ...
[09:14:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=ruwiki-goodfaith-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:31:48] Lift-Wing, Machine-Learning-Team (Q4 FY2025-26): Update kserve Python package to 0.17 across all inference services - https://phabricator.wikimedia.org/T422591#11803498 (achou)
[09:50:50] FIRING: ORESFetchScoreJobKafkaLag: Kafka consumer lag for ORESFetchScoreJob over threshold for past 1h. ...
[09:50:50] - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#Kafka_Consumer_lag_-_ORESFetchScoreJobKafkaLag_alert - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&orgId=1&to=now&var-cluster=main-eqiad&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DORESFetchScoreJobKafkaLag
[10:04:44] RESOLVED: LiftWingServiceErrorRate: ...
[10:04:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=ruwiki-goodfaith-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:05:50] RESOLVED: ORESFetchScoreJobKafkaLag: Kafka consumer lag for ORESFetchScoreJob over threshold for past 1h. ...
[10:05:50] - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#Kafka_Consumer_lag_-_ORESFetchScoreJobKafkaLag_alert - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&orgId=1&to=now&var-cluster=main-eqiad&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DORESFetchScoreJobKafkaLag
[10:12:16] Machine-Learning-Team, ORES: Help migrate WikiLoop - https://phabricator.wikimedia.org/T342959#11803620 (Aklapper) a:achou→None @achou: No reply, unassigning.
[10:12:33] Machine-Learning-Team, Goal: Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153#11803622 (Aklapper) Open→Resolved @achou: No reply; closing.
[11:29:40] Machine-Learning-Team, Research: AI/ML Model Request: Text-to-Speech - https://phabricator.wikimedia.org/T419288#11803859 (isarantopoulos) Just jotting some requirements down to make sure we're all aligned. ===== Model Requirements * Natural, expressive, human-like speech with conversational flow * M...
[11:55:42] (PS1) AikoChou: revise-tone-task-generator: upgrade to kserve 0.17 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1269418 (https://phabricator.wikimedia.org/T422797)
[12:00:45] Lift-Wing, Machine-Learning-Team (Q4 FY2025-26): Update kserve Python package to 0.17 across all inference services - https://phabricator.wikimedia.org/T422591#11803963 (achou)
[14:32:04] (PS1) AikoChou: edit-check: upgrade to kserve 0.17 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1269480 (https://phabricator.wikimedia.org/T422812)
[14:53:30] (PS1) Gkyziridis: EnableSwuaggerUI: PoC for using swagger ui on edit-check model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1269491
[14:54:53] (PS2) Gkyziridis: EnableSwuaggerUI: PoC for using swagger ui on edit-check model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1269491 (https://phabricator.wikimedia.org/T332602)
[15:00:05] (CR) CI reject: [V:-1] EnableSwuaggerUI: PoC for using swagger ui on edit-check model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1269491 (https://phabricator.wikimedia.org/T332602) (owner: Gkyziridis)
[15:05:29] (PS3) Gkyziridis: EnableSwuaggerUI: PoC for using swagger ui on edit-check model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1269491 (https://phabricator.wikimedia.org/T332602)
[15:09:49] (CR) CI reject: [V:-1] EnableSwuaggerUI: PoC for using swagger ui on edit-check model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1269491 (https://phabricator.wikimedia.org/T332602) (owner: Gkyziridis)
[15:18:42] Lift-Wing, Machine-Learning-Team (Q4 FY2025-26): Investigate enabling gRPC in LiftWing model servers - https://phabricator.wikimedia.org/T421903#11805086 (elukey) Hey! Adding a few notes/thoughts: >>! In T421903#11778065, @klausman wrote: > For LW services making outgoing gRPC requests, the detail...
[15:25:37] (PS4) Gkyziridis: EnableSwuaggerUI: PoC for using swagger ui on edit-check model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1269491 (https://phabricator.wikimedia.org/T332602)
[16:00:04] (CR) Ilias Sarantopoulos: "I think this is a wrong approach. It would be best to do things incrementally and evaluate if the next step is actually needed as written " [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1269491 (https://phabricator.wikimedia.org/T332602) (owner: Gkyziridis)