[03:06:43] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[05:39:32] Machine-Learning-Team, Essential-Work: Orchestrate end-to-end tone-check pipeline using the TriggerDagRunOperator - https://phabricator.wikimedia.org/T406302#11301301 (kevinbazira) Open→Resolved Since using the `TriggerDagRunOperator` for cross-DAG orchestration requires one to always check t...
[06:29:52] o/ good morning
[07:06:43] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[07:14:11] good morning
[07:15:04] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11301432 (kevinbazira) Hi @Trokhymovych, just following up on our discussion from yesterday...
[07:36:56] good morning
[07:38:31] good morning. :)
[08:17:24] Machine-Learning-Team: Revertrisk multilingual fails locally when ran with docker compose - https://phabricator.wikimedia.org/T408068 (isarantopoulos) NEW
[08:17:46] Machine-Learning-Team: Revertrisk multilingual fails locally when ran with docker compose - https://phabricator.wikimedia.org/T408068#11301550 (isarantopoulos) a:BWojtowicz-WMF
[08:19:47] Machine-Learning-Team: Revertrisk multilingual fails locally when ran with docker compose - https://phabricator.wikimedia.org/T408068#11301572 (isarantopoulos) This issue was raised by @jsn.sherman on slack. The current model in production is https://analytics.wikimedia.org/published/wmf-ml-models/revertr...
[08:26:56] Machine-Learning-Team: Revertrisk multilingual fails locally when ran with docker compose - https://phabricator.wikimedia.org/T408068#11301601 (BWojtowicz-WMF) Looking into it! I can reproduce this issue on my machine. I’ve also confirmed that we luckily don’t encounter this issue on LiftWing, which is inter...
[08:27:08] Lift-Wing, Machine-Learning-Team: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter. - https://phabricator.wikimedia.org/T371021#11301603 (isarantopoulos) Let's also update the API GW documentation and then resolve this https://api.wikimedia.org/wiki/Lift_Wing_API/...
[08:38:16] (PS1) Bartosz Wójtowicz: revertrisk: Ensure new 'typing_extensions' version is used in docker. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1198282 (https://phabricator.wikimedia.org/T408068)
[08:40:07] Machine-Learning-Team, Patch-For-Review: Revertrisk multilingual fails locally when ran with docker compose - https://phabricator.wikimedia.org/T408068#11301679 (BWojtowicz-WMF) I think I found the culprit - the issue stems from our base docker image, which contains the old version of `typing_extensions`...
[08:47:59] (PS2) Bartosz Wójtowicz: revertrisk: Ensure new 'typing_extensions' version is used in docker. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1198282 (https://phabricator.wikimedia.org/T408068)
[09:01:28] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[09:06:28] RESOLVED: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[09:18:20] o/ klausman: are those alerts related to the previous changes we've been applying, or is this something new?
[09:18:39] Those were new changes that I pushed this morning
[09:19:09] For some sampling reason, there is always another firing alert just before the alert resolves
[09:27:27] Machine-Learning-Team, Patch-For-Review: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11301888 (isarantopoulos) Open→Resolved >>! In T403378#11161954, @gerritbot wrote: > Change #1186447 **merged** by jenkins-bot: > %%%[operations/deployment-charts...
[10:03:41] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11302007 (gkyziridis) I am pasting here some useful information and links based on our meet...
[10:11:11] Machine-Learning-Team, Patch-For-Review: Export retrained Tone-check model to an S3 bucket - https://phabricator.wikimedia.org/T406217#11302038 (gkyziridis) a:gkyziridis
[10:12:16] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11302053 (gkyziridis) a:gkyziridis
[10:49:01] klausman: o/ if those alarms get annoying it is ok to delay the checks to, say, once every X weeks or month, it is not that important to have admin_ng always up-to-date
[11:43:34] (PS3) Bartosz Wójtowicz: revertrisk: Add CPU version for revertrisk-multilingual. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1198282 (https://phabricator.wikimedia.org/T408068)
[11:56:31] elukey: agreed, I'll probably bump it to alert on 1wk out of date
[13:54:43] Machine-Learning-Team, Patch-For-Review: Revertrisk multilingual fails locally when ran with docker compose - https://phabricator.wikimedia.org/T408068#11303000 (jsn.sherman) Thanks for looking into this. I'm now seeing what looks like another dependency error: ` Attaching to revertrisk-multilingual-1 re...
[14:11:02] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11303076 (Eevans)
[14:12:27] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11303082 (Eevans)
[14:24:39] Machine-Learning-Team, Patch-For-Review: Revertrisk multilingual fails locally when ran with docker compose - https://phabricator.wikimedia.org/T408068#11303110 (BWojtowicz-WMF) @jsn.sherman Hmm this is very interesting, I could not reproduce it on my Mac machine yet. Can you share the exact commands tha...
[14:36:08] Machine-Learning-Team, Patch-For-Review: Revertrisk multilingual fails locally when ran with docker compose - https://phabricator.wikimedia.org/T408068#11303155 (jsn.sherman) Yep, I have the model in the same place as you. I get same output for revertrisk-multilingual-cpu: `sh $ docker compose --env-fil...
[14:39:50] Machine-Learning-Team, Patch-For-Review: Revertrisk multilingual fails locally when ran with docker compose - https://phabricator.wikimedia.org/T408068#11303171 (jsn.sherman) `sh $ docker compose build revertrisk-multilingual-cpu [+] Building 91.5s (26/26) FINISHED => [internal] load local bake definiti...
[15:03:48] Machine-Learning-Team, Cassandra, Goal, OKR-Work: Provision Cassandra + Data Gateway resources for Tone Check - https://phabricator.wikimedia.org/T408129 (Eevans) NEW
[15:03:52] Machine-Learning-Team, Cassandra, Goal, OKR-Work: Provision Cassandra + Data Gateway resources for Tone Check - https://phabricator.wikimedia.org/T408129#11303307 (Eevans) p:Triage→Medium
[16:30:33] Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11303725 (elukey) To keep archives happy: with rocm 7.0.2, I needed to add: ` elukey@ml-serve1012:/usr/lib/x86_64-linux-gnu$ ls -l libdrm_amdgpu.so lrwxrwxrwx 2 root root 24 Apr 1 2025 li...
[19:37:44] FIRING: LiftWingServiceErrorRate: ...
[19:37:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-multilingual-pre-save-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[19:42:44] RESOLVED: LiftWingServiceErrorRate: ...
[19:42:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-multilingual-pre-save-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[20:47:10] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikipedia-Android-App-Backlog, and 2 others: Deploy Revert Risk (language agnostic) filter to all Wikipedias - https://phabricator.wikimedia.org/T348298#11304554 (ldelench_wmf)