[00:54:34] (PS1) Tim Starling: Use the new RecentChangesPurgeQuery hook [extensions/ORES] - https://gerrit.wikimedia.org/r/1182684 (https://phabricator.wikimedia.org/T403002)
[02:08:45] Machine-Learning-Team, MediaWiki-extensions-ORES, Data-Persistence, MediaWiki-Recent-changes, and 2 others: Index for RC filterable ORES scores - https://phabricator.wikimedia.org/T403003#11127159 (tstarling)
[06:22:33] o/
[07:00:15] good morning!
[07:01:45] morning morning o/
[07:01:45] whenever anyone gets a minute please review: https://gerrit.wikimedia.org/r/1182506
[07:01:45] thanks!
[07:12:29] good morning
[07:19:11] good morning.
[07:21:46] (PS3) Tim Starling: Use the new RecentChangesPurgeQuery hook [extensions/ORES] - https://gerrit.wikimedia.org/r/1182684 (https://phabricator.wikimedia.org/T403002)
[07:21:47] (CR) Tim Starling: "I added an integration test so that I wouldn't have to test it manually." [extensions/ORES] - https://gerrit.wikimedia.org/r/1182684 (https://phabricator.wikimedia.org/T403002) (owner: Tim Starling)
[07:47:16] kevinbazira: I approved the patch ^^.
[07:48:19] isaranto: o/ thanks for the review. going to deploy and test on prod!
[07:48:33] **staging
[07:52:13] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05): Create an analytics service user for the ML team - https://phabricator.wikimedia.org/T400902#11127532 (OKarakaya-WMF) >>! In T400902#11118248, @OKarakaya-WMF wrote: > hey @brouberol , > > I'm getting following errors. Could it be relat...
[07:55:09] Guten Morgen o/
[07:59:36] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05): Create an analytics service user for the ML team - https://phabricator.wikimedia.org/T400902#11127548 (brouberol) When you run `kerberos-run-command analytics-ml yarn logs -appOwner analytics-ml -applicationId application_1754906949114_4...
[07:59:46] Machine-Learning-Team, Growth-Team, Improve-Tone-Structured-Task, Goal, OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11127549 (achou) a:achou
[08:20:53] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05), Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11127612 (brouberol) The PVC does not exist in the `airflow-dev` namespace. It exists in...
[08:23:34] (PS13) Bartosz Wójtowicz: outlink-topic-model: Introduce caching mechanism. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1176448 (https://phabricator.wikimedia.org/T356256)
[08:23:37] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11127626 (kevinbazira) articlequality isvc running in staging: ` kevinbazira@deploy1003:~$ kube_env revscoring-articlequality ml-stagi...
[08:24:30] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11127629 (kevinbazira) articletopic isvc running in staging: ` kevinbazira@deploy1003:~$ kube_env revscoring-articletopic ml-staging-c...
[08:29:39] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11127647 (kevinbazira) draftquality isvc running in staging: ` kevinbazira@deploy1003:~$ kube_env revscoring-draftquality ml-staging-c...
[08:34:14] Machine-Learning-Team, Essential-Work: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11127661 (kevinbazira) drafttopic isvc running in staging: ` kevinbazira@deploy1003:~$ kube_env revscoring-drafttopic ml-staging-codfw kevinbazira@deploy100...
[08:40:04] Machine-Learning-Team, Essential-Work: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11127679 (kevinbazira) damaging isvc running in staging: ` kevinbazira@deploy1003:~$ kube_env revscoring-editquality-damaging ml-staging-codfw kevinbazira@d...
[08:44:28] Machine-Learning-Team, Essential-Work: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11127687 (kevinbazira) goodfaith isvc running in staging: ` kevinbazira@deploy1003:~$ kube_env revscoring-editquality-goodfaith ml-staging-codfw kevinbazira...
[08:49:12] Machine-Learning-Team, Essential-Work: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11127691 (kevinbazira) reverted isvc running in staging: ` kevinbazira@deploy1003:~$ kube_env revscoring-editquality-reverted ml-staging-codfw kevinbazira@d...
[09:08:51] all revscoring isvc deployments on staging are up and running. here is a patch for prod: https://gerrit.wikimedia.org/r/1182770
[09:08:51] please review whenever you get a minute. thanks!
[09:18:11] Lift-Wing, Machine-Learning-Team: [articletopic-outlink] fetch data from mwapi using revid instead of article title - https://phabricator.wikimedia.org/T371021#11127757 (achou) The core logic of this model relies on Wikidata IDs that correspond to wikilinks in Wikipedia articles. Therefore, it's actually...
[09:21:07] bartosz, isaranto: ---^ added some thoughts on this ticket related to what we discussed yesterday
[09:23:22] Will take a look, thanks Aiko !
[09:28:21] Thank you Aiko!
[09:45:21] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11127808 (kevinbazira) enwiki-damaging load test results are passing, but enwiki-goodfaith is failing because this isvc is not deploye...
[09:57:26] (CR) Ladsgroup: [C:+2] Use the new RecentChangesPurgeQuery hook [extensions/ORES] - https://gerrit.wikimedia.org/r/1182684 (https://phabricator.wikimedia.org/T403002) (owner: Tim Starling)
[10:29:11] Lift-Wing, Machine-Learning-Team: [articletopic-outlink] fetch data from mwapi using revid instead of article title - https://phabricator.wikimedia.org/T371021#11127900 (isarantopoulos) Thank you Aiko for the input! > if we want to use revid instead of article title I just want to highlight that we are...
[10:35:11] (Merged) jenkins-bot: Use the new RecentChangesPurgeQuery hook [extensions/ORES] - https://gerrit.wikimedia.org/r/1182684 (https://phabricator.wikimedia.org/T403002) (owner: Tim Starling)
[10:35:57] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11127905 (isarantopoulos) @kevinbazira it seems that our locust load test config doesn't match the deployed models on staging. After y...
[10:40:10] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade revscoring model servers from debian bullseye to bookworm - https://phabricator.wikimedia.org/T400350#11127918 (kevinbazira) @isarantopoulos, yep, after deploying in prod, I will open a task to fix the revscoring goodfaith locust load t...
[11:01:54] Machine-Learning-Team, Data-Persistence, Growth-Team, Improve-Tone-Structured-Task, OKR-Work: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11127977 (achou) Posting a problem that was raised by @Eevans for the idea of having...
[11:30:01] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05): Create an analytics service user for the ML team - https://phabricator.wikimedia.org/T400902#11128005 (OKarakaya-WMF) thanks @brouberol [Use_the_yarn_CLI](https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#Use...
[11:33:14] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05), Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11128015 (gkyziridis) So, things are going to work in parallel. Basically we need first t...
[11:56:39] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05), Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11128104 (brouberol) Ok, so if I create an equivalent PVC in the `airflow-dev` namespace...
[12:00:12] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05): Create an analytics service user for the ML team - https://phabricator.wikimedia.org/T400902#11128139 (brouberol) Nice, good to know!
[12:01:07] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05), Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11128142 (gkyziridis) I think yes! Otherwise I need to merge my branch in airflow-DAG mai...
[12:35:43] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05), Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11128272 (brouberol) > Question: if you create an equivalent PVC in the airflow-dev name...
[12:41:31] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05), Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11128277 (gkyziridis) Alright perfect. So, whenever you have time, implement a PVC with the sa...
[12:42:34] Hey folks!
[12:42:54] I noticed that you are experimenting with airflow and pushing models to Thanos Swift
[12:44:33] this is great, but I have a big concern related to the account/bucket used in swift, since IIUC you are using the same one for producing models and for pulling models for production pods
[12:45:17] this is surely handy, but if anything goes wrong on the airflow side and something is pushed to the wrong path, there is the potential of affecting production
[12:46:09] I would personally use a different account and bucket for training/producing new models, and a script to manually "promote" binaries from that bucket to the prod one when needed
[12:46:55] security-wise it is also more dangerous to allow anybody that can run an airflow dag for ML to modify binaries used by prod pods
[12:47:07] was this use case discussed?
[13:05:13] Another alternative would be to rely on Ceph/S3 provided by DPE SRE as the training scratch space, and push production-ready models to swift
[13:05:30] in any case, I think elukey makes a good point
[13:07:38] hello, I see 1 out of 314 wikis has failed in airflow https://airflow-ml.wikimedia.org/dags/add_a_link_pipeline/grid?tab=mapped_tasks&dag_run_id=manual__2025-08-27T14%3A28%3A34.089355%2B00%3A00&task_id=generate_anchor_dictionary I think airflow has failed to trigger it.
I've created two MRs https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/merge_requests/27
[13:07:38] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1640 It adds an option to skip the step if it is already complete for the wiki. This can help to re-run steps when needed rather than updating the shards list. I've also increased the retry count. We can always set it to false when we need re-training. Can you review when you have time? @kevinbazira
[13:08:22] I discussed this with George a while ago and I totally agree with that, using a different account/bucket in swift for training/producing new models
[13:09:06] we haven't yet reached the stage of pushing trained models to thanos swift, as we're currently experimenting with PVC. but it's a good time to start planning :)
[13:10:52] brouberol: that's an option!
[13:13:57] aiko: super thanks :)
[13:16:08] Machine-Learning-Team, Data-Platform-SRE (2025.08.16 - 2025.09.05), Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11128447 (brouberol) ` brouberol@deploy1003:~$ cat pvc.yaml apiVersion: v1 kind: Persiste...
[13:25:32] georgekyz: o/ when you have a moment let's chat about the tone check's SLO
[13:43:17] elukey: we have a meeting in 20 mins, do you want to have a chat right now ?
[14:01:43] sorry I was in a meeting, later on is fine on IRC!
[14:04:43] perfect
[14:05:07] My meeting will be finished in one hour
[14:54:01] oh I forgot to change the hdfs path. https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1641/diffs Can you take a look when you're available? @kevinbazira
[14:54:44] ack... looking!
[15:00:18] Machine-Learning-Team, Data-Persistence, Growth-Team, Improve-Tone-Structured-Task, OKR-Work: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11129025 (Eevans) >>! In T401021#11127977, @achou wrote: > Posting a problem tha...
[15:11:33] Machine-Learning-Team, Data-Persistence, Growth-Team, Improve-Tone-Structured-Task, OKR-Work: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11129075 (Ottomata) I like these ideas too. Q: could we generalize a bit for st...
[15:19:14] Machine-Learning-Team, Data-Persistence, Growth-Team, Improve-Tone-Structured-Task, OKR-Work: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11129112 (Eevans) >>! In T401021#11129075, @Ottomata wrote: > I like these ideas...
[15:20:28] Lift-Wing, Machine-Learning-Team, EditCheck, SRE-SLO, Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11129115 (gkyziridis) Hey @elukey. > And I see that Ilias deployed on staging and prod the same day, just earlier on:...
[15:23:31] elukey: I updated the SLO ticket: https://phabricator.wikimedia.org/T390706#11129115 with my thoughts. I am not sure if that info could help
[15:26:20] georgekyz: thanks for the update! To clarify, I am ok if the service takes more time to compute, the main thing that I'd ask is to review the current/proposed SLO target for latency because it seems too tight. In the end, the SLO that you choose needs to be ok for your team, and in the current discovery phase we are seeing the error budget always in the red zone
[15:26:39] so that would mean alerts in the long run after a little while, and the SLO breached
[15:29:10] Machine-Learning-Team, Data-Persistence, Growth-Team, Improve-Tone-Structured-Task, OKR-Work: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11129144 (Ottomata) I think that most requests for 'derived data storage' are re...
[16:29:22] Machine-Learning-Team, Data-Persistence, Growth-Team, Improve-Tone-Structured-Task, OKR-Work: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11129411 (Eevans) >>! In T401021#11129144, @Ottomata wrote: > I think that many...
[18:27:53] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11129770 (Ottomata) > the logic to clear the weighted tags for...
[19:02:50] Machine-Learning-Team, Data-Persistence, Growth-Team, Improve-Tone-Structured-Task, OKR-Work: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11129892 (Ottomata)
[19:02:55] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11129893 (Ottomata)
[20:08:44] Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11130007 (Ottomata) > I'm suggesting to use a composite key consisting of 4 primary keys: page_title, lang, model_version and threshold. This composite...
[20:59:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[20:59:49] Deployment wikidatawiki-itemquality-predictor-default-00023-deployment in revscoring-articlequality at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[20:59:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revscoring-articlequality&var-deployment=wikidatawiki-itemquality-predictor-default-00023-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[21:04:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[21:04:49] Deployment wikidatawiki-itemquality-predictor-default-00023-deployment in revscoring-articlequality at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[21:04:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revscoring-articlequality&var-deployment=wikidatawiki-itemquality-predictor-default-00023-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas