[01:09:44] RESOLVED: LiftWingServiceErrorRate: ...
[01:09:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[01:46:44] FIRING: LiftWingServiceErrorRate: ...
[01:46:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[05:04:48] hello!
[05:14:00] I'm seeing a lot of mwapi-related errors in itwiki-damaging. Here is an excerpt from the logs: https://phabricator.wikimedia.org/P78662
[05:18:28] Machine-Learning-Team: Build model training pipeline using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#10940689 (kevinbazira) >>! In T396495#10934785, @gkyziridis wrote: > Please check in the [[ https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines | ml-pipelines ]] reposit...
[05:45:44] good morning!
[06:02:12] o/ isaranto: Looking at our code, it seems our default timeout is currently set to 5s, maybe we should increase it slightly?
We could also add some re-try logic for the requests to harden it a little
[06:03:35] Multiple of those errors also occurred at the same timestamp (`05:07:52`) so maybe it's also some concurrency issue 🤔
[06:06:59] +1 on retrying (I don't recall if we do or not at the moment) but I'm not sure about increasing the timeout > 5s as it is already quite high
[06:10:45] * isaranto afk for 1h- physio appt
[06:14:33] I don't see any re-try logic in the code atm so adding a backoff mechanism should already be a nice improvement
[06:53:45] Good morning.
[07:08:31] Moin Moin
[07:18:53] back!
[07:55:18] Machine-Learning-Team, Goal: Q4 24-25 Goal: Productionize tone check model - https://phabricator.wikimedia.org/T391940#10940981 (achou) Update: - Working with Diego on methodology for analyzing peacock language detection (tone check) models in languages without enough evaluation data. The methodology i...
[08:00:44] FIRING: LiftWingServiceErrorRate: ...
[08:00:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:10:44] RESOLVED: LiftWingServiceErrorRate: ...
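(Editor's note) The retry-with-backoff idea raised between 06:02 and 06:14 could be sketched roughly as follows. This is a minimal illustration, not the inference-services code: `TransientError`, `call_with_backoff`, and all parameter names and defaults are hypothetical, and the sketch is synchronous for brevity even though the real service may be async. The random jitter is there because several errors were reported at the same timestamp, so spreading retries out avoids hitting the upstream API in lockstep again.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a timeout or 5xx response from the upstream API."""


def call_with_backoff(fetch, max_attempts=3, base_delay=0.5, max_delay=4.0,
                      sleep=time.sleep):
    """Call fetch(), retrying on TransientError with exponential backoff.

    The delay doubles after each failed attempt (capped at max_delay) and gets
    up to 50% random jitter added. The last failure is re-raised unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

Keeping the per-request timeout at 5s and retrying a small, bounded number of times matches the preference voiced at 06:06:59; the worst-case total wait still grows with `max_attempts`, so that value should stay low.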
[08:10:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:33:22] Machine-Learning-Team, Add-Link, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10941073 (OKarakaya-WMF) So far, 33/34 is above the threshold. only ttwiki is below: ` threshold,N,micro_precision,micro_recall,wiki...
[08:43:44] FIRING: LiftWingServiceErrorRate: ...
[08:43:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:04:15] klausman_: re: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1162985 we should note that moving to s3 will mostly help with faster downloads, but extracting models and running the download for each pod won't change. So we surely need the timeout fix :/
[09:04:59] yeah, but we might as well do the s3 thing on staging, see how much faster it is and adjust the numbers in my change accordingly
[09:05:09] sure
[09:05:28] I've got a review for my patch, I'll fix some bits.
[09:05:35] I agree it's likely still a necessary change. I also have some thoughts about maybe doing the unpacking in parallel, to speed it up.
[09:06:14] Good idea, let me take a look at that as well.
Feel free to review :)
[09:07:00] Let's get S3 working first, then speed it up later ;)
[09:08:44] RESOLVED: LiftWingServiceErrorRate: ...
[09:08:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:39:44] FIRING: LiftWingServiceErrorRate: ...
[09:39:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:52:12] Machine-Learning-Team, Moderator-Tools-Team: AI/ML Infrastructure Request: Persist historical revert risk multilingual model scores for threshold analysis - https://phabricator.wikimedia.org/T397187#10941368 (DMburugu) >Can you please share more about what specific modules you'll be testing that will use...
[10:30:11] Machine-Learning-Team, Moderator-Tools-Team: AI/ML Infrastructure Request: Persist historical revert risk multilingual model scores for threshold analysis - https://phabricator.wikimedia.org/T397187#10941647 (BTullis) > We would like to store historical revert risk multilingual model scores somewhere so...
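(Editor's note) The suggestion at 09:05:35 to unpack models in parallel could look something like the sketch below. It assumes, purely for illustration, that each model ships as a trusted tar archive and that extraction is I/O-bound enough for threads to help; neither is confirmed in the log. `extract_one`, `extract_all`, and `max_workers` are hypothetical names, not part of the deployment charts or the storage initializer.

```python
import tarfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_one(archive: Path, dest_root: Path) -> Path:
    """Unpack one archive into its own directory under dest_root."""
    # "model_a.tar.gz" -> "<dest_root>/model_a"
    target = dest_root / archive.name.split(".")[0]
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        # Assumption: archives come from a trusted internal store.
        tar.extractall(target)
    return target


def extract_all(archives, dest_root, max_workers=4):
    """Unpack several archives concurrently; results keep the input order."""
    dest_root = Path(dest_root)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda a: extract_one(Path(a), dest_root), archives))
```

Whether this actually shortens pod startup would need measuring on staging, in line with the "get S3 working first, then speed it up later" plan.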
[11:27:27] Machine-Learning-Team: Build model training pipeline using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#10941826 (kevinbazira) In T396495#10903362, I run an initial [ml training pipeline stub](https://gitlab.wikimedia.org/kevinbazira/airflow-dags/-/blob/ce057c846a4c3999b9358ee75f773030...
[11:28:20] o/
[11:28:20] the example ml training pipeline that separates job logic from scheduling logic now runs end-to-end on k8s in an airflow-devenv: https://phabricator.wikimedia.org/T396495#10941826
[11:29:09] kevinbazira: Woo-hoo! Good stuff.
[11:46:53] awesome 🎉!
[12:07:47] Machine-Learning-Team: Update editquality demo jupyter notebook - https://phabricator.wikimedia.org/T300730#10942075 (Aklapper) a:Simonmaignan→None @Simonmaignan Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22. Please assign this t...
[12:23:22] Machine-Learning-Team, Add-Link, Growth-Team: Establish process for periodically refreshing link recommendation models - https://phabricator.wikimedia.org/T327212#10942236 (Aklapper) a:kevinbazira→None @kevinbazira: Removing task assignee as this open task has been assigned for more than two...
[12:23:32] Machine-Learning-Team, Add-Link, Growth-Team, Chinese-Sites: Investigate `UnicodeEncodeError` thrown by Add-A-Link training pipeline for fywiki model - https://phabricator.wikimedia.org/T325521#10942238 (Aklapper) a:kevinbazira→None @kevinbazira: Removing task assignee as this open task h...
[12:24:13] Lift-Wing, Machine-Learning-Team: Investigate Explainer for Revert-Risk model - https://phabricator.wikimedia.org/T330131#10942264 (Aklapper) a:achou→None @achou: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22. Please assign t...
[12:24:28] Machine-Learning-Team, Data-Engineering, Event-Platform: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399#10942262 (Aklapper) a:achou→None @achou: Removing task assignee as this open task has been assigned...
[12:28:49] Machine-Learning-Team, Moderator-Tools-Team, PageTriage: Detection and flagging of articles that are AI/LLM-generated - https://phabricator.wikimedia.org/T330346#10942419 (Aklapper) a:calbon→None @calbon: Removing task assignee as this open task has been assigned for more than two years - See...
[12:29:01] artificial-intelligence, Machine-Learning-Team, Bad-Words-Detection-System, revscoring: Add language support for Esperanto (eo) - https://phabricator.wikimedia.org/T325577#10942422 (Aklapper) a:calbon→None @calbon: Removing task assignee as this open task has been assigned for more than t...
[12:29:09] artificial-intelligence, Machine-Learning-Team, Bad-Words-Detection-System, revscoring: Add language support for Serbo-Croatian - https://phabricator.wikimedia.org/T325483#10942441 (Aklapper) a:calbon→None @calbon: Removing task assignee as this open task has been assigned for more than t...
[12:29:21] artificial-intelligence, Machine-Learning-Team, Bad-Words-Detection-System, revscoring: Add language support for Cantonese (yue) - https://phabricator.wikimedia.org/T312776#10942447 (Aklapper) a:calbon→None @calbon: Removing task assignee as this open task has been assigned for more than...
[12:29:37] artificial-intelligence, Lift-Wing, Machine-Learning-Team, Documentation: Create a tutorial for deploying a model on toolforge - https://phabricator.wikimedia.org/T281317#10942456 (Aklapper) a:calbon→None @calbon: Removing task assignee as this open task has been assigned for more than tw...
[12:29:43] Machine-Learning-Team, Add-Link, Growth-Team, MoveComms-Support, Chinese-Sites: Support languages whose add-a-link models were not published - https://phabricator.wikimedia.org/T309263#10942452 (Aklapper) a:calbon→None @calbon: Removing task assignee as this open task has been assigne...
[12:33:46] Machine-Learning-Team, Add-Link, Growth-Team: Establish process for periodically refreshing link recommendation models - https://phabricator.wikimedia.org/T327212#10942561 (kevinbazira) Open→Declined Declining this task as @OKarakaya-WMF is working on new add-a-link model training in {T39...
[12:52:08] kevinbazira: nice work!!
[12:52:29] \o/
[13:27:24] Machine-Learning-Team: Simplify pre-commit hooks within inference-services repository. - https://phabricator.wikimedia.org/T393865#10942826 (BWojtowicz-WMF) All of the work that has been planned for this task has been completed and merged 🎉 Some major points: - Using only ruff instead of ruff+isort+black...
[13:29:47] Machine-Learning-Team: Simplify pre-commit hooks within inference-services repository. - https://phabricator.wikimedia.org/T393865#10942854 (isarantopoulos) Open→Resolved Awesome, nice work!
[14:00:57] Machine-Learning-Team, Add-Link, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10943003 (OKarakaya-WMF) We found that we lose redirects. This item is removed from the dataset because Novi-Sad does not exist in p...
[14:12:47] klausman: parallel download looks good. I'll submit the patch before dinner. I have yet to see how to deal with the s3cmd config.
[14:50:38] ^done
[15:03:26] isaranto, aiko o/ how are we doing with the tone check SLO? :)
[15:06:44] hi elukey ! I need to address your latest comments... I agree with the comment about adding only 2xx in the latency SLO.
[15:10:40] ooook
[15:10:48] I am asking since we have this week left
[15:11:03] otherwise we can create a new hypothesis for q1 etc..
[15:35:46] sorry for leaving it for the last minute. If you have time we can tackle it on Thursday/Friday
[15:46:18] np!
[15:52:28] whenever someone has time I'd like a review here (ores extension work) so we can deploy tomorrow, thanks! https://gerrit.wikimedia.org/r/c/1163405/
[18:02:56] Machine-Learning-Team, Add-Link, Growth-Team, Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10944214 (OKarakaya-WMF) I get high scores after the fix. Interestingly, this is higher than akhatun results though. I'll check furth...
[19:49:07] (PS3) AikoChou: edit-check: add metadata to model input base on env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1161574 (https://phabricator.wikimedia.org/T397013)