[06:34:28] 06Machine-Learning-Team, 10EditCheck, 10Editing-team (Tracking): Investigate `edit-check` returning empty responses - https://phabricator.wikimedia.org/T400606#11041810 (10BWojtowicz-WMF) Unfortunately, it seems that we won't be able to retrieve the exact timestamps nor the number of failed requests as they...
[06:51:17] good morning.
[07:15:43] good morning!
[07:15:44] 06Machine-Learning-Team, 10EditCheck, 10Editing-team (Tracking), 13Patch-For-Review: Investigate `edit-check` returning empty responses - https://phabricator.wikimedia.org/T400606#11041916 (10BWojtowicz-WMF) I've run a load test on the staging cluster with 10000 requests, each of them returned a proper non-emp...
[07:21:54] Would someone have a free second to take a look at the patch bumping max-replicas for edit-check on staging to 3? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1173796 I'm investigating an issue with edit-check returning blank responses and I want to check if it might be related to scale-ups/scale-downs, but I'd prefer to test it on staging instead of prod
[07:51:36] +1
[07:52:55] thank you!
[07:55:43] o/ good morning! :)
[08:40:39] 06Machine-Learning-Team, 07Essential-Work: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#11042069 (10elukey) Setting up a virtual environment for each stat10xx may be a little pain, we don't have a good way to do that with our deploymen...
[08:56:08] 06Machine-Learning-Team, 10EditCheck, 10Editing-team (Tracking): Investigate `edit-check` returning empty responses - https://phabricator.wikimedia.org/T400606#11042092 (10BWojtowicz-WMF) I've updated the staging deployment of `edit-check` to be able to autoscale up to 3 replicas. I've re-run the load-testin...
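The investigation above boils down to firing many requests at the service and counting how many come back blank. A minimal sketch of that kind of check, assuming a hypothetical endpoint URL and payload (this is not the team's actual load-testing tooling):

```python
import concurrent.futures
import json
import urllib.request

# Hypothetical staging endpoint; the real inference URL is not shown in the log.
EDIT_CHECK_URL = "https://inference-staging.example.org/v1/models/edit-check:predict"


def is_empty_response(body: str) -> bool:
    """Treat a response as 'empty' if it has no body, is not JSON,
    or decodes to an empty JSON value ({} or [])."""
    if not body.strip():
        return True
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return True
    return not payload


def run_load_test(n_requests: int = 10000, concurrency: int = 50) -> int:
    """Fire n_requests concurrently and return how many were empty."""
    def one_request(_):
        req = urllib.request.Request(
            EDIT_CHECK_URL,
            data=b'{"instances": []}',  # illustrative payload
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return is_empty_response(resp.read().decode())

    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sum(pool.map(one_request, range(n_requests)))
```

A run like `run_load_test(10000)` returning 0 would match the result reported on the task: every request on staging came back non-empty.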
[09:54:08] 06Machine-Learning-Team, 06Research: Score probability evaluation for languages without enough data - https://phabricator.wikimedia.org/T398930#11042258 (10achou)
[09:54:09] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages - https://phabricator.wikimedia.org/T400423#11042259 (10achou)
[09:54:11] 06Machine-Learning-Team, 05Goal: FY2024-25 Q4 Goal: Productionize tone check model - https://phabricator.wikimedia.org/T391940#11042260 (10achou)
[09:59:20] 06Machine-Learning-Team, 10Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11042272 (10achou)
[09:59:22] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Airflow training pipeline for Tone check model - https://phabricator.wikimedia.org/T398970#11042273 (10achou)
[14:10:28] hi folks!
[14:12:06] I am reviewing the WIP dashboard for the Edit Check's SLO (SRE is trying to make it right, our tooling is not there yet).
I noticed a sudden drop in the SLO error budget around the 24th, and indeed I see p50+ latencies above 1s most of the time:
[14:12:07] https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&from=now-7d&to=now&timezone=utc&var-cluster=aWotKxQMz&var-namespace=edit-check&var-backend=$__all&var-response_code=200&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&refresh=30s
[14:12:56] Ilias deployed something on the 24th related to edit check for all clusters https://sal.toolforge.org/production?p=0&q=edit-check&d=2025-07-24
[14:13:32] it is not a big problem at the moment, but in the bright future this use case would cause an alert to be fired for sure :)
[14:13:53] (the commitment so far for the SLO is to have 90% of the HTTP 200 responses returning in max 1s)
[14:32:18] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck, 10SRE-SLO, 10Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11043355 (10elukey) Today I tried to review the graphs in the Tone Check's latency SLO page, and this is what I found:...
[14:32:33] summarized it in https://phabricator.wikimedia.org/T390706#11043355 :)
[15:00:39] o/ elukey: Looking into the Grafana dashboard it seems there was almost no traffic before the 24th. I think it's possible that we've just started receiving more traffic around the 24th, as AFAIK we've started running some A/B tests with the edit check on various wikis around this time, which could explain the sudden drop in error budget
[15:02:09] bartosz: okok makes sense! Do we expect the latency to improve?
Otherwise we may need to review the SLO values :)
[15:03:05] Yeah, I think the latency numbers are still worrying; I'm not sure if we expect them to improve
[15:04:39] what's interesting is that the total traffic graph seems like it should be totally manageable, our load tests showed that each replica should handle ~15 requests per second with good latency
[15:16:31] it also seems that our deployment rarely scaled up in the last few days, maybe lowering the scale-up rps threshold a little (currently set at 15rps) could help deal with the increased load https://grafana.wikimedia.org/goto/nK_HG2wHg?orgId=1
[15:19:04] it is all good since it seems that the metrics are finally making some sense, there is no expectation to have everything up and running and consistent tomorrow
[15:19:13] but I am trying to follow up to improve our tooling :)
[15:22:57] I see, thank you for working on improving those!
[15:25:45] I'm still a little curious about the total traffic on the istio dashboard being constantly low when our replicas scaled up based on the 15rps target
[17:22:01] 06Machine-Learning-Team, 06collaboration-services, 06Wikipedia-iOS-App-Backlog: [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#11044214 (10Seddon)
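The SLO discussed in the conversation above (90% of HTTP 200 responses within 1s) maps directly onto an error-budget calculation, which explains why sustained p50 latencies above 1s burn the budget so quickly. A minimal sketch over a flat sample of per-request latencies, not the actual Prometheus-based SLO tooling:

```python
def slo_compliance(latencies_s, threshold_s=1.0):
    """Fraction of requests that completed within the latency threshold."""
    fast = sum(1 for t in latencies_s if t <= threshold_s)
    return fast / len(latencies_s)


def error_budget_remaining(latencies_s, slo_target=0.90, threshold_s=1.0):
    """Remaining error budget as a fraction of the allowed slow-request budget.

    With a 90% target, 10% of requests may exceed the threshold; the budget
    is consumed in proportion to the actual share of slow requests.
    Returns 1.0 when untouched, 0.0 when exactly spent, negative when overspent.
    """
    allowed_slow = 1.0 - slo_target
    actual_slow = 1.0 - slo_compliance(latencies_s, threshold_s)
    return 1.0 - actual_slow / allowed_slow
```

With the p50 above 1s (i.e. more than half the requests slow), `error_budget_remaining` goes deeply negative almost immediately, which matches the sudden drop seen on the dashboard.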
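The scale-up puzzle above (replicas rarely scaling up despite increased load) follows from the standard Kubernetes HPA formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A small sketch showing why lowering the 15 rps per-replica target would trigger earlier scale-ups; the traffic numbers are illustrative, not the namespace's real load:

```python
import math


def desired_replicas(current_replicas: int,
                     current_rps_per_replica: float,
                     target_rps_per_replica: float,
                     max_replicas: int = 3) -> int:
    """Kubernetes HPA scaling rule: ceil(current * metric / target),
    clamped between 1 and max_replicas (3 for edit-check on staging)."""
    desired = math.ceil(
        current_replicas * current_rps_per_replica / target_rps_per_replica
    )
    return max(1, min(desired, max_replicas))
```

For example, 12 rps on a single replica never scales up against a 15 rps target (ceil(12/15) = 1), but would add a replica against a 10 rps target (ceil(12/10) = 2), which is the effect lowering the threshold is meant to achieve.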