[06:34:28] 06Machine-Learning-Team, 10EditCheck, 10Editing-team (Tracking): Investigate `edit-check` returning empty responses - https://phabricator.wikimedia.org/T400606#11041810 (10BWojtowicz-WMF) Unfortunately, it seems that we won't be able to retrieve the exact timestamps nor the number of failed requests as they...
[06:51:17] good morning.
[07:15:43] good morning!
[07:15:44] 06Machine-Learning-Team, 10EditCheck, 10Editing-team (Tracking), 13Patch-For-Review: Investigate `edit-check` returning empty responses - https://phabricator.wikimedia.org/T400606#11041916 (10BWojtowicz-WMF) I've run a load test on the staging cluster with 10000 requests, each of them returned a proper non-emp...
[07:21:54] Would someone have a free second to take a look at the patch bumping max-replicas for edit-check on staging to 3? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1173796 I'm investigating an issue with edit-check returning blank responses and I want to check if it might be related to scale-ups/scale-downs, but I'd prefer to test it on staging instead of prod
[07:51:36] +1
[07:52:55] thank you!
[07:55:43] o/ good morning! :)
[08:40:39] 06Machine-Learning-Team, 07Essential-Work: Reimplement the model-upload script to take into consideration new use cases - https://phabricator.wikimedia.org/T394301#11042069 (10elukey) Setting up a virtual environment for each stat10xx may be a little pain, we don't have a good way to do that with our deploymen...
[08:56:08] 06Machine-Learning-Team, 10EditCheck, 10Editing-team (Tracking): Investigate `edit-check` returning empty responses - https://phabricator.wikimedia.org/T400606#11042092 (10BWojtowicz-WMF) I've updated the staging deployment of `edit-check` to be able to autoscale up to 3 replicas. I've re-run the load-testin...
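The investigation above boils down to firing many requests at the service and counting how many come back blank. A minimal sketch of that kind of check, assuming a hypothetical endpoint URL and payload (this is not the team's actual load-testing tooling):

```python
import concurrent.futures
import json
import urllib.request

# Hypothetical staging endpoint; the real inference URL is not shown in the log.
EDIT_CHECK_URL = "https://inference-staging.example.org/v1/models/edit-check:predict"


def is_empty_response(body: str) -> bool:
    """Treat a response as 'empty' if it has no body, is not JSON,
    or decodes to an empty JSON value ({} or [])."""
    if not body.strip():
        return True
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return True
    return not payload


def run_load_test(n_requests: int = 10000, concurrency: int = 50) -> int:
    """Fire n_requests concurrently and return how many were empty."""
    def one_request(_):
        req = urllib.request.Request(
            EDIT_CHECK_URL,
            data=b'{"instances": []}',  # illustrative payload
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return is_empty_response(resp.read().decode())

    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sum(pool.map(one_request, range(n_requests)))
```

A run like `run_load_test(10000)` returning 0 would match the result reported on the task: every request on staging came back non-empty.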
[09:54:08] 06Machine-Learning-Team, 06Research: Score probability evaluation for languages without enough data - https://phabricator.wikimedia.org/T398930#11042258 (10achou)
[09:54:09] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages - https://phabricator.wikimedia.org/T400423#11042259 (10achou)
[09:54:11] 06Machine-Learning-Team, 05Goal: FY2024-25 Q4 Goal: Productionize tone check model - https://phabricator.wikimedia.org/T391940#11042260 (10achou)
[09:59:20] 06Machine-Learning-Team, 10Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11042272 (10achou)
[09:59:22] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Airflow training pipeline for Tone check model - https://phabricator.wikimedia.org/T398970#11042273 (10achou)
[14:10:28] hi folks!
[14:12:06] I am reviewing the WIP dashboard for the Edit Check's SLO (SRE is trying to make it right, our tooling is not there yet).
I noticed a sudden drop in the SLO error budget around the 24th, and indeed I see p50+ latencies above 1s most of the time:
[14:12:07] https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&from=now-7d&to=now&timezone=utc&var-cluster=aWotKxQMz&var-namespace=edit-check&var-backend=$__all&var-response_code=200&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&refresh=30s
[14:12:56] Ilias deployed something on the 24th related to edit check for all clusters https://sal.toolforge.org/production?p=0&q=edit-check&d=2025-07-24
[14:13:32] it is not a big problem at the moment, but in the bright future this use case would cause an alert to be fired for sure :)
[14:13:53] (the commitment so far for the SLO is to have 90% of the HTTP 200 responses returning in max 1s)
[14:32:18] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck, 10SRE-SLO, 10Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11043355 (10elukey) Today I tried to review the graphs in the Tone Check's latency SLO page, and this is what I found:...
[14:32:33] summarized it in https://phabricator.wikimedia.org/T390706#11043355 :)
[15:00:39] o/ elukey: Looking into the Grafana dashboard it seems there was almost no traffic before the 24th. I think it's possible that we've just started receiving more traffic around the 24th, as AFAIK we've started running some A/B tests with the edit check on various wikis around this time, which could explain the sudden drop in error budget
[15:02:09] bartosz: okok makes sense! Do we expect the latency to improve?
Otherwise we may need to review the SLO values :)
[15:03:05] Yeah, I think the latency numbers are still worrying; I'm not sure if we expect them to improve
[15:04:39] what's interesting is that the total traffic graph seems like it should be totally manageable, our load tests showed that each replica should handle ~15 requests per second with good latency
[15:16:31] it also seems that our deployment rarely scaled up in the last few days, maybe lowering the scale-up rps threshold a little (currently set at 15rps) could help deal with the increased load https://grafana.wikimedia.org/goto/nK_HG2wHg?orgId=1
[15:19:04] it is all good since it seems that the metrics are finally making some sense, there is no expectation to have everything up and running and consistent tomorrow
[15:19:13] but I am trying to follow up to improve our tooling :)
[15:22:57] I see, thank you for working on improving those!
[15:25:45] I'm still a little curious about the total traffic on the istio dashboard being constantly low when our replicas scaled up based on the 15rps target
[17:22:01] 06Machine-Learning-Team, 06collaboration-services, 06Wikipedia-iOS-App-Backlog: [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#11044214 (10Seddon)
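The SLO discussed in the conversation above (90% of HTTP 200 responses within 1s) maps directly onto an error-budget calculation, which explains why sustained p50 latencies above 1s burn the budget so quickly. A minimal sketch over a flat sample of per-request latencies, not the actual Prometheus-based SLO tooling:

```python
def slo_compliance(latencies_s, threshold_s=1.0):
    """Fraction of requests that completed within the latency threshold."""
    fast = sum(1 for t in latencies_s if t <= threshold_s)
    return fast / len(latencies_s)


def error_budget_remaining(latencies_s, slo_target=0.90, threshold_s=1.0):
    """Remaining error budget as a fraction of the allowed slow-request budget.

    With a 90% target, 10% of requests may exceed the threshold; the budget
    is consumed in proportion to the actual share of slow requests.
    Returns 1.0 when untouched, 0.0 when exactly spent, negative when overspent.
    """
    allowed_slow = 1.0 - slo_target
    actual_slow = 1.0 - slo_compliance(latencies_s, threshold_s)
    return 1.0 - actual_slow / allowed_slow
```

With the p50 above 1s (i.e. more than half the requests slow), `error_budget_remaining` goes deeply negative almost immediately, which matches the sudden drop seen on the dashboard.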
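The scale-up puzzle above (replicas rarely scaling up despite increased load) follows from the standard Kubernetes HPA formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A small sketch showing why lowering the 15 rps per-replica target would trigger earlier scale-ups; the traffic numbers are illustrative, not the namespace's real load:

```python
import math


def desired_replicas(current_replicas: int,
                     current_rps_per_replica: float,
                     target_rps_per_replica: float,
                     max_replicas: int = 3) -> int:
    """Kubernetes HPA scaling rule: ceil(current * metric / target),
    clamped between 1 and max_replicas (3 for edit-check on staging)."""
    desired = math.ceil(
        current_replicas * current_rps_per_replica / target_rps_per_replica
    )
    return max(1, min(desired, max_replicas))
```

For example, 12 rps on a single replica never scales up against a 15 rps target (ceil(12/15) = 1), but would add a replica against a 10 rps target (ceil(12/10) = 2), which is the effect lowering the threshold is meant to achieve.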