[06:52:48] good morning.
[06:53:28] mooorning!
[07:04:51] good morning folks
[07:08:17] good morning :)
[07:54:06] 06Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11151061 (10gkyziridis) >>! In T403378#11147585, @elukey wrote: > A simple and effective debug strategy could be to add logging about the payload received from the client, so that coupling high late...
[08:16:54] 06Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11151121 (10elukey) >>! In T403378#11151061, @gkyziridis wrote: > > I feel that it is time to experiment with the GPUs as well and see if we still have high latencies. > We can be sure for the follo...
[08:23:02] 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11151147 (10BWojtowicz-WMF) Thank you for the discussion @Ottomata and @Eevans! I think I'm leaning more into storing all predictions under the key of `w...
[08:25:34] 06Machine-Learning-Team, 10Data-Platform-SRE (2025.08.16 - 2025.09.05), 10Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11151194 (10kevinbazira) Following T396495#10941826 and the improvements @brouberol has hel...
[08:25:44] ^--- our ML pipeline now has DAGs that can:
[08:25:44] 1. extract data from the data lake, transform it, and save it to HDFS
[08:25:44] 2. copy training data from HDFS to the PVC
[08:25:44] 3. mount the PVC to the training pod, access GPUs, and run model training
[08:25:44] I will liaise with George and Aiko to apply similar patterns from these example DAGs to the tone-check pipelines.
[08:41:56] \o/ this is great kevinbazira !
[08:45:20] Thank you @kevinbazira
[08:54:09] kevinbazira: niceeee!
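The three DAG stages described above (data lake → HDFS → PVC → GPU training pod) can be sketched as a chain of plain Python task functions, the way an Airflow DAG would chain its tasks. This is only an illustrative sketch: all function names and paths are hypothetical stand-ins, not the actual pipeline code.

```python
# Minimal sketch of the three pipeline stages from the chat; every
# name and path below is a hypothetical stand-in for the real DAG tasks.

def extract_and_transform(snapshot: str) -> str:
    """Stage 1: pull rows from the data lake, transform, write to HDFS."""
    hdfs_path = f"/tmp/training_data/{snapshot}"  # stand-in for an HDFS URI
    return hdfs_path

def copy_to_pvc(hdfs_path: str) -> str:
    """Stage 2: copy the prepared dataset from HDFS onto the training PVC."""
    return hdfs_path.replace("/tmp/training_data", "/mnt/pvc")

def train_on_gpu(pvc_path: str) -> dict:
    """Stage 3: run training in a pod with the PVC mounted and GPUs attached."""
    return {"data": pvc_path, "status": "trained"}

# Chain the steps in DAG order: extract -> copy -> train
result = train_on_gpu(copy_to_pvc(extract_and_transform("2025-09-01")))
```

In a real Airflow DAG each function would be a task and the chaining would be expressed via task dependencies rather than direct calls; the point here is only the data flow between the three stages.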
\o/
[08:58:12] 👏
[09:17:38] 06Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11151376 (10isarantopoulos) @gkyziridis thanks for providing the updated load tests and the graphs! The issue we experienced with autoscaling in production was that pods scaled up although we never...
[10:02:12] 06Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11151425 (10gkyziridis) @isarantopoulos I totally agree with this plan. 1. Log input string lengths (sum) 2. Monitor the resources and find correlations between input size and resource usage 3. Enable GPUs on e...
[10:14:09] 06Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11151434 (10isarantopoulos) The load tests should focus on single-instance requests instead of multiple ones, as this is the way requests are sent in prod as well. Batching does happen usin...
[10:34:14] bartosz: a couple of days ago you and I discussed merging the transformer & predictor pods in the outlink articletopic model into one. wdyt, shall we pursue this before adding the cache?
[10:36:26] aiko: since you were the one who implemented this service and followed the transformer/predictor pattern, wdyt? the main argument was that the deployment will be much simpler (1 pod -- easier to configure resources) and it might even help with manipulating the cache entries (passing data from one function to another instead of from one pod to another)
[10:37:22] 06Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11151500 (10gkyziridis) > We should already be able to report on pod resources. Is this the time when you ran the load tests https://grafana.wikimedia.org/goto/bTPrpo9Hg?orgId=1? It shows some incre...
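Step 1 of the plan above (log the summed input string lengths so latency spikes can be correlated with payload size) could look roughly like this. A minimal sketch only: the function name, logger name, and payload shape are assumptions, not the actual tone-check handler.

```python
import logging

# Hypothetical logger name; the real service may use a different one.
logger = logging.getLogger("tone-check")

def log_payload_size(instances: list[str]) -> int:
    """Log the summed character length of the input strings so high
    latency can be correlated with payload size (sketch: the real
    handler and field names may differ)."""
    total_chars = sum(len(s) for s in instances)
    logger.info(
        "payload_size_chars=%d n_instances=%d", total_chars, len(instances)
    )
    return total_chars
```

Emitting the sum as a structured key-value pair makes it straightforward to graph payload size against pod CPU/memory later, which is what step 2 of the plan calls for.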
[10:45:20] o/ isaranto: It's a good point. I think pursuing this before adding the cache could simplify the cache code for the articletopic model, because currently both the predictor & transformer depend on the cache and have to import it
[11:29:36] I agree. I think the transformer/predictor pattern made dev and deployment somewhat more complex, which is why we don't use it in our other models. There are some advantages to the pattern, like potentially reusing the transformer for different models if they share the same data preprocessing, but for most of our models I think keeping it simpler is good
[12:02:19] alright, let's do this then!
[13:51:27] 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11152043 (10Eevans) >>! In T401778#11151147, @BWojtowicz-WMF wrote: > Thank you for the discussion @Ottomata and @Eevans! > > I think I'm leaning more in...
[13:56:15] 06Machine-Learning-Team, 06Growth-Team, 10Revise-Tone-Structured-Task, 05Goal, 07OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11152049 (10achou) @MGerlach @diego @fkaelin Wow thank you for all your input! <...
[14:30:15] 06Machine-Learning-Team, 10Data-Platform-SRE (2025.09.05 - 2025.09.26), 10Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11152344 (10Gehel)
[15:05:41] 06Machine-Learning-Team, 07Essential-Work: Create an analytics service user for the ML team - https://phabricator.wikimedia.org/T400902#11152490 (10Gehel)
[15:12:44] FIRING: LiftWingServiceErrorRate: ...
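The merge discussed above (one pod instead of separate transformer and predictor pods) can be sketched as a single service class where preprocessing and prediction live in the same process, so the cache is imported once and intermediate data is passed between functions rather than between pods. This is a hypothetical illustration: the class, method, and field names are not the actual articletopic service code.

```python
class ArticleTopicModel:
    """Sketch of collapsing the transformer + predictor pods into one
    service class. All names here are illustrative stand-ins; only the
    shape of the idea (one process, one shared cache) is from the chat.
    """

    def __init__(self, cache: dict):
        self.cache = cache  # single shared cache, imported in one place

    def preprocess(self, payload: dict) -> int:
        # Formerly the transformer pod's job.
        return payload["rev_id"]

    def predict(self, rev_id: int) -> dict:
        if rev_id in self.cache:            # cache hit: skip inference
            return self.cache[rev_id]
        prediction = {"topics": ["STEM"]}   # stand-in for model inference
        self.cache[rev_id] = prediction
        return prediction

    def __call__(self, payload: dict) -> dict:
        # Data flows function-to-function instead of pod-to-pod.
        return self.predict(self.preprocess(payload))
```

With one pod there is also a single set of resource requests/limits to configure, which is the simpler-deployment argument made at 10:36:26.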
[15:12:44] LiftWing service has a high rate of non-2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:27:44] RESOLVED: LiftWingServiceErrorRate: ...
[15:27:44] LiftWing service has a high rate of non-2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:35:38] 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11152710 (10fkaelin) I agree with @BWojtowicz-WMF and prefer storing all predictions in a single value. The thresholds are "external" to the model predicti...
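The caching proposal at 15:35:38 (store all predictions for a revision as one value under one key, with thresholds applied outside the model) can be sketched as below. The key layout and score shapes are assumptions for illustration, not the agreed schema from T401778.

```python
import json

def store_predictions(kv_store: dict, wiki: str, rev_id: int, scores: dict) -> None:
    """Store the full topic-score map for a revision as ONE value under
    ONE key, rather than one key per topic (key layout is hypothetical)."""
    key = f"{wiki}:{rev_id}"
    kv_store[key] = json.dumps(scores)

def read_topics(kv_store: dict, wiki: str, rev_id: int, threshold: float) -> list:
    """Apply the threshold at read time, since it is 'external' to the
    model prediction: consumers can pick their own cutoff."""
    scores = json.loads(kv_store[f"{wiki}:{rev_id}"])
    return [topic for topic, score in scores.items() if score >= threshold]
```

Storing everything in one value keeps a cache write atomic per revision and lets different consumers threshold the same stored predictions differently, which seems to be the motivation in the ticket discussion.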