[05:41:42] Machine-Learning-Team, MediaWiki-extensions-ORES: Fix ORES Special page - https://phabricator.wikimedia.org/T345407 (isarantopoulos)
[05:41:53] Machine-Learning-Team, MediaWiki-extensions-ORES: Fix ORES Special page - https://phabricator.wikimedia.org/T345407 (isarantopoulos)
[08:05:57] Good morning! Running an errand bbl!
[08:07:22] o/
[08:18:30] Machine-Learning-Team: Increase Lift Wing rate limit for ImpactVisualizer OAuth2 client - https://phabricator.wikimedia.org/T345394 (elukey) Thanks a lot for following up! The `wme` tier is very big and created ad-hoc for the WME use case, we usually suggest to use the `internal` one (the name is not great)...
[10:00:21] still trying to make knative working, sigh
[10:26:51] I think that this happens:
[10:27:20] 1) we want to deploy a new version of the knative-serving chart, that causes all the knative control plane pods to be re-created.
[10:27:43] Do you think we should fix this before releasing the extension to big wikis or not?
[10:28:00] 2) if we also set a change in a config-map, like in this case, when helm applies the new changes the kube-api will in turn call the knative webhook (as it is instructed) to validate the new config
[10:28:18] I mean as far as I understand we don't have model server failures because of it
[10:28:27] 3) the validating webhook takes time to startup, and when it is called it causes a timeout (that fails the deployment)
[10:28:36] ah nono you can proceed
[10:29:56] in the past I solved this problem with a loong readiness probe setting, that got changed a while ago to be more flexible
[10:30:13] very weird
[10:34:02] the other thing that is odd: on staging it works all the time
[10:35:20] I am trying to remove HPA Rules (cpu based autoscaling) since we don't really support them
[10:35:40] yeah but nothing really changed
[10:46:47] ack
[11:01:48] I managed to "fix" it, temporarily raising the webhook pods
[11:02:46] lunch break!
Then I'll keep going with the autoscaling stuff
[11:11:57] Machine-Learning-Team: ores-legacy wikidata errors - https://phabricator.wikimedia.org/T345063 (Aklapper)
[11:24:53] Lunch break for me as well!
[13:47:14] Morning all!
[13:47:29] o/
[13:52:06] Machine-Learning-Team, Patch-For-Review, Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (MGerlach) weekly update: * met with Luca and Aiko to discuss load testing and SLO T334182#9130664 * we agreed that it m...
[14:15:38] deployed the new 8 rps target for revscoring-goodfaith, and I tested it with itwiki and wrk
[14:15:48] instances went up to 4 (maximum) pretty quickly
[14:17:45] going to deploy the rest
[14:19:48] 8 seems to be a good value (and 85% of it is 6.8, that is when we start scaling up)
[14:20:16] I am pretty sure that we are very conservative
[14:20:25] but better safe than sorry
[14:25:12] 8 rps target?
[14:26:53] Nice job Luca! I think we can run some tests in staging and increase it
[14:28:23] chrisalbon_: 8 rps = requests per second. When we reach 85% of that we scale up adding one more pod
[14:28:36] exactly yes, it is a knative setting
[14:28:54] we are tuning it so new traffic will be handled hopefully better
[14:29:33] we'll have to tune those values, and I added a specific note for model owners to help us figure it out before hitting production (we'll do it with the readability model as first test case)
[14:34:42] all deployed
[14:35:32] ah got it
[14:35:59] chrisalbon_: we also discussed SLOs etc., Research agreed that we should come up with numbers together
[14:36:16] it is a mental shift of course, it takes time, but we'll all have to reason in those terms sooner or later
[14:36:48] Okay cool
[14:37:13] That work can also fall on the PMs for the features powered by the model, but that will likely flow through the model makers (i.e. researchers)
[14:37:37] * elukey nods
[14:38:19] for example, if a model hosted on Lift Wing needs to be used on Wikipedia with some reliability expectations, we'll have to know
[14:38:45] one service cannot set SLO X if one of its dependencies offers an SLO with lower value than X
[14:38:51] yeah
[14:42:33] Machine-Learning-Team: Tune LiftWing autoscaling settings for Knative - https://phabricator.wikimedia.org/T344058 (elukey) We decided to switch to `rps` as metric to scale up or down (as opposed to `concurrency`), since it seems more in line with other metrics like Istio RPSes etc. We have set `8` as target...
[14:55:24] I still haven't managed to fix the ORES special page...
[14:55:49] Logging off for the weekend folks
[14:55:53] o/
[14:58:28] o/
[15:04:29] going afk for the weekend as well folks!
[15:22:35] night all!
[15:49:39] Machine-Learning-Team: Increase Lift Wing rate limit for ImpactVisualizer OAuth2 client - https://phabricator.wikimedia.org/T345394 (Ragesoss) `internal` might be fine, but since it's only 2x the default limit which we were already hitting, I thought `wme` would be safer. The use case is that we pull data f...
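
Note on the numbers discussed in the log: the rps target and the SLO rule can be sketched in a few lines of Python. This is only a simplified model under stated assumptions, not Knative's autoscaler code or the team's configuration. The target (8 rps), the 85% utilization and the maximum of 4 instances come from the log; the minimum of 1 pod and all function names are hypothetical and used only for illustration.

```python
import math

# Values quoted in the log.
TARGET_RPS = 8.0      # per-pod requests-per-second target
UTILIZATION = 0.85    # fraction of the target at which scale-up starts
MAX_PODS = 4          # maximum instances reached during the wrk test
MIN_PODS = 1          # assumption: not stated in the log


def scale_up_threshold(target_rps=TARGET_RPS, utilization=UTILIZATION):
    """Per-pod rps at which one more pod is requested (8 * 0.85 = 6.8)."""
    return target_rps * utilization


def desired_pods(total_rps, min_pods=MIN_PODS, max_pods=MAX_PODS):
    """Simplified estimate: enough pods to keep each one at or under the threshold."""
    wanted = math.ceil(total_rps / scale_up_threshold())
    return max(min_pods, min(max_pods, wanted))


def can_offer_slo(service_slo, dependency_slos):
    """Rule from the log: a service cannot set an SLO above any dependency's SLO."""
    return all(service_slo <= dep for dep in dependency_slos)
```

Under this model, 7 rps of total traffic already asks for a second pod, because 7 exceeds the 6.8 threshold, and any load above 27.2 rps is capped at the 4-pod maximum. The SLO check uses the simple comparison stated in the chat; it does not model how availabilities compound across dependencies.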