[05:41:42] <wikibugs>	 10Machine-Learning-Team, 10MediaWiki-extensions-ORES: Fix ORES Special page - https://phabricator.wikimedia.org/T345407 (10isarantopoulos)
[05:41:53] <wikibugs>	 10Machine-Learning-Team, 10MediaWiki-extensions-ORES: Fix ORES Special page - https://phabricator.wikimedia.org/T345407 (10isarantopoulos)
[08:05:57] <isaranto>	 Good morning! Running an errand bbl!
[08:07:22] <elukey>	 o/
[08:18:30] <wikibugs>	 10Machine-Learning-Team: Increase Lift Wing rate limit for ImpactVisualizer OAuth2 client - https://phabricator.wikimedia.org/T345394 (10elukey) Thanks a lot for following up!  The `wme` tier is very big and created ad-hoc for the WME use case, we usually suggest to use the `internal` one (the name is not great)...
[10:00:21] <elukey>	 still trying to get knative working, sigh
[10:26:51] <elukey>	 I think that this happens:
[10:27:20] <elukey>	 1) we want to deploy a new version of the knative-serving chart, that causes all the knative control plane pods to be re-created.
[10:27:43] <isaranto>	 Do you think we should fix this before releasing the extension to big wikis or not?
[10:28:00] <elukey>	 2) if we also set a change in a config-map, like in this case, when helm applies the new changes the kube-api will in turn call the knative webhook (as it is instructed) to validate the new config
[10:28:18] <isaranto>	 I mean as far as I understand we don't have model server failures because of it
[10:28:27] <elukey>	 3) the validating webhook takes time to start up, and when it is called it causes a timeout (that fails the deployment)
[10:28:36] <elukey>	 ah nono you can proceed
[10:29:56] <elukey>	 in the past I solved this problem with a loong readiness probe setting, that got changed a while ago to be more flexible
[10:30:13] <elukey>	 very weird
[10:34:02] <elukey>	 the other thing that is odd: on staging it works all the time
[10:35:20] <elukey>	 I am trying to remove HPA Rules (cpu based autoscaling) since we don't really support them
[10:35:40] <elukey>	 yeah but nothing really changed
[10:46:47] <isaranto>	 ack
[11:01:48] <elukey>	 I managed to "fix" it, temporarily raising the webhook pods
[11:02:46] <elukey>	 lunch break! Then I'll keep going with the autoscaling stuff
[11:11:57] <wikibugs>	 10Machine-Learning-Team: ores-legacy wikidata errors - https://phabricator.wikimedia.org/T345063 (10Aklapper)
[11:24:53] <isaranto>	 Lunch break for me as well!
[13:47:14] <chrisalbon_>	 Morning all!
[13:47:29] <elukey>	 o/
[13:52:06] <wikibugs>	 10Machine-Learning-Team, 10Patch-For-Review, 10Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (10MGerlach) weekly update: * met with Luca and Aiko to discuss load testing and SLO T334182#9130664 * we agreed that it m...
[14:15:38] <elukey>	 deployed the new 8 rps target for revscoring-goodfaith, and I tested it with itwiki and wrk
[14:15:48] <elukey>	 instances went up to 4 (maximum) pretty quickly
[14:17:45] <elukey>	 going to deploy the rest
[14:19:48] <elukey>	 8 seems to be a good value (and 85% of it is 6.8, that is when we start scaling up)
[14:20:16] <elukey>	 I am pretty sure that we are very conservative
[14:20:25] <elukey>	 but better safe than sorry
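The arithmetic above (a target of 8 rps per pod, with scale-up starting at 85% of it) can be sketched roughly as follows. This is a simplified illustration and not Knative's actual autoscaler code: the function names are hypothetical, the minimum of 1 pod is an assumption, and only the values 8, 85% and the maximum of 4 instances come from the log.

```python
import math

TARGET_RPS = 8       # per-pod rps target mentioned in the log
UTILIZATION = 0.85   # scale-up starts at 85% of the target (from the log)
MAX_PODS = 4         # maximum instance count mentioned in the log
MIN_PODS = 1         # assumption: not stated in the log


def scale_up_threshold(target_rps=TARGET_RPS, utilization=UTILIZATION):
    """Per-pod request rate above which another pod is wanted."""
    return target_rps * utilization


def desired_pods(total_rps, min_pods=MIN_PODS, max_pods=MAX_PODS):
    """Simplified estimate: divide total traffic by the per-pod threshold,
    round up, then clamp to the allowed pod range."""
    wanted = math.ceil(total_rps / scale_up_threshold())
    return max(min_pods, min(max_pods, wanted))


if __name__ == "__main__":
    print(scale_up_threshold())
    for rps in (5, 7, 20, 100):
        print(rps, "rps ->", desired_pods(rps), "pod(s)")
```

Under these assumptions a single pod absorbs up to 6.8 rps, anything above that asks for a second pod, and heavy traffic is capped at the 4-pod maximum.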
[14:25:12] <chrisalbon_>	 8 rps target?
[14:26:53] <isaranto>	 Nice job Luca! I think we can run some tests in staging and increase it 
[14:28:23] <isaranto>	 chrisalbon_:  8 rps = requests per second . When we reach 85% of that we scale up adding one more pod
[14:28:36] <elukey>	 exactly yes, it is a knative setting
[14:28:54] <elukey>	 we are tuning it so new traffic will be handled hopefully better
[14:29:33] <elukey>	 we'll have to tune those values, and I added a specific note for model owners to help us figure it out before hitting production (we'll do it with the readability model as first test case)
[14:34:42] <elukey>	 all deployed
[14:35:32] <chrisalbon_>	 ah got it
[14:35:59] <elukey>	 chrisalbon_: we also discussed SLOs etc., Research agreed that we should come up with numbers together
[14:36:16] <elukey>	 it is a mental shift of course, it takes time, but we'll all have to reason in those terms sooner or later
[14:36:48] <chrisalbon_>	 Okay cool
[14:37:13] <chrisalbon_>	 That work can also fall on the PMs for the features powered by the model, but that will likely flow through the model makers (i.e. researchers)
[14:37:37] * elukey nods
[14:38:19] <elukey>	 for example, if a model hosted on Lift Wing needs to be used on Wikipedia with some reliability expectations, we'll have to know
[14:38:45] <elukey>	 one service cannot set SLO X if one of its dependencies offers an SLO with lower value than X
[14:38:51] <chrisalbon_>	 yeah
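The dependency rule stated above (a service cannot promise SLO X if one of its dependencies offers less than X) can be expressed as a small check. This is only a sketch of that rule as worded in the chat: the function names, the service names and the percentages are hypothetical, and it assumes every listed dependency is a hard one that each request needs.

```python
def max_credible_slo(dependency_slos):
    """Upper bound on the SLO a service can promise: the lowest SLO among
    its hard dependencies (100.0 if it has none)."""
    if not dependency_slos:
        return 100.0
    return min(dependency_slos.values())


def slo_is_consistent(service_slo, dependency_slos):
    """True if the proposed SLO does not exceed what the dependencies offer."""
    return service_slo <= max_credible_slo(dependency_slos)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    deps = {"model-on-lift-wing": 99.0, "other-backend": 99.9}
    print(max_credible_slo(deps))
    print(slo_is_consistent(99.5, deps))
    print(slo_is_consistent(98.0, deps))
```

With these made-up numbers, a feature depending on a model offered at 99.0 could not credibly advertise 99.5, which is why the model's numbers have to be known first.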
[14:42:33] <wikibugs>	 10Machine-Learning-Team: Tune LiftWing autoscaling settings for Knative - https://phabricator.wikimedia.org/T344058 (10elukey) We decided to switch to `rps` as metric to scale up or down (as opposed to `concurrency`), since it seems more in line with other metrics like Istio RPSes etc. We have set `8` as target...
[14:55:24] <isaranto>	 I still haven't managed to fix the ores special page .. 
[14:55:49] <isaranto>	 Logging off for the weekend folks
[14:55:53] <isaranto>	 o/
[14:58:28] <elukey>	 o/
[15:04:29] <elukey>	 going afk for the weekend as well folks!
[15:22:35] <chrisalbon_>	 night all!
[15:49:39] <wikibugs>	 10Machine-Learning-Team: Increase Lift Wing rate limit for ImpactVisualizer OAuth2 client - https://phabricator.wikimedia.org/T345394 (10Ragesoss) `internal` might be fine, but since it's only 2x the default limit which we were already hitting, I thought `wme` would be safer.  The use case is that we pull data f...