[07:16:26] (CR) Elukey: [C: +1] "Left a comment but feel free to skip it and do it next time in case you have ideas etc.. LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/898780 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[07:25:33] isaranto: o/ not sure if you tried it or not, but in gerrit you can send a chain of patches as well (they can be rebased one on top of the other)
[07:44:27] (CR) Elukey: "Left a comment but the rest LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/897910 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[08:13:00] (CR) Kevin Bazira: [C: +1] ores-legacy: add LW error messages and exceptions to response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/897910 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[08:23:07] Machine-Learning-Team, Gerrit, serviceops-radar, Language-Team (Language-2023-January-March): Create Gerrit repository for /services/machinetranslation and migrate code from Gitlab - https://phabricator.wikimedia.org/T331256 (hashar) I have created the GitHub repo https://github.com/wikimedia/med...
[08:26:03] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138 (kevinbazira)
[08:26:25] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138 (kevinbazira) 24/24 models were trained successfully in the 13th round of wikis.
[08:36:12] TIL https://github.com/peak/s5cmd
[08:36:35] `For uploads, s5cmd is 32x faster than s3cmd and 12x faster than aws-cli`
[09:09:59] lol
[09:14:41] isaranto: Giuseppe (SRE) created https://gitlab.wikimedia.org/repos/sre/sextant/-/blob/scaffold/README.md#create-a-new-chart-from-scaffolding-models, as a new way to create helm charts
[09:15:09] it should allow quicker creation of charts etc.. we could give it a try for ores-migration, what do you think?
[09:21:27] I'm lacking a lot of context as I don't understand what this solves
[09:22:59] but sure, let's give it a try
[09:23:32] I'm not familiar with what exactly the complexity of a WMF chart is, but I'll find out soon :D
[09:26:20] in my mind the helm chart for the ORES migration is going to be quite simple: a deployment, a service, and an ingress
[09:26:25] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui)
[09:43:26] Machine-Learning-Team, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[10:03:53] Machine-Learning-Team, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[10:08:44] (PS8) Ilias Sarantopoulos: ores-legacy: add LW error messages and exceptions to response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/897910 (https://phabricator.wikimedia.org/T330414)
[10:11:21] (CR) Ilias Sarantopoulos: ores-legacy: add LW error messages and exceptions to response (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/897910 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[10:12:09] isaranto: yes yes it is not difficult, the tool is supposed to add all the scaffolding needed
[10:12:22] you select modules and it adds the functionality
[10:12:59] like for example: tls-proxy to connect to Lift Wing (envoy etc..), istio-ingress config, deployment + annotations/labels, etc..
[10:13:22] so instead of doing it manually we have some scaffolding for us, so the service configs are as homogeneous as possible
[10:13:24] ack
[10:15:07] re: GPU discussion from yesterday - I was thinking that adding a GPU to staging would be fine so we can experiment etc..
[10:15:19] but staging is in codfw, and the AMD GPUs that we run on hadoop nodes are in eqiad
[10:15:22] sigh
[10:15:27] so we'll need to buy one
[10:16:04] which one? Good question :D
[10:27:54] elukey: before you deploy the autoscaling stuff, shall I submit a patch to try scaling to zero for 1-2 model servers?
[10:29:27] isaranto: better to try it in staging
[10:29:36] ack
[10:29:39] I was checking the knative docs and the default `stable-window` value is 60s lol
[10:30:02] stable-window is how long a revision must go without receiving a request before it is scaled to zero instances
[10:30:11] yes exactly
[10:30:51] not sure what's best to be honest
[10:30:58] maybe something in the range of hours
[10:37:44] yeah we'll see.. maybe even days in some situations
[10:38:29] allowed 2 deployments in staging to scale to zero https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/900239
[10:42:13] scaling configs deployed
[11:04:54] (PS8) Ilias Sarantopoulos: Add logging for FastAPI app [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/898780 (https://phabricator.wikimedia.org/T330414)
[11:06:30] (CR) Ilias Sarantopoulos: Add logging for FastAPI app (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/898780 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[11:44:03] * elukey lunch!
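The `stable-window` and scale-to-zero settings discussed above can be set per revision via Knative autoscaling annotations. A minimal sketch (the service name and values here are illustrative, and Lift Wing's KServe layer may expose these knobs differently; annotation names are per the Knative Serving docs):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-model-server          # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "0"   # allow scaling to zero
        autoscaling.knative.dev/window: "30m"   # stable window (default is 60s)
```

Note that scale-to-zero also depends on the cluster-wide `enable-scale-to-zero` setting in the `config-autoscaler` ConfigMap, and Knative bounds the stable window (the "range of hours" idea above may require cluster-level changes).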
[11:53:11] * isaranto afk for 1-2h
[13:08:46] (CR) Jforrester: [C: +2] build: Updating dependencies [extensions/ORES] - https://gerrit.wikimedia.org/r/899976 (owner: Libraryupgrader)
[13:23:26] Machine-Learning-Team, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[13:39:33] Back!
[14:56:54] folks: the experimental namespace is now only deployed in staging
[15:51:09] noice
[15:53:25] managed to make caching work for async using the aiocache package, following recommendations from issues and discussions on the FastAPI GitHub, e.g. https://github.com/tiangolo/fastapi/issues/651
[15:54:13] isaranto: what kind of caching is it?
[15:54:18] The idea is that if the endpoint is too slow for some calls that may be repeated, we can use a Redis instance to cache LW calls
[15:54:40] mmm if possible let's not use Redis
[15:54:43] for now I used an in-memory cache that brought great results
[15:55:18] let's first build the first version without caching, then we think about what's best etc..
[15:55:23] it supports memory, Redis, or memcached
[15:55:25] I fear that it will complicate things a lot
[15:55:47] without WME and ChangeProp the traffic to ORES is really low
[15:55:48] sure, we'll build it without
[15:56:17] just wanted to have an idea of how to do it and how feasible it is
[15:57:20] ah yes yes +1
[15:57:39] when I hear caching I am always on the defensive side, it is sooo difficult to get right
[15:57:59] I'll just write a summary for reference on the ticket
[15:58:22] load tests were astonishing
[15:58:34] (ok ofc they would be, but still...)
[15:59:25] what's the best way to keep the code for reference as well? create a patch and abandon it?
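The in-memory caching experiment described above can be sketched without any third-party package; this is a minimal, stdlib-only async TTL memoizer illustrating the idea (the function names and the fake Lift Wing call are hypothetical; aiocache's `@cached` decorator provides the same pattern with pluggable memory, Redis, or memcached backends):

```python
import asyncio
import time

# Simple in-memory TTL cache keyed by function name + positional args.
_cache: dict = {}

def async_ttl_cache(ttl_seconds: float):
    """Memoize an async function's result for ttl_seconds."""
    def decorator(func):
        async def wrapper(*args):
            key = (func.__name__, args)
            hit = _cache.get(key)
            if hit is not None and time.monotonic() - hit[0] < ttl_seconds:
                return hit[1]  # fresh cached value, skip the slow call
            value = await func(*args)
            _cache[key] = (time.monotonic(), value)
            return value
        return wrapper
    return decorator

calls = 0  # counts how many "real" backend calls were made

@async_ttl_cache(ttl_seconds=60)
async def fetch_score(rev_id: int):
    # Stand-in for an HTTP call to a Lift Wing model server.
    global calls
    calls += 1
    return {"rev_id": rev_id, "damaging": 0.1}

async def main():
    a = await fetch_score(12345)
    b = await fetch_score(12345)  # served from the cache
    return a, b, calls

result = asyncio.run(main())
```

Swapping the dict for a Redis or memcached client is what makes the cache shared across replicas, which is where the operational complexity the discussion worries about comes in.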
[16:11:30] (PS1) Ilias Sarantopoulos: ores-migration: cache LiftWing calls [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/900365 (https://phabricator.wikimedia.org/T330414)
[16:14:09] Machine-Learning-Team, Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (isarantopoulos) I experimented with adding a cache to the application in order to cache LiftWing calls, and we know what we need to implement and how. (alternatively...
[16:47:14] created https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Stages_for_a_model_on_Lift_Wing
[16:47:29] it is a short paragraph but contains more or less the idea for experimental
[16:47:41] please change/fix/etc.. it if you like
[16:49:21] Sounds good to me.
[16:50:39] klausman: the api-gateway limits for Lift Wing endpoints are now in flux; we can definitely change them if needed. 10k for anon traffic and 100k for logged-in users (per hour)
[16:51:01] if you have better values please change them; I thought to use something not super restrictive as a starter
[16:51:06] So 10k reqs per anon IP per hour?
[16:53:00] correct
[16:53:27] it is around 3 rps
[16:53:31] maximum
[16:53:46] Eh, sounds good enough. I'd start with something like 1 qps per IP, which would work out to 3600 r/h/ip, so same neighborhood.
[16:54:24] okok
[16:55:00] How much we can actually handle, we'll have to figure out once production traffic actually shows up. Many vagaries in there, and we haven't really done any optimization either.
[16:56:31] I agree, I wanted to have something in place to know 1) that it worked 2) that we could easily change it etc..
[16:56:48] I tested it in staging with an aiohttp script; the thresholds are respected
[16:56:50] Yeah, 10k and 100k are good starting points
[16:57:15] all right :)
[16:57:24] I am logging off for today folks
[16:57:28] have a nice rest of the day!
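As a sanity check on the rate-limit numbers above, the hourly quotas convert to average per-second rates roughly as follows (a back-of-the-envelope sketch, not the actual api-gateway configuration):

```python
# Convert the per-hour api-gateway quotas into average requests per second.
anon_per_hour = 10_000        # anonymous traffic limit per IP per hour
logged_in_per_hour = 100_000  # logged-in user limit per hour

anon_rps = anon_per_hour / 3600            # ~2.8 rps, i.e. "around 3 rps maximum"
logged_in_rps = logged_in_per_hour / 3600  # ~27.8 rps

# The alternative floated in the discussion: 1 qps per IP sustained for an hour.
one_qps_per_hour = 1 * 3600                # 3600 r/h/ip, same neighborhood
```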
[16:57:36] and thank you for picking up the docs part, I dropped the ball on that one
[16:57:46] \o have a nice evening
[16:58:01] np! I think that we can wait for others' feedback now, everything is in place
[16:58:12] I need to add docs to the API portal but it seems more complicated
[16:58:16] will do it in the next days
[18:34:25] is it straightforward for a community member to use one of the LiftWing models, e.g. wikidatawiki-damaging? seems there's a ticket to create docs https://phabricator.wikimedia.org/T325759 ? I assume there are no docs as of yet?
[23:11:37] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 70 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)