[07:16:26] (CR) Elukey: [C: +1] "Left a comment but feel free to skip it and do it next time in case you have ideas etc.. LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/898780 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[07:25:33] isaranto: o/ not sure if you tried it or not, but in gerrit you can send a chain of patches as well (they can be rebased one on top of the other)
[07:44:27] (CR) Elukey: "Left a comment but the rest LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/897910 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[08:13:00] (CR) Kevin Bazira: [C: +1] ores-legacy: add LW error messages and exceptions to response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/897910 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[08:23:07] Machine-Learning-Team, Gerrit, serviceops-radar, Language-Team (Language-2023-January-March): Create Gerrit repository for /services/machinetranslation and migrate code from Gitlab - https://phabricator.wikimedia.org/T331256 (hashar) I have created the GitHub repo https://github.com/wikimedia/med...
[08:26:03] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138 (kevinbazira)
[08:26:25] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138 (kevinbazira) 24/24 models were trained successfully in the 13th round of wikis.
[08:36:12] TIL https://github.com/peak/s5cmd
[08:36:35] `For uploads, s5cmd is 32x faster than s3cmd and 12x faster than aws-cli`
[09:09:59] lol
[09:14:41] isaranto: Giuseppe (SRE) created https://gitlab.wikimedia.org/repos/sre/sextant/-/blob/scaffold/README.md#create-a-new-chart-from-scaffolding-models, as a new way to create helm charts
[09:15:09] it should allow quicker creation of charts etc.. we could give it a try for ores-migration, what do you think?
[09:21:27] I'm lacking a lot of context as I don't understand what this solves
[09:22:59] but sure, let's give it a try
[09:23:32] I'm not familiar with what exactly the complexity of a WMF chart is, but I'll find out soon :D
[09:26:20] in my mind the helm chart for the ORES migration is going to be quite simple: a deployment, a service, and an ingress
[09:26:25] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui)
[09:43:26] Machine-Learning-Team, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[10:03:53] Machine-Learning-Team, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[10:08:44] (PS8) Ilias Sarantopoulos: ores-legacy: add LW error messages and exceptions to response [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/897910 (https://phabricator.wikimedia.org/T330414)
[10:11:21] (CR) Ilias Sarantopoulos: ores-legacy: add LW error messages and exceptions to response (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/897910 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[10:12:09] isaranto: yes yes it is not difficult, the tool is supposed to add all the scaffolding needed
[10:12:22] you select modules and it adds the functionality
[10:12:59] like for example: tls-proxy to connect to Lift Wing (envoy etc..), istio-ingress config, deployment + annotations/labels, etc..
[10:13:22] so instead of doing it manually we have some scaffolding for us, so the service configs are as homogeneous as possible
[10:13:24] ack
[10:15:07] re: GPU discussion from yesterday - I was thinking that adding a GPU to staging would be fine so we can experiment etc..
[10:15:19] but staging is in codfw, and the AMD GPUs that we run on hadoop nodes are in eqiad
[10:15:22] sigh
[10:15:27] so we'll need to buy one
[10:16:04] which one? Good question :D
[10:27:54] elukey: before you deploy the autoscaling stuff, shall I submit a patch to try scaling to zero for 1-2 model servers?
[10:29:27] isaranto: better to try it in staging
[10:29:36] ack
[10:29:39] I was checking the knative docs and the default `stable-window` value is 60s lol
[10:30:02] stable-window is how long a revision must go without receiving a request before it is scaled to zero instances
[10:30:11] yes exactly
[10:30:51] not sure what's best to be honest
[10:30:58] maybe something in the range of hours
[10:37:44] yeah we'll see.. maybe even days in some situations
[10:38:29] allowed 2 deployments in staging to scale to zero https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/900239
[10:42:13] scaling configs deployed
[11:04:54] (PS8) Ilias Sarantopoulos: Add logging for FastAPI app [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/898780 (https://phabricator.wikimedia.org/T330414)
[11:06:30] (CR) Ilias Sarantopoulos: Add logging for FastAPI app (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/898780 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[11:44:03] * elukey lunch!
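The `stable-window` and scale-to-zero settings discussed above can be set per revision via Knative autoscaling annotations. A minimal sketch (the service name and values here are illustrative, and Lift Wing's KServe layer may expose these knobs differently; annotation names are per the Knative Serving docs):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-model-server          # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "0"   # allow scaling to zero
        autoscaling.knative.dev/window: "30m"   # stable window (default is 60s)
```

Note that scale-to-zero also depends on the cluster-wide `enable-scale-to-zero` setting in the `config-autoscaler` ConfigMap, and Knative bounds the stable window (the "range of hours" idea above may require cluster-level changes).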
[11:53:11] * isaranto afk for 1-2h
[13:08:46] (CR) Jforrester: [C: +2] build: Updating dependencies [extensions/ORES] - https://gerrit.wikimedia.org/r/899976 (owner: Libraryupgrader)
[13:23:26] Machine-Learning-Team, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[13:39:33] Back!
[14:56:54] folks: the experimental namespace is now only deployed in staging
[15:51:09] noice
[15:53:25] managed to make caching work for async using the aiocache package, following recommendations from issues and discussions on the FastAPI GitHub, e.g. https://github.com/tiangolo/fastapi/issues/651
[15:54:13] isaranto: what kind of caching is it?
[15:54:18] The idea is that if the endpoint is too slow for some calls that may be repeated, we can use a Redis instance to cache LW calls
[15:54:40] mmm if possible let's not use Redis
[15:54:43] for now I used an in-memory cache that brought great results
[15:55:18] let's first build the first version without caching, then we think about what's best etc..
[15:55:23] it supports memory, Redis, or memcached
[15:55:25] I fear that it will complicate things a lot
[15:55:47] without WME and ChangeProp the traffic to ORES is really low
[15:55:48] sure, we'll build it without
[15:56:17] just wanted to have an idea of how to do it and how feasible it is
[15:57:20] ah yes yes +1
[15:57:39] when I hear caching I am always on the defensive side, it is sooo difficult to get right
[15:57:59] I'll just write a summary for reference on the ticket
[15:58:22] load tests were astonishing
[15:58:34] (ok ofc they would be, but still...)
[15:59:25] what's the best way to keep the code for reference as well? create a patch and abandon it?
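The in-memory caching experiment described above can be sketched without any third-party package; this is a minimal, stdlib-only async TTL memoizer illustrating the idea (the function names and the fake Lift Wing call are hypothetical; aiocache's `@cached` decorator provides the same pattern with pluggable memory, Redis, or memcached backends):

```python
import asyncio
import time

# Simple in-memory TTL cache keyed by function name + positional args.
_cache: dict = {}

def async_ttl_cache(ttl_seconds: float):
    """Memoize an async function's result for ttl_seconds."""
    def decorator(func):
        async def wrapper(*args):
            key = (func.__name__, args)
            hit = _cache.get(key)
            if hit is not None and time.monotonic() - hit[0] < ttl_seconds:
                return hit[1]  # fresh cached value, skip the slow call
            value = await func(*args)
            _cache[key] = (time.monotonic(), value)
            return value
        return wrapper
    return decorator

calls = 0  # counts how many "real" backend calls were made

@async_ttl_cache(ttl_seconds=60)
async def fetch_score(rev_id: int):
    # Stand-in for an HTTP call to a Lift Wing model server.
    global calls
    calls += 1
    return {"rev_id": rev_id, "damaging": 0.1}

async def main():
    a = await fetch_score(12345)
    b = await fetch_score(12345)  # served from the cache
    return a, b, calls

result = asyncio.run(main())
```

Swapping the dict for a Redis or memcached client is what makes the cache shared across replicas, which is where the operational complexity the discussion worries about comes in.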
[16:11:30] (PS1) Ilias Sarantopoulos: ores-migration: cache LiftWing calls [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/900365 (https://phabricator.wikimedia.org/T330414)
[16:14:09] Machine-Learning-Team, Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (isarantopoulos) I experimented with adding a cache to the application in order to cache LiftWing calls, and we know what we need to implement and how. (alternatively...
[16:47:14] created https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Stages_for_a_model_on_Lift_Wing
[16:47:29] it is a short paragraph but contains more or less the idea for experimental
[16:47:41] please change/fix/etc.. it if you like
[16:49:21] Sounds good to me.
[16:50:39] klausman: the api-gateway limits for Lift Wing endpoints are now in flux; we can definitely change them if needed. 10k for anon traffic and 100k for logged-in users (per hour)
[16:51:01] if you have better values please change them; I thought to use something not super restrictive as a starter
[16:51:06] So 10k reqs per anon IP per hour?
[16:53:00] correct
[16:53:27] it is around 3 rps
[16:53:31] maximum
[16:53:46] Eh, sounds good enough. I'd start with something like 1 qps per IP, which would work out to 3600 r/h/ip, so same neighborhood.
[16:54:24] okok
[16:55:00] How much we can actually handle, we'll have to figure out once production traffic actually shows up. Many vagaries in there, and we haven't really done any optimization either.
[16:56:31] I agree, I wanted to have something in place to know 1) that it worked 2) that we could easily change it etc..
[16:56:48] I tested it in staging with an aiohttp script; the thresholds are respected
[16:56:50] Yeah, 10k and 100k are good starting points
[16:57:15] all right :)
[16:57:24] I am logging off for today folks
[16:57:28] have a nice rest of the day!
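As a sanity check on the rate-limit numbers above, the hourly quotas convert to average per-second rates roughly as follows (a back-of-the-envelope sketch, not the actual api-gateway configuration):

```python
# Convert the per-hour api-gateway quotas into average requests per second.
anon_per_hour = 10_000        # anonymous traffic limit per IP per hour
logged_in_per_hour = 100_000  # logged-in user limit per hour

anon_rps = anon_per_hour / 3600            # ~2.8 rps, i.e. "around 3 rps maximum"
logged_in_rps = logged_in_per_hour / 3600  # ~27.8 rps

# The alternative floated in the discussion: 1 qps per IP sustained for an hour.
one_qps_per_hour = 1 * 3600                # 3600 r/h/ip, same neighborhood
```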
[16:57:36] and thank you for picking up the docs part, I dropped the ball on that one
[16:57:46] \o have a nice evening
[16:58:01] np! I think that we can wait for others' feedback now, everything is in place
[16:58:12] I need to add docs to the API portal but it seems more complicated
[16:58:16] will do it in the next days
[18:34:25] is it straightforward for a community member to use one of the LiftWing models, e.g. wikidatawiki-damaging? seems there's a ticket to create docs https://phabricator.wikimedia.org/T325759 ? I assume there are no docs as of yet?
[23:11:37] Machine-Learning-Team, ORES, Advanced-Search, All-and-every-Wikisource, and 70 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)