[05:41:10] Good morning!
[05:50:31] I ran some load tests increasing the uvicorn workers for the model servers, e.g. `kserve.ModelServer(workers=2).start([model])`, and it scales better than our async implementation. Of course it is a different type of scaling since it requires more CPUs, but I was thinking it could be a good option to do both (multiprocessing and horizontal scaling) for model servers that have high latencies
[05:51:05] I'll post some results on Phabricator
[06:05:44] I was wondering if we had experimented with uvicorn multiprocessing in the past though, I haven't found anything
[06:32:15] hello folks!
[06:33:46] isaranto: o/ IIRC we tried the multiple workers before fastapi/0.10
[06:34:02] but it wasn't incredibly performant
[06:34:18] if we have good perf now we can definitely think about it for the revscoring models
[06:34:26] could be a good test case
[06:35:13] one thing that I'd need to do (and maybe we need kube-state-metrics etc. deployed in our cluster) is to figure out how we can visualize how many CPUs and how much memory are already used in the cluster
[06:35:47] it is not straightforward to get that now (there are some kubectl node commands but..) and having a clear view will tell us what to do in the future
[06:36:08] ack
[06:36:12] say for example that we want to increase workers from 1 to 4 on the goodfaith pods - do we have enough capacity, scaling capabilities included?
[06:36:22] at the moment I wouldn't be able to answer
[06:38:53] hmm, although throughput is increased I am seeing some socket errors which aren't happening with async
[06:39:09] need to check how socket errors are defined in wrk
[06:39:10] (PS5) Elukey: events: drop support for /mediawiki/revision/create#1.x events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930665 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[06:40:03] (CR) Ilias Sarantopoulos: [C: +2] revscoring: fix WIKI_URL validation [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/963959 (owner: Ilias Sarantopoulos)
[06:42:01] (CR) Ilias Sarantopoulos: [C: +1] events: drop support for /mediawiki/revision/create#1.x events [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/930665 (https://phabricator.wikimedia.org/T267648) (owner: DCausse)
[06:47:20] (Merged) jenkins-bot: revscoring: fix WIKI_URL validation [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/963959 (owner: Ilias Sarantopoulos)
[06:50:59] isaranto: https://github.com/kserve/kserve/blob/master/python/kserve/kserve/model_server.py#L173
[06:51:22] the implementation and comments suggest that the kserve workers are not super clean
[06:53:45] another thing that we could check is Ray workers
[06:53:53] we should have native support via annotations
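To make the multi-worker setup from the load tests above concrete, here is a minimal sketch assuming the kserve Python SDK; only the `ModelServer(workers=2).start([model])` call comes from the discussion, while the `DummyModel` class and names are hypothetical:

```python
# Minimal sketch, assuming the kserve Python SDK; DummyModel and its predict
# body are placeholders, not the actual revscoring model server.
from kserve import Model, ModelServer


class DummyModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.ready = True

    def predict(self, payload, headers=None):
        # CPU-bound scoring would run here; workers=2 makes uvicorn fork two
        # processes, so each worker holds its own copy of the loaded model.
        return {"predictions": []}


if __name__ == "__main__":
    ModelServer(workers=2).start([DummyModel("dummy-model")])
```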
[06:54:08] Aiko tested it a long time ago but we didn't know all the use cases etc. that we know now
[06:54:30] but IIRC it follows our MP model
[06:55:57] ok, I think we can open a set of tasks with things that we can explore over time, so that we don't spend all our time on it, but run some tests every now and then and draw conclusions over time
[06:56:08] I mean so that all the other stuff we are doing isn't blocked
[06:56:19] * isaranto running an errand bbl
[08:04:54] Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (elukey)
[08:28:35] Machine-Learning-Team: Add sha512 checksum files to all the ML's models in the public dir - https://phabricator.wikimedia.org/T347838 (elukey) a: elukey
[08:29:46] ah! the logging format, I totally forgot it!
[08:31:08] I'll work on that, elukey
[08:31:20] I have a patch I never submitted
[08:35:19] okok!
[08:35:23] going to run an errand, bbl
[10:18:59] Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (kevinbazira) Thanks @isarantopoulos and @Isaac for the suggestions! @elukey advised that we limit the recommendation-api resources to 2 uwsgi workers and 2 cpus. We implemented thi...
[10:42:17] quick review if anybody has time - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/964456
[10:48:11] anything to get me away from change 963683 :D
[10:52:38] * isaranto lunch!
[10:52:50] klausman: morning! What is change 963683?
[10:53:02] deleting ORES bits from Puppet
[10:53:31] I'm not planning on submitting that wholesale, but more like a grab bag that I can then derive smaller self-contained changes from
[10:53:53] And then once all those are done, I can use it to see if I missed anything
[10:54:20] With Puppet, making self-contained changes is tricky since everything references everything, and PCC must be run against a concrete machine.
[10:54:43] And then there's those bits where, after removal, some service needs to be restarted/deployed etc.
[10:57:25] Also, lunch
[10:57:45] you can also use Hosts: Auto with PCC
[10:57:51] that automatically selects etc..
[10:57:58] oh, neat, thanks!
[11:08:50] but I'd suggest cleaning up deployment-prep first
[11:08:54] and all the stuff in cloud
[11:08:59] so we avoid errors etc..
[11:23:53] aye. getting an overview first is the purpose of that change
[11:24:06] * elukey lunch!
[12:56:02] Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (elukey)
[12:58:36] Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (Isaac) I could see time-outs being related to resources but the memory usage of the app should be incredibly low so I wouldn't expect resources to be connected to other types of err...
[13:13:41] Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (elukey) @kevinbazira on Lift Wing we also have envoy proxy as a TLS terminator, which handles TLS traffic on port 8443 and proxies to 8080 (the port bound by uwsgi). There is p...
[13:19:30] Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (Isaac) > @Isaac do you reckon if we could use multi-threading instead of multiprocessing? Are those all HTTP-like calls (hence preemptable) or are we talking about cpu bound code? I...
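A hypothetical illustration of the multi-threading question above: for HTTP-like (I/O-bound, hence preemptable) calls the GIL is released while waiting on the network, so threads can overlap the waits inside a single process; CPU-bound scoring gets no benefit and needs multiprocessing. The URL and worker count below are illustrative only:

```python
# Illustrative only: threads help when the work is HTTP-like and preemptable,
# because the GIL is released during network waits; they do not help
# CPU-bound scoring, which is what the uvicorn workers address.
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical endpoints; any slow HTTP calls would behave the same way.
URLS = ["https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json"] * 4


def fetch(url: str) -> int:
    return requests.get(url, timeout=5).status_code


with ThreadPoolExecutor(max_workers=4) as pool:
    # The four requests overlap their network waits in a single process.
    print(list(pool.map(fetch, URLS)))
```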
[13:52:11] * isaranto short break
[14:18:50] ok, adding the new metrics is way more complicated than expected
[14:19:16] we already use a prometheus.io/scrape annotation, and IIUC from our prometheus scraping config, we allow only one of those per pod
[14:19:47] but the serviceops implementation of services also uses envoyproxy.io/scrape: true
[14:19:57] that is configured
[14:20:06] and we have kserve.prometheus.io
[14:21:03] so it should probably be a matter of adding support for it
[14:33:54] ah wow! https://github.com/kserve/kserve/blob/master/qpext/README.md
[14:37:19] ah, but it needs a new container
[14:37:20] mmmm
[14:40:21] 23
[14:49:07] 23?
[14:52:58] irssi window, my bad :)
[14:53:28] created https://gerrit.wikimedia.org/r/c/operations/puppet/+/964551
[14:53:32] hopefully it should work
[15:13:46] (PS1) Ilias Sarantopoulos: revscoring: add proper logging [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/964556
[15:16:27] (PS6) Ilias Sarantopoulos: revscoring: allow local runs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/963367 (https://phabricator.wikimedia.org/T347404)
[15:27:38] pls don't pay attention to the logging patch. I plan to have a new one rebased on top of the local runs patch
[15:28:31] if anyone has time to review this patch this week it would be great: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/963367
[15:29:45] elukey: I saw the patch in puppet but my review doesn't have anything to offer :(
[15:29:53] Machine-Learning-Team: Visualize KServe latency metrics in a dashboard - https://phabricator.wikimedia.org/T348456 (elukey) p: Triage→Medium
[15:29:55] it helps me to understand though, so I do check these
[15:30:17] isaranto: thanks! Don't worry, I'll need to wait for folks from serviceops
[15:30:28] I am learning as well, didn't know about these things
[15:31:16] Machine-Learning-Team, Patch-For-Review: Visualize KServe latency metrics in a dashboard - https://phabricator.wikimedia.org/T348456 (elukey) a: elukey
[15:31:34] Machine-Learning-Team, Epic: Add meaningful access logs to KServe's pods - https://phabricator.wikimedia.org/T333804 (elukey) a: elukey→None
[15:35:36] klausman: o/
[15:35:48] I think that we should split https://phabricator.wikimedia.org/T348144 into two, one for codfw and one for eqiad
[15:36:00] and at this point I think that the puppet part is done
[15:36:12] the clean up that you are doing shouldn't matter for dcops
[15:37:46] Machine-Learning-Team, decommission-hardware, ops-eqiad: decommission ores{1001..1009}.eqiad.wmnet - https://phabricator.wikimedia.org/T348144 (elukey)
[15:38:40] Machine-Learning-Team, decommission-hardware, ops-codfw: decommission ores{2001..2009}.codfw.wmnet - https://phabricator.wikimedia.org/T348462 (elukey)
[15:38:42] done :)
[15:40:02] Machine-Learning-Team, decommission-hardware, ops-codfw: decommission ores{2001..2009}.codfw.wmnet - https://phabricator.wikimedia.org/T348462 (elukey) The decom cookbook was run as part of T348144
[15:41:20] Machine-Learning-Team, Goal: Goal: Increase the number of models hosted on Lift Wing - https://phabricator.wikimedia.org/T348156 (elukey)
[15:41:36] Machine-Learning-Team, Goal: Goal: Decide on an optional Lift Wing caching strategy for model servers - https://phabricator.wikimedia.org/T348155 (elukey)
[15:41:49] Machine-Learning-Team, Goal: Goal: Users can query a large language model using the API Gateway and receive a response in a reasonable amount of time. - https://phabricator.wikimedia.org/T348154 (elukey)
[15:42:03] Machine-Learning-Team, Goal: Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153 (elukey)
[15:42:13] Machine-Learning-Team, Goal: Order 1 GPU for Lift Wing - https://phabricator.wikimedia.org/T341699 (elukey) a: calbon→elukey
[15:49:55] elukey: will do the split tomorrow :)
[15:50:00] hmm, logging now works!
[15:51:33] ah no, it is the same, but it works ok
[15:57:06] (PS1) Elukey: revert-risk: upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/964559 (https://phabricator.wikimedia.org/T347550)
[16:00:55] (PS1) Elukey: readability: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/964561
[16:03:42] 🔥
[16:04:17] (CR) CI reject: [V: -1] readability: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/964561 (owner: Elukey)
[16:04:32] ah snap
[16:06:42] ah lovely - readability 0.1.0 depends on numpy==1.23.0
[16:30:53] (CR) Elukey: "started https://gitlab.wikimedia.org/trokhymovych/readability-liftwing/-/merge_requests/1" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/964561 (owner: Elukey)
[16:31:09] going afk for today folks!
[16:31:15] have a nice rest of the day :)
[16:44:17] o/
[16:44:32] I'm lost in the logging world. will go afk as well in a bit!
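Since the logging work above spans several patches, here is a minimal sketch of the general shape such a change usually takes; the format string and logger name are assumptions, not the contents of the actual Gerrit patches:

```python
# A guess at the shape of the logging-format change, not the actual patch:
# kserve model servers use the stdlib logging module, so a basicConfig with
# an explicit format yields timestamped, levelled server logs.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

logger = logging.getLogger("revscoring-model-server")  # hypothetical name
logger.info("model server starting")
```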
[16:44:55] (PS1) Ilias Sarantopoulos: revscoring: kserve logs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/964568 (https://phabricator.wikimedia.org/T333804)
[17:00:55] Machine-Learning-Team: Investigate which model (or family of models) could be deployed - https://phabricator.wikimedia.org/T348468 (isarantopoulos)
[20:22:59] Hey folks! Thanks, everyone who offered to help with migrating to Lift Wing. We're running into one more problem: getting through CORS when making a LW request from client JS: https://github.com/WikiEducationFoundation/WikiEduDashboard/pull/5501#issuecomment-1753660940
[20:23:31] the usual trick for other Wikimedia APIs, of adding `origin=*`, doesn't seem to be working with LW.
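For reference, this is how the `origin=*` trick behaves against the MediaWiki Action API, sketched with Python's requests for clarity; CORS itself is enforced by the browser, so the sketch only shows the response header a browser would check, and whether the Lift Wing API Gateway honours the parameter is exactly the open question:

```python
# Sketch of the usual origin=* trick against the Action API (not Lift Wing):
# with origin=* the API answers anonymous cross-origin requests with an
# Access-Control-Allow-Origin: * header, which is what the browser verifies.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "meta": "siteinfo", "format": "json", "origin": "*"},
    headers={"Origin": "https://example.org"},  # simulate a browser request
)
print(resp.headers.get("access-control-allow-origin"))  # expected: "*"
```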