[12:07:17] hello folks
[12:07:20] :)
[12:07:48] We should discuss our plans for the Feature store(s); we have to review procurement tasks for dcops this week
[12:07:56] we currently planned:
[12:08:04] 1) 3 redis-like nodes in eqiad
[12:08:10] 2) 3 redis-like nodes in codfw
[12:08:29] 3) 2 nodes in eqiad (for the offline store, but that part is still in early stages, not even sure if we need those)
[12:09:02] we should probably create a task to get a better idea, at least for the online part
[12:09:18] for example testing feast
[12:12:20] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (elukey)
[14:59:33] Lift-Wing, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[14:59:35] Lift-Wing: Bootstrap the ml-serve-codfw cluster - https://phabricator.wikimedia.org/T294412 (elukey) Open→Resolved The codfw cluster is up and running!
[15:02:28] Lift-Wing, Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (elukey) ` elukey@ml-serve1004:~$ ps -eLf | grep python | awk '{ print $2" "$10" "$11}' | uniq -c 2 1404 python3 /usr/local/bin/prometheus-nic-saturation-exporter 1 493...
[16:10:58] o/
[16:12:04] accraze: o/
[16:12:41] I tried to check the kserve settings in the context of the load test, but there are some things I am not sure about yet
[16:13:01] I was reviewing https://github.com/kserve/kserve/commit/c10e6271897d7fd058f5618d5e0e70b31496f64c, which is what we run now, and it does seem to be working
[16:13:04] the logs say:
[16:13:06] 1 worker
[16:13:17] 5 asyncio workers (in a thread loop)
[16:13:49] the 1 worker seems to be the http main thread, which I suppose also handles the ioloop (?), meanwhile the other 5 are used for blocking code
[16:14:06] I ran a quick test earlier on
[16:14:06] https://grafana.wikimedia.org/d/Rvs1p4K7k/kserve?orgId=1
[16:14:31] siege with 50 concurrent clients, all requesting the same goodfaith rev etc..
[16:15:17] it is very easy to see latency climb to seconds, with only a few req/s
[16:15:35] and it kinda makes sense, we are bound by calls to the mw api at the current stage
[16:15:58] ^^ aha! yeah that would make sense
[16:16:21] we could try making the mw-api calls async?
[16:17:24] I have no idea how much control we have over what goes in a coroutine and what is considered "blocking"
[16:18:04] it is also worth noting that when we score something via the model we use cpu time
[16:18:24] and that is not, ideally, where the ioloop etc. shines
[16:18:55] I think that we should try to investigate what the gotchas are with the kserve architecture
[16:19:01] and then tune accordingly
[16:19:08] does it make sense?
[16:19:19] (brb)
[16:20:43] agreed, we should have a better understanding of the different gotchas we run into, maybe also start a doc about them somewhere
[16:30:38] hmm, seems like the weird pip bug on the isvc deployment pipelines is happening again (only for draftquality now)
[16:43:49] Morning all!
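On the "testing feast" idea for the online store: a minimal sketch of what an online feature lookup with the feast Python SDK could look like, assuming a feature repo whose feature_store.yaml is configured with a Redis online store (matching the redis-like nodes discussed above). The feature view name "revision_features", the entity "rev_id", and the feature names are hypothetical placeholders, not anything we have defined yet.

```python
# Minimal sketch of an online feature lookup with feast.
# Assumes an existing feature repo in the current directory whose
# feature_store.yaml points at a Redis online store; the feature view
# "revision_features" and the entity "rev_id" are hypothetical names
# used only for illustration.
from feast import FeatureStore

store = FeatureStore(repo_path=".")

online_features = store.get_online_features(
    features=[
        "revision_features:num_bytes",
        "revision_features:num_refs",
    ],
    entity_rows=[{"rev_id": 1143545}],
).to_dict()

print(online_features)
```

A test task along these lines would mostly be about measuring the latency of that lookup against the planned Redis nodes and deciding whether feast's registry/offline-store machinery is worth the operational overhead.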
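For the load-test task above (T296173): the quick test was run with siege, 50 concurrent clients all requesting the same goodfaith rev. If we want something scriptable to keep in the repo, a rough Python equivalent could look like the sketch below; the inference URL, payload, and revision id are placeholders, not the real endpoint.

```python
# Rough sketch of a concurrent load generator, similar in spirit to the
# siege run described above: N clients repeatedly POSTing the same
# payload and reporting per-request latency. URL and payload are
# placeholders for illustration only.
import asyncio
import time

import aiohttp

URL = "http://localhost:8080/v1/models/enwiki-goodfaith:predict"  # placeholder
PAYLOAD = {"rev_id": 1143545}  # placeholder revision id
CONCURRENCY = 50
REQUESTS_PER_CLIENT = 20


async def client(session: aiohttp.ClientSession, latencies: list) -> None:
    # Each simulated client sends its requests back to back, like siege does.
    for _ in range(REQUESTS_PER_CLIENT):
        start = time.monotonic()
        async with session.post(URL, json=PAYLOAD) as resp:
            await resp.read()
        latencies.append(time.monotonic() - start)


async def main() -> None:
    latencies: list = []
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(client(session, latencies) for _ in range(CONCURRENCY)))
    latencies.sort()
    print(f"requests: {len(latencies)}")
    print(f"p50: {latencies[len(latencies) // 2]:.3f}s")
    print(f"p99: {latencies[int(len(latencies) * 0.99)]:.3f}s")
    print(f"max: {latencies[-1]:.3f}s")


if __name__ == "__main__":
    asyncio.run(main())
```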
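On the "make the mw-api calls async" suggestion: a minimal sketch of fetching revision content from the MediaWiki API with aiohttp instead of a blocking HTTP client, so the ioloop can serve other requests while waiting on the network round trip. Whether kserve actually awaits such a coroutine in the preprocess/predict path (rather than pushing it onto the blocking worker pool) is exactly the gotcha to verify; the endpoint and revision ids here are illustrative.

```python
# Sketch of a non-blocking MediaWiki API call with aiohttp. While a
# request is waiting on the API round trip, the event loop can handle
# other requests instead of a worker blocking on I/O.
import asyncio

import aiohttp

MW_API = "https://en.wikipedia.org/w/api.php"  # endpoint assumed for the example


async def fetch_revision(session: aiohttp.ClientSession, rev_id: int) -> dict:
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": str(rev_id),
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
    }
    async with session.get(MW_API, params=params) as resp:
        resp.raise_for_status()
        return await resp.json()


async def main() -> None:
    # Two concurrent fetches; with a blocking client these would
    # serialize and hold a worker for the full round trip each time.
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            fetch_revision(session, 1143545),  # illustrative rev ids
            fetch_revision(session, 1143546),
        )
    print([list(r.get("query", {}).get("pages", {}).keys()) for r in results])


if __name__ == "__main__":
    asyncio.run(main())
```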
[22:09:30] Lift-Wing, artificial-intelligence, editquality-modeling, revscoring, Machine-Learning-Team (Active Tasks): Create migration plan for editquality models from ORES to Lift Wing - https://phabricator.wikimedia.org/T284689 (ACraze)
[22:09:32] Lift-Wing, Machine-Learning-Team (Active Tasks): Prepare 4 ORES English models for Lift Wing - https://phabricator.wikimedia.org/T272874 (ACraze)
[22:09:50] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (ACraze) Open→Resolved Closing out this task. We have production images for predictors across all revscoring classes (editquality, articl...