[12:07:17] hello folks
[12:07:20] :)
[12:07:48] We should discuss our plans for the Feature store(s); we have to review procurement tasks for dcops this week
[12:07:56] we currently planned:
[12:08:04] 1) 3 redis-like nodes in eqiad
[12:08:10] 2) 3 redis-like nodes in codfw
[12:08:29] 3) 2 nodes in eqiad (for the offline store, but that part is still in early stages, not even sure if we need those)
[12:09:02] we should probably create a task to get a better idea, at least for the online part
[12:09:18] for example testing feast
[12:12:20] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (elukey)
[14:59:33] Lift-Wing, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[14:59:35] Lift-Wing: Bootstrap the ml-serve-codfw cluster - https://phabricator.wikimedia.org/T294412 (elukey) Open→Resolved The codfw cluster is up and running!
[15:02:28] Lift-Wing, Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (elukey) ` elukey@ml-serve1004:~$ ps -eLf | grep python | awk '{ print $2" "$10" "$11}' | uniq -c 2 1404 python3 /usr/local/bin/prometheus-nic-saturation-exporter 1 493...
[16:10:58] o/
[16:12:04] accraze: o/
[16:12:41] I tried to check the kserve settings in the context of the load test, but there are some things I am not sure about yet
[16:13:01] I was reviewing https://github.com/kserve/kserve/commit/c10e6271897d7fd058f5618d5e0e70b31496f64c, which is what we run now, and it does seem to be working
[16:13:04] the logs say:
[16:13:06] 1 worker
[16:13:17] 5 asyncio workers (in a thread loop)
[16:13:49] the 1 worker seems to be the http main thread, which I suppose also handles the ioloop (?), meanwhile the other 5 are used for blocking code
[16:14:06] I ran a quick test earlier on
[16:14:06] https://grafana.wikimedia.org/d/Rvs1p4K7k/kserve?orgId=1
[16:14:31] siege with 50 concurrent clients, all requesting the same goodfaith rev etc..
[16:15:17] it is very easy to see latency climb to seconds, with only a few req/s
[16:15:35] and it kinda makes sense, we are bound by calls to the mw api at the current stage
[16:15:58] ^^ aha! yeah that would make sense
[16:16:21] we could try making the mw-api calls async?
[16:17:24] I have no idea how much control we have over what goes in a coroutine and what is considered "blocking"
[16:18:04] it is also worth noting that when we score something via the model we use cpu time
[16:18:24] and that is not, ideally, where the ioloop etc. shines
[16:18:55] I think that we should try to investigate what the gotchas are with the kserve architecture
[16:19:01] and then tune accordingly
[16:19:08] does it make sense?
[16:19:19] (brb)
[16:20:43] agreed, we should have a better understanding of the different gotchas we run into, maybe also start a doc about them somewhere
[16:30:38] hmm, seems like the weird pip bug on the isvc deployment pipelines is happening again (only for draftquality now)
[16:43:49] Morning all!
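On the "testing feast" idea for the online store: a minimal sketch of what an online feature lookup with the feast Python SDK could look like, assuming a feature repo whose feature_store.yaml is configured with a Redis online store (matching the redis-like nodes discussed above). The feature view name "revision_features", the entity "rev_id", and the feature names are hypothetical placeholders, not anything we have defined yet.

```python
# Minimal sketch of an online feature lookup with feast.
# Assumes an existing feature repo in the current directory whose
# feature_store.yaml points at a Redis online store; the feature view
# "revision_features" and the entity "rev_id" are hypothetical names
# used only for illustration.
from feast import FeatureStore

store = FeatureStore(repo_path=".")

online_features = store.get_online_features(
    features=[
        "revision_features:num_bytes",
        "revision_features:num_refs",
    ],
    entity_rows=[{"rev_id": 1143545}],
).to_dict()

print(online_features)
```

A test task along these lines would mostly be about measuring the latency of that lookup against the planned Redis nodes and deciding whether feast's registry/offline-store machinery is worth the operational overhead.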
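For the load-test task above (T296173): the quick test was run with siege, 50 concurrent clients all requesting the same goodfaith rev. If we want something scriptable to keep in the repo, a rough Python equivalent could look like the sketch below; the inference URL, payload, and revision id are placeholders, not the real endpoint.

```python
# Rough sketch of a concurrent load generator, similar in spirit to the
# siege run described above: N clients repeatedly POSTing the same
# payload and reporting per-request latency. URL and payload are
# placeholders for illustration only.
import asyncio
import time

import aiohttp

URL = "http://localhost:8080/v1/models/enwiki-goodfaith:predict"  # placeholder
PAYLOAD = {"rev_id": 1143545}  # placeholder revision id
CONCURRENCY = 50
REQUESTS_PER_CLIENT = 20


async def client(session: aiohttp.ClientSession, latencies: list) -> None:
    # Each simulated client sends its requests back to back, like siege does.
    for _ in range(REQUESTS_PER_CLIENT):
        start = time.monotonic()
        async with session.post(URL, json=PAYLOAD) as resp:
            await resp.read()
        latencies.append(time.monotonic() - start)


async def main() -> None:
    latencies: list = []
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(client(session, latencies) for _ in range(CONCURRENCY)))
    latencies.sort()
    print(f"requests: {len(latencies)}")
    print(f"p50: {latencies[len(latencies) // 2]:.3f}s")
    print(f"p99: {latencies[int(len(latencies) * 0.99)]:.3f}s")
    print(f"max: {latencies[-1]:.3f}s")


if __name__ == "__main__":
    asyncio.run(main())
```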
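On the "make the mw-api calls async" suggestion: a minimal sketch of fetching revision content from the MediaWiki API with aiohttp instead of a blocking HTTP client, so the ioloop can serve other requests while waiting on the network round trip. Whether kserve actually awaits such a coroutine in the preprocess/predict path (rather than pushing it onto the blocking worker pool) is exactly the gotcha to verify; the endpoint and revision ids here are illustrative.

```python
# Sketch of a non-blocking MediaWiki API call with aiohttp. While a
# request is waiting on the API round trip, the event loop can handle
# other requests instead of a worker blocking on I/O.
import asyncio

import aiohttp

MW_API = "https://en.wikipedia.org/w/api.php"  # endpoint assumed for the example


async def fetch_revision(session: aiohttp.ClientSession, rev_id: int) -> dict:
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": str(rev_id),
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
    }
    async with session.get(MW_API, params=params) as resp:
        resp.raise_for_status()
        return await resp.json()


async def main() -> None:
    # Two concurrent fetches; with a blocking client these would
    # serialize and hold a worker for the full round trip each time.
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            fetch_revision(session, 1143545),  # illustrative rev ids
            fetch_revision(session, 1143546),
        )
    print([list(r.get("query", {}).get("pages", {}).keys()) for r in results])


if __name__ == "__main__":
    asyncio.run(main())
```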
[22:09:30] Lift-Wing, artificial-intelligence, editquality-modeling, revscoring, Machine-Learning-Team (Active Tasks): Create migration plan for editquality models from ORES to Lift Wing - https://phabricator.wikimedia.org/T284689 (ACraze)
[22:09:32] Lift-Wing, Machine-Learning-Team (Active Tasks): Prepare 4 ORES English models for Lift Wing - https://phabricator.wikimedia.org/T272874 (ACraze)
[22:09:50] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (ACraze) Open→Resolved Closing out this task. We have production images for predictors across all revscoring classes (editquality, articl...