[06:50:52] hello folks
[07:49:27] 10Machine-Learning-Team: Add new syntax directive to blubber.yaml files to enable users to directly use docker build with blubber.yam. - https://phabricator.wikimedia.org/T322006 (10kevinbazira)
[07:52:18] 10Machine-Learning-Team: Add new syntax directive to blubber.yaml files to enable users to directly use docker build with blubber.yaml. - https://phabricator.wikimedia.org/T322006 (10kevinbazira)
[09:02:46] (03PS1) 10Elukey: Refactor multi-processing code in revscoring model servers [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/850986 (https://phabricator.wikimedia.org/T320374)
[09:11:29] Morning!
[09:11:35] good morning :)
[09:12:07] If you want brainbounce or review help with the MP stuff, ping me :)
[09:13:08] sure
[09:13:26] the idea is to reduce the footprint of MP for the moment, the performance is not great
[09:13:50] I want to add some metrics about timings of running a function inside the process pool
[09:13:55] So working code we have, but it's too intertwined with the non-MP code?
[09:14:19] I suspect that we may incur an overhead that is bigger than the benefit in our case (or maybe only in revscoring's case)
[09:14:39] Does the current setup pre-fork processes?
[09:14:41] the code+data needs to be pickled when passed to the external process etc..
[09:15:03] it creates them upon first usage afaics from the container
[09:15:19] It might be worth it to look at something that is faster for (de)serializing if that step turns out to be the expensive bit.
[09:16:21] I am not doing the pickle, it is automagically done by the run_in_executor bits
[09:16:33] see https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor
[09:16:43] mh.
[09:17:06] it is nice, the programming overhead is basically zero
[09:17:26] but I found some weird corner cases, like this one
[09:17:27] concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore
[09:17:48] I was able to repro, if a process in the pool dies then the whole pool needs to be re-created
[09:18:04] (I simply killed the process inside the container with nsenter on my laptop)
[09:18:10] yeah, using processes instead of threads might be too heavyweight if we want to do it in the context of asyncio
[09:18:41] the main issue is that threads don't help with cpu-bound code
[09:18:50] My past use cases were all using just the stdlib multiprocessing module, and I could get away with no pickling etc.
[09:18:52] since they use the GIL, so only one can run at the same time
[09:19:06] with Python?
[09:19:17] Could we separate the CPU stuff into processes, and the rest into threads?
[09:19:20] Yes, Python
[09:20:46] The simplest thing I had was a massively-parallel ping, that used multiprocessing to run many ping binaries at the same time, with surprisingly little overhead. Running 1000 pings in parallel on a mediocre machine as of ca. 2012 was fine.
[09:21:32] (when you're rebooting O(20k) machines as part of maintenance, you need a quick way to check if everything is back :))
[09:21:47] yes but ping is a little different from what we are doing now :)
[09:22:12] about the CPU stuff vs threads
[09:22:29] asyncio already offers a thread pool, but it is for blocking I/O code basically
[09:22:56] And I suspect we can't hybridize?
[09:22:56] the run_in_executor on a process pool is meant to run only the cpu-bound code of model score
[09:23:07] no no we are already doing it
[09:23:14] Ah, I see.
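
(A minimal sketch of the run_in_executor pattern discussed above: offloading a CPU-bound scoring call onto a process pool from asyncio code. The function and variable names are placeholders rather than the actual inference-services code; the arguments and return value are pickled when they cross the process boundary, which is the overhead being debated here.)

import asyncio
from concurrent.futures import ProcessPoolExecutor

# Placeholder for the CPU-bound scoring step; must live at module level
# so it can be pickled and sent to the worker process.
def score(feature_values):
    return sum(feature_values)  # stand-in for model.score(...)

process_pool = ProcessPoolExecutor(max_workers=1)

async def predict(feature_values):
    loop = asyncio.get_running_loop()
    # run_in_executor pickles the callable and its arguments, runs them in a
    # worker process, and gives back an awaitable with the (pickled) result.
    return await loop.run_in_executor(process_pool, score, feature_values)

if __name__ == "__main__":
    print(asyncio.run(predict([0.1, 0.2, 0.3])))
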
[09:23:17] but so far the performance is not great afaics
[09:23:46] How hard would it be to profile the whole thing using something like pprof?
[09:24:05] Just so we can be sure where the time is spent
[09:24:45] I am not an expert in profiling python code, the simple thing that I added to the code is a decorator that logs time spent
[09:24:54] should be easy to use
[09:25:01] yup
[09:25:16] on my laptop the time is dominated by preprocess, little time is spent on process (like 200ms)
[09:25:48] so I want to go back to a little testbed only, with a way to turn on/off MP when the pod starts up
[09:25:54] and make more precise tests
[09:26:02] If the picture is still muddy, we can look at cProfile
[09:26:19] maybe running cpu-bound code of revscoring is fine on asyncio, compared to the overhead of MP
[09:26:31] ack.
[09:26:39] https://kserve.github.io/website/master/modelserving/v1beta1/custom/custom_model/#parallel-inference is also another approach
[09:26:50] with Ray workers, Aiko looked into it a while ago
[09:27:09] but the overhead is even bigger - http servers, grpc servers, etc..
[09:27:38] It would avoid the GIL, though
[09:27:49] sure, like MP
[09:28:11] I think Ray workers are something we _might_ try if we run out of easier options.
[09:28:17] my aim is to provide a generic way in the code to test/add MP if needed, with little overhead
[09:28:26] But first, we need to be sure where we spend time currently.
[09:28:39] Yeah, making it modular is a good idea either way
[09:29:00] In the future, there might be more options, and being able to test them quickly will be beneficial
[09:41:26] I think that we have a better understanding of asyncio+tornado at the moment, but it is still too new to figure out what's best
[09:43:51] (03PS2) 10Elukey: Refactor multi-processing code in revscoring model servers [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/850986 (https://phabricator.wikimedia.org/T320374)
[09:44:08] also handled the BrokenProcessPool exception that I was talking about before
[10:32:48] (03CR) 10Klausman: "Two small things, otherwise LGTM" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/850986 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[11:08:13] (03CR) 10Elukey: "Thanks for the review!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/850986 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[11:13:00] early lunch! ttl
[11:31:06] (03CR) 10Klausman: Refactor multi-processing code in revscoring model servers (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/850986 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[11:45:17] Heading for lunch as well
[11:50:11] 10Machine-Learning-Team, 10Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (10achou) Something interesting I found when I ran benthos for outlink isvc. In the kserve logs, I found sudden spikes similar to what Luca observed in the revscoring models in the outl...
[11:57:05] elukey: ---^ some observations for outlink model
[12:34:05] 10Lift-Wing, 10Machine-Learning-Team, 10Patch-For-Review: Deploy revert-risk-model to production - https://phabricator.wikimedia.org/T321594 (10diego) Yeah! Thanks @achou ! Please, can you write here an example of how to hit the endpoint ?
[13:03:52] Morning all!
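
(A rough sketch of the BrokenProcessPool handling mentioned at 09:44: once a worker dies the whole executor is unusable, so the call is retried on a freshly created pool. The helper names run_in_process_pool and refresh_process_pool are illustrative assumptions, not necessarily what the patch actually uses.)

import asyncio
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

process_pool = ProcessPoolExecutor(max_workers=1)

def refresh_process_pool():
    # Once a worker dies the whole pool is broken; discard it and
    # build a new one so subsequent requests can still be served.
    global process_pool
    process_pool.shutdown(wait=False)
    process_pool = ProcessPoolExecutor(max_workers=1)

async def run_in_process_pool(func, *args):
    loop = asyncio.get_running_loop()
    try:
        return await loop.run_in_executor(process_pool, func, *args)
    except BrokenProcessPool:
        # A child process terminated abruptly (e.g. it was killed):
        # recreate the pool and retry the call once.
        refresh_process_pool()
        return await loop.run_in_executor(process_pool, func, *args)
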
[13:41:11] (03CR) 10AikoChou: Refactor multi-processing code in revscoring model servers (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/850986 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[13:48:12] o/ morning Chris :)
[14:19:28] hello hello
[14:20:15] aiko: nice!
[14:20:40] so the outlink cpu time is very little afaics, really nice
[14:20:58] (03CR) 10Elukey: Refactor multi-processing code in revscoring model servers (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/850986 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[14:21:43] (03PS3) 10Elukey: Refactor multi-processing code in revscoring model servers [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/850986 (https://phabricator.wikimedia.org/T320374)
[14:25:07] (03CR) 10AikoChou: [C: 03+1] Refactor multi-processing code in revscoring model servers [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/850986 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[14:26:18] 10Machine-Learning-Team, 10Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (10elukey) >>! In T320374#8356201, @achou wrote: > I'm not sure what's causing the 400 for these requests as there aren't any other logs and error messages. I'm wondering if I could set t...
[14:26:29] aiko: answered in the task, lemme know your thoughts when you have time :)
[14:34:09] klausman: deployed coredns 1.8.7 to staging, so far all good :)
[14:37:51] fingers crossed :)
[14:38:17] I should've +1'd that change earlier, apologies
[14:39:30] oh, btw everyone: tomorrow's a public holiday in (parts of) Switzerland, so I'll be out
[14:40:42] me too!
[14:42:06] klausman: the metrics changed their names, but the rest looks good! Going to fix the dns dashboard to support both. In theory this is the version we'll run with k8s 1.23
[14:44:42] ack.
[14:47:05] (03CR) 10Klausman: [C: 03+1] Refactor multi-processing code in revscoring model servers (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/850986 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[14:54:25] one weird thing is a metric that I don't see anymore, cache_size
[14:57:05] (03CR) 10Elukey: [C: 03+2] Refactor multi-processing code in revscoring model servers [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/850986 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[15:01:58] (03Merged) 10jenkins-bot: Refactor multi-processing code in revscoring model servers [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/850986 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey)
[15:18:31] all good, coredns 1.8.7 seems to be working nicely in staging :)
[15:18:42] I'll wait a few days and then I'll move to prod.
[15:19:06] ack. Any visible changes in performance?
[15:20:19] so far everything seems in line, maybe a slight improvement in cpu usage, but not that much
[15:25:30] how is it going with postgres??
[15:26:07] Going well, just fighting a bit with the authentication system in Postgres. Nothing I won't figure out by tomorrow
[15:27:06] ack lemme know if you need help!
[15:29:08] klausman: let's also document all these steps on wikitech somewhere, I hope we won't need to recreate wikilabels but who knows
[15:29:16] ack
[15:31:00] Absolutely random question: Originally Lift Wing was built outside the Analytics VLAN for a variety of reasons, but one was that we wanted to separate our data lake from being accessible from the internet like Lift Wing would be.
[15:31:00] Is that still true? I am just thinking about the feature store and where we would put it (analytics VLAN or production network)
[15:32:47] yep yep, the kubernetes clusters' ip spaces are completely separate from the analytics vlan, meanwhile the feature store may or may not be inside the Analytics VLAN (it really depends what we choose etc..)
[15:32:59] I think the separation still makes sense from a security POV. Things like the FS are inherently less "PII-like" than a lot of the data visible in the Analytics VLAN. If a model were to be compromised, being in the Analytics VLAN is a lot more worrying than seeing only the FS
[15:33:09] or maybe only part of it, the online feature store clusters will probably not be in the analytics vlan for example
[15:42:44] Okay cool thanks
[16:17:48] veeery interesting (testing locally with the new decorators)
[16:18:06] [I 221031 16:17:55 decorators:13] Function get_revscoring_extractor_cache took 1.0602 seconds to execute.
[16:18:09] [I 221031 16:17:57 decorators:25] Function fetch_features took 1.6160 seconds to execute.
[16:18:12] [I 221031 16:17:57 decorators:13] Function run_in_process_pool took 0.1060 seconds to execute.
[16:18:15] [I 221031 16:17:57 web:2243] 200 POST /v1/models/enwiki-goodfaith:predict (172.17.0.1) 2785.35ms
[16:18:18] this is editquality goodfaith
[16:18:20] aiko: --^
[16:18:33] now that I think about it, fetch_features is basically cpu-bound in our case
[16:21:41] why it takes 1 second on my laptop to run that bit is obscure to me
[16:24:41] Is that 1s with a cache? and if so, is it hit/miss?
[16:25:39] the cache is all hits in theory, we pass it after populating it in the get_revscoring_extractor_cache function
[16:28:56] elukey: wow that's interesting. I didn't expect fetch_features (which uses extractor.extract under the hood) to need 1s. You're right, maybe fetch_features is cpu-bound.
[16:32:08] I love revscoring
[16:32:27] "Never a boring day"
[16:32:44] aiko: how long before the RRR model is ready? :D :D :D
[16:38:42] elukey: the RRR model is basically ready. I'm now writing a section in our documentation about how to hit the discovery endpoint.
[16:40:27] aiko: let's replace ORES models then :D
[16:41:45] elukey: lol
[16:49:24] aiko, klausman - I was able to set debug logging for revscoring and I don't see calls to the mw api for fetch_features (that calls extractor.extract()), so it seems all cpu time
[16:50:30] I need to find a good way to set debug logging from env variables, I thought KSERVE_LOGLEVEL worked but apparently not
[16:50:38] (or at least I can't really make it work)
[16:50:53] it would be nice to restart a pod with a flag and get debug logging
[16:53:21] 10Machine-Learning-Team, 10Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (10elukey) Very interesting - I added more decorators to the code and after testing locally this is the result for enwiki-goodfaith: ` [I 221031 16:26:34 decorators:13] Function get_revs...
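
(For reference, a minimal version of the kind of timing decorator that produces the "Function ... took ... seconds to execute." log lines above; the real decorator in the inference-services repo may differ in details, and fetch_features below is only a stand-in for the real feature extraction.)

import functools
import logging
import time

def elapsed_time(func):
    """Log how long the wrapped (synchronous) function takes to run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        logging.info(
            "Function %s took %.4f seconds to execute.",
            func.__name__, time.perf_counter() - start,
        )
        return result
    return wrapper

# Example usage (INFO level enabled so the timing lines are visible):
logging.basicConfig(level=logging.INFO)

@elapsed_time
def fetch_features(rev_id):
    return [0.1] * 100  # stand-in for the real extractor.extract() call

fetch_features(12345)
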
[16:53:40] (debug logging on revscoring is too verbose, but having the decorators always in place to show function timings is probably nice)
[16:53:55] anyway, going afk! Talk with you on Wed folks!
[16:54:01] have a good rest of the day
[17:48:16] https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Usage
[17:48:33] I added a section about how to use the internal endpoint
[17:49:13] although I'm not sure if it's a good idea to put the section there.
[18:08:29] 10Lift-Wing, 10Machine-Learning-Team, 10Patch-For-Review: Deploy revert-risk-model to production - https://phabricator.wikimedia.org/T321594 (10achou) @diego I added a section https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Usage in our documentation about how to access inference services inter...
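
(Purely illustrative sketch of hitting a model server's :predict endpoint from Python. The URL, model name, and payload are placeholders based on the local enwiki-goodfaith test earlier in the log; the actual internal hostnames, ports, and any required headers are the ones documented in the wikitech Usage section linked above.)

import requests

# Placeholder endpoint and model name; substitute the internal endpoint
# documented at https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Usage
url = "http://localhost:8080/v1/models/enwiki-goodfaith:predict"
payload = {"rev_id": 12345}  # example request body; the revision id to score

response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()
print(response.json())
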