[06:26:37] hello folks
[06:26:48] need to run some errands this morning, will join a little later
[06:27:15] klausman: o/ later on let's sync about https://phabricator.wikimedia.org/T321310, we can split the reboots
[08:08:51] back!
[08:08:59] going to start with rebooting staging
[08:15:55] btullis: ah I just realized that ml-staging is not supported in the cookbook, thanks for the pointer :)
[08:30:25] elukey: https://bugs.python.org/issue24882 I wonder if there is any chance that the latency problem you mentioned - sometimes p99 for api-ro.discovery.wmnet goes up to seconds - is caused by this ThreadPoolExecutor bug. It was resolved in python 3.8 but we're using python 3.7
[08:31:10] elukey: I'm trying to write a simple test to verify it
[08:31:15] aiko: o/
[08:31:40] I don't think that we use the thread pool executor at all
[08:31:57] at least with our current async code, we should be using only the ioloop single thread
[08:32:43] I think that those threads are used only if you explicitly run_in_executor()
[08:33:01] (see the code about the blog post that we were discussing the other day)
[08:34:00] but doesn't kserve use it here? https://github.com/kserve/kserve/blob/release-0.8/python/kserve/kserve/model_server.py#L130
[08:35:46] aiko: my understanding is that it simply sets the pool as default executor, but you have to use it
[08:36:07] see https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor
[08:37:23] IIUC the idea is that you can run in that thread pool blocking I/O code (in our case, blocking http requests that give up the GIL easily etc..)
[08:38:01] if you don't run a function with loop.run_in_executor() then the code is handled by the single thread ioloop
[08:38:09] in an async fashion
[08:38:56] klausman, btullis - https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/845430
[08:41:01] +1'd
[08:41:17] thanks :)
[08:41:24] I started ml-serve-codfw in the meantime
[08:41:47] I can do the staging ones, unless someone is working on stuff there
[08:45:48] we need to merge the patch before
[08:45:56] Sure :)
[08:46:43] I'll do the control plane for staging by hand since that needs doing anyway
[08:53:43] elukey: the functions like blocking_io and cpu_bound in the example are not async functions. I think that is a workaround/low-level approach (not changing the function itself) to make blocking functions run asynchronously. In kserve model server we do use async preprocess and async http calls and asyncio.gather these high-level asyncio functions
[08:54:02] elukey: but maybe what you said is right..
[08:54:44] aiko: yep yep I am aware, but the async functions should natively run on the main asyncio thread
[08:54:54] not on the thread pool
[08:59:48] elukey: yeah that makes sense
[09:00:00] aiko: it is the only explanation that makes sense in my head
[09:02:13] aiko: what is the avg latency of kserve responses when you test on minikube?
[09:02:35] because in local testing I see something like 2s, so most of the time is dominated by http calls waiting
[09:02:56] and I don't see benefits while using the process pool (but I verified, the extra processes are created)
[09:03:13] I am wondering if we'd see good results on our k8s clusters though
[09:03:33] something is not right, I feel I am missing something
[09:04:01] Ok, staging-ctrl done. I'll do serve-ctrl in eqiad as well.
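(editor's note on the 08:31-08:54 exchange) The point that coroutines stay on the event loop thread, while only functions passed to run_in_executor() land in the pool, can be checked with a small standalone sketch. This is not kserve code: async_preprocess is a hypothetical stand-in, blocking_io borrows the name from the Python docs example mentioned above, and the set_default_executor() call only mirrors the understanding voiced at 08:35 of what the linked kserve line does.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


def blocking_io():
    # Plain (non-async) function: stands in for a blocking call.
    return threading.current_thread().name


async def async_preprocess():
    # Hypothetical coroutine: scheduled on the event loop itself.
    await asyncio.sleep(0)
    return threading.current_thread().name


async def main():
    loop = asyncio.get_running_loop()
    # Assumption: installing a default executor, as discussed at 08:35.
    # Nothing runs in it unless run_in_executor() is called explicitly.
    loop.set_default_executor(
        ThreadPoolExecutor(max_workers=2, thread_name_prefix="pool")
    )
    coro_threads = await asyncio.gather(async_preprocess(), async_preprocess())
    pool_thread = await loop.run_in_executor(None, blocking_io)
    return threading.current_thread().name, coro_threads, pool_thread


loop_thread, coro_threads, pool_thread = asyncio.run(main())
print("event loop thread:", loop_thread)
print("coroutines ran on:", coro_threads)
print("blocking_io ran on:", pool_thread)
```

Both gathered coroutines report the same thread as the loop, and only blocking_io reports a pool thread, which matches the "you have to use it" reading of the default executor.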
[09:04:19] ack thanks
[09:04:31] I am waiting for Ben to +1 the change before merging
[09:05:02] elukey: avg latency is around 1s
[09:06:07] oops, accidentally rebooting one of the ctrl vms for serve-codfw :-|
[09:06:19] elukey: but it's almost the same as without process pool
[09:06:23] Oh well, it shouldn't break anything
[09:06:41] klausman: nono please go ahead
[09:07:04] I just didn't want to reboot 2002 while you reboot 2001 :)
[09:07:13] aiko: yeah it makes sense, the time is dominated by the http calls, so the use of the process pool for the model.score is negligible
[09:07:56] klausman: you can do the other ctrl as well, I am only doing workers
[09:08:04] okey-dokey
[09:13:32] aiko,klausman remember to fill the standup on slack :)
[09:13:42] oh oops, yeah
[09:16:21] ah thanks!
[09:23:05] klausman: rebooting also the staging worker nodes
[09:23:34] One of the VMs failed the status check in icinga
[09:23:48] ifup@ens13.service didn't come up correctly, though it _seems_ the machine is fine
[09:24:34] ml-serve-ctrl2001
[09:25:29] klausman: yeah it is a known issue https://phabricator.wikimedia.org/T273026
[09:27:00] Alright, did the flush thing
[09:27:34] All ctrl nodes done, will update the ticket
[09:30:20] super
[09:38:25] klausman: elukey: Shall I kick off the reboot-nodes cookbook for dse-k8s then, or would one of you like to?
[09:41:40] btullis: you can go ahead!
[09:41:52] Ack, thanks.
[09:42:00] elukey: if you want, I can do the workers for ml-serve-eqiad
[09:42:16] klausman: go ahead
[09:43:31] ok, running
[09:49:34] https://leimao.github.io/blog/Python-Concurrency-High-Level/ illustrates well multiprocessing vs threading vs asyncio in python
[09:52:20] that note at the end of the article about htop/top is just wrong tho. htop just defaults to showing all threads in a process, whereas top never does.
[09:52:40] (thread display can be turned off in htop's options/settings)
[09:55:15] aiko: that more or less follows what we were saying before right?
[10:01:46] elukey: yeah I'm not sure, but I want to understand more about how tornado works behind kserve
[10:02:11] aiko: IIUC the async part is 1:1 asyncio nowadays
[10:02:19] it was different in the past, but they merged the behavior
[10:03:48] elukey: what do you mean by 1:1 asyncio?
[10:11:20] Machine-Learning-Team, ORES, Phabricator: Investigate usage of word2vec Debian package - https://phabricator.wikimedia.org/T321383 (hashar)
[10:17:40] after reading the article, I feel we don't need to use multiprocessing to solve our problem, threading or asyncio should be enough if the bottleneck is http calls waiting.
[10:18:04] aiko: we don't know if it is http only, this is the main issue
[10:18:45] if the async thread loop blocks when running model.score, everything else blocks, including http calls
[10:19:29] and not only the ones related to a single kserve request, all the ones landing on the model server block, since they are processed by the same asyncio thread
[10:19:43] this is why I am saying that blocking code shouldn't be run on the asyncio thread
[10:20:10] blocking the http calls may delay their overall execution, causing timeouts, etc..
[10:20:15] lemme know if it makes sense
[10:21:50] elukey: it makes sense. have you tried running on a thread pool?
[10:22:46] aiko: I tried with the run_in_executor() method, but no real change
[10:23:40] aiko: in my head with all the work that you did for AsyncSession etc.. we should be able to scale to hundreds of HTTP calls without issues (unless something blocks them repeatedly)
[10:25:48] elukey: yep it works well for the outlink model
[10:25:55] ok I made a test in our model server code, the process pool effectively works, I see functions being run in separate processes
[10:26:14] (I printed the process pids via logging)
[10:29:42] aiko: I fear that even the outlink model may suffer from this, under certain load
[10:29:48] it is blocking code as ell
[10:29:50] *well
[10:33:49] going afk for lunch, ttl!
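(editor's note on the 10:18-10:26 exchange) The concern that a blocking model.score on the asyncio thread stalls every in-flight http call can be reproduced in isolation. The sketch below is only an illustration under stated assumptions: score() is a hypothetical stand-in that blocks with time.sleep, fake_http_call() stands in for an async request, and a thread pool is used to keep the example portable. For CPU-bound scoring, a ProcessPoolExecutor can be passed to run_in_executor() in the same position.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def score(duration=0.3):
    # Hypothetical stand-in for a blocking model.score().
    time.sleep(duration)
    return "scored"


async def fake_http_call(delay=0.05):
    # Stand-in for an async HTTP request; returns how long it really took.
    start = time.perf_counter()
    await asyncio.sleep(delay)
    return time.perf_counter() - start


async def score_on_loop():
    # Blocking call made directly on the event loop thread.
    return score()


async def score_in_pool(executor):
    # Same call, handed off to an executor so the loop stays free.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, score)


async def measure(scoring_coro):
    # Start the HTTP-like task, then score, and report the HTTP task's duration.
    http_task = asyncio.ensure_future(fake_http_call())
    await asyncio.sleep(0)  # let http_task start its timer first
    await scoring_coro
    return await http_task


async def main():
    blocked = await measure(score_on_loop())
    with ThreadPoolExecutor(max_workers=1) as pool:
        offloaded = await measure(score_in_pool(pool))
    return blocked, offloaded


blocked, offloaded = asyncio.run(main())
print(f"http call with score() on the loop: {blocked:.2f}s")
print(f"http call with score() in executor: {offloaded:.2f}s")
```

In the first measurement the 50 ms "request" cannot complete until score() returns, since both share the loop thread. In the second it finishes independently of the scoring call.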
[10:42:20] 4/8 workers done in eqiad :)
[10:42:27] also, lunch
[10:44:15] FYI the cookbook still failed for dse-k8s :(
[10:44:20] https://www.irccloud.com/pastebin/x4dVvTOU/
[10:45:22] I think it must be related to LVS, but haven't investigated too deeply yet.
[10:48:47] Machine-Learning-Team, ORES, Phabricator: Investigate usage of word2vec Debian package - https://phabricator.wikimedia.org/T321383 (awight) @hashar +1 I don't think this was ever used, please feel free to delete it. If I'm wrong, it's easy to recreate from the (now out-of-date) upstream.
[10:49:33] Hmm. host being still depooled is not something I have seen before
[10:50:08] Normally, after the daemonset messages, there is a 35s sleep and then downtime is scheduled
[10:54:28] Lift-Wing, Documentation, Machine-Learning-Team (Active Tasks): Improve Lift Wing documentation - https://phabricator.wikimedia.org/T316098 (Miriam) HI @AikoChou this is wonderful wonderful, thank you so much! Most of my feedback was included already in Isaac's comments! Just a few more suggestions:...
[12:07:14] eqiad all done
[13:00:18] happy friday all!
[13:14:24] o/
[13:14:26] btullis: weird!
[13:15:07] 8
[13:15:09] uff
[13:18:46] klausman: I see all nodes with 5.10.149-1 from cumin, our side should be done
[13:20:01] btullis: ah yes it should be kubesvc
[13:22:33] btullis, klausman - https://gerrit.wikimedia.org/r/c/operations/puppet/+/845544
[13:23:57] elukey: Ah, great. Many thanks.
[13:27:48] btullis: merged and hosts set with pooled=yes/weight=1, can you retry the cookbook?
[13:28:40] Running now. Looks good so far.
[13:35:53] super
[13:37:47] It's moved onto the second host, so all looking good.
[13:55:52] ok very interesting thing
[13:56:14] I created a benthos config to hit the same wiki/rev-id combination for enwiki-goodfaith, like I was doing with wrk for load testing
[13:56:30] and I can reproduce the good latency behavior
[13:56:40] no spikes up to now, everything good
[13:57:11] so *some* rev-ids may trigger the weird behavior that I am investigating (spikes in latency, timeouts, etc..)
[13:57:39] okok so at least something is starting to make sense :D
[16:37:13] aiko: I think that we are also hitting a weird istio-proxy corner case, I added some info to the task
[16:37:20] but the process pool idea still remains :)
[16:37:24] anyway, weekend time :)
[16:37:29] have a nice weekend folks!
[17:02:11] \o
[17:02:40] elukey: alright, have a nice weekend Luca! :)
[17:05:11] aiko: don't forget to head into the weekend ;) I'm out. Talk to you on Monday!
[17:11:00] klausman: enjoy the weekend!
[17:30:03] Machine-Learning-Team, ORES, Phabricator: Investigate usage of word2vec Debian package - https://phabricator.wikimedia.org/T321383 (hashar) Open→Resolved **Thank you** for the quick reply! ` This object will be destroyed forever: - R2282 (PhabricatorRepository) R2282 operation/debs/word2...