[06:26:37] <elukey>	 hello folks
[06:26:48] <elukey>	 need to run some errands this morning, will join a little later
[06:27:15] <elukey>	 klausman: o/ later on let's sync about https://phabricator.wikimedia.org/T321310, we can split the reboots
[08:08:51] <elukey>	 back!
[08:08:59] <elukey>	 going to start with rebooting staging
[08:15:55] <elukey>	 btullis: ah I just realized that ml-staging is not supported in the cookbook, thanks for the pointer :)
[08:30:25] <aiko>	 elukey: https://bugs.python.org/issue24882 I wonder if there is any chance that the latencies problem you mentioned - sometimes p99 for api-ro.discovery.wmnet goes up to seconds - is caused by this bug of ThreadPoolExecutor. It was resolved in python 3.8 but we're using python 3.7
[08:31:10] <aiko>	 elukey: I'm trying to write a simple test to verify it
[08:31:15] <elukey>	 aiko: o/
[08:31:40] <elukey>	 I don't think that we use the thread pool executor at all
[08:31:57] <elukey>	 at least with our current async code, we should be using only the ioloop single thread
[08:32:43] <elukey>	 I think that those threads are used only if you explicitly run_in_executor()
[08:33:01] <elukey>	 (see the code about the blog post that we were discussing the other day)
[08:34:00] <aiko>	 but doesn't kserve use it here? https://github.com/kserve/kserve/blob/release-0.8/python/kserve/kserve/model_server.py#L130
[08:35:46] <elukey>	 aiko: my understanding is that it simply sets the pool as default executor, but you have to use it
[08:36:07] <elukey>	 see https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor
[08:37:23] <elukey>	 IIUC the idea is that you can run in that thread pool blocking I/O code (in our case, blocking http requests that give up the GIL easily etc..)
[08:38:01] <elukey>	 if you don't run a function with loop.run_in_executor() then the code is handled by the single thread ioloop
[08:38:09] <elukey>	 in an async fashion
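
A minimal sketch of the distinction elukey describes, not taken from the kserve code base: it assumes only the standard library, and the names `blocking_io`, `native_coroutine` and the `pool` thread-name prefix are made up for illustration. Registering a default executor does not move coroutines onto it; only functions passed to `run_in_executor()` run on a pool thread.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


def blocking_io():
    # Plain (non-async) function: report which thread executes it.
    return threading.current_thread().name


async def native_coroutine():
    # Coroutines are driven directly by the thread running the event loop.
    await asyncio.sleep(0)
    return threading.current_thread().name


async def main():
    loop = asyncio.get_running_loop()
    # Assumption: this mirrors "sets the pool as default executor" in spirit only.
    loop.set_default_executor(
        ThreadPoolExecutor(max_workers=2, thread_name_prefix="pool")
    )
    on_loop = await native_coroutine()
    # Passing None selects the default executor registered above.
    on_pool = await loop.run_in_executor(None, blocking_io)
    return on_loop, on_pool


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The coroutine reports the event loop's thread, while `blocking_io` reports a thread whose name starts with `pool`.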
[08:38:56] <elukey>	 klausman, btullis - https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/845430
[08:41:01] <klausman>	 +1'd
[08:41:17] <elukey>	 thanks :)
[08:41:24] <elukey>	 I started ml-serve-codfw in the meantime
[08:41:47] <klausman>	 I can do the staging ones, unless someone is working on stuff there
[08:45:48] <elukey>	 we need to merge the patch before
[08:45:56] <klausman>	 Sure :)
[08:46:43] <klausman>	 I'll do the control plane for staging by hand since that needs doing anyway
[08:53:43] <aiko>	 elukey: the functions like blocking_io and cpu_bound in the example are not async functions. I think that is a workaround/low-level approach (not changing the function itself) to make blocking functions run asynchronously. In the kserve model server we do use async preprocess, async http calls and asyncio.gather, which are high-level asyncio functions
[08:54:02] <aiko>	 elukey: but maybe what you said is right..
[08:54:44] <elukey>	 aiko: yep yep I am aware, but the async functions should natively run on the main asyncio thread
[08:54:54] <elukey>	 not on the thread pool
[08:59:48] <aiko>	 elukey: yeah that makes sense
[09:00:00] <elukey>	 aiko: it is the only explanation that makes sense in my head
[09:02:13] <elukey>	 aiko: what is the avg latency of kserve responses when you test on minikube?
[09:02:35] <elukey>	 because in local testing I see something like 2s, so most of the time is dominated by http calls waiting
[09:02:56] <elukey>	 and I don't see benefits while using the process pool (but I verified, the extra processes are created)
[09:03:13] <elukey>	 I am wondering if we'd see good results on our k8s clusters though
[09:03:33] <elukey>	 something is not right, I feel I am missing something
[09:04:01] <klausman>	 Ok, staging-ctrl done. I'll do serve-ctrl in eqiad as well.
[09:04:19] <elukey>	 ack thanks
[09:04:31] <elukey>	 I am waiting for Ben to +1 the change before merging
[09:05:02] <aiko>	 elukey: avg latency is around 1s
[09:06:07] <klausman>	 oops, accidentally rebooting one of the ctrl vms for serve-codfw :-|
[09:06:19] <aiko>	 elukey: but it's almost the same as without process pool
[09:06:23] <klausman>	 Oh well, it shouldn't break anything
[09:06:41] <elukey>	 klausman: nono please go ahead
[09:07:04] <klausman>	 I just didn't want to reboot 2002 while you reboot 2001 :)
[09:07:13] <elukey>	 aiko: yeah it makes sense, the time is dominated by the http calls, so the use of the process pool for the model.score is negligible
[09:07:56] <elukey>	 klausman: you can do the other ctrl as well, I am only doing workers
[09:08:04] <klausman>	 okey-dokey
[09:13:32] <elukey>	 aiko,klausman remember to fill the standup on slack :)
[09:13:42] <klausman>	 oh oops, yeah
[09:16:21] <aiko>	 ah thanks!
[09:23:05] <elukey>	 klausman: rebooting also the staging worker nodes
[09:23:34] <klausman>	 One of the VMs failed the status check in icinga
[09:23:48] <klausman>	 ifup@ens13.service didn't come up correctly, though it _seems_ the machine is fine
[09:24:34] <klausman>	 ml-serve-ctrl2001
[09:25:29] <elukey>	 klausman: yeah it is a known issue https://phabricator.wikimedia.org/T273026
[09:27:00] <klausman>	 Alright, did the flush thing
[09:27:34] <klausman>	 All ctrl nodes done, will update the ticket
[09:30:20] <elukey>	 super
[09:38:25] <btullis>	 klausman: elukey: Shall I kick off the reboot-nodes cookbook for dse-k8s then, or would one of you like to?
[09:41:40] <elukey>	 btullis: you can go ahead! 
[09:41:52] <btullis>	 Ack, thanks.
[09:42:00] <klausman>	 elukey: if you want, I can do the workers for ml-serve-eqiad
[09:42:16] <elukey>	 klausman: go ahead
[09:43:31] <klausman>	 ok, running
[09:49:34] <aiko>	 https://leimao.github.io/blog/Python-Concurrency-High-Level/ illustrates well multiprocessing vs threading vs asyncio in python
[09:52:20] <klausman>	 that note at the end of the article about htop/top is just wrong tho. htop just defaults to showing all threads in a process, whereas top never does.
[09:52:40] <klausman>	 (thread display can be turned off in htop's options/settings)
[09:55:15] <elukey>	 aiko: that more or less follows what we were saying before, right?
[10:01:46] <aiko>	 elukey: yeah I'm not sure, but I want to understand more about how tornado works behind kserve
[10:02:11] <elukey>	 aiko: IIUC the async part is 1:1 asyncio nowadays
[10:02:19] <elukey>	 it was different in the past, but they merged the behavior
[10:03:48] <aiko>	 elukey: what do you mean by 1:1 asyncio?
[10:11:20] <wikibugs>	 Machine-Learning-Team, ORES, Phabricator: Investigate usage of word2vec Debian package - https://phabricator.wikimedia.org/T321383 (hashar)
[10:17:40] <aiko>	 after reading the article, I feel we don't need to use multiprocessing to solve our problem, threading or asyncio should be enough if the bottleneck is http calls waiting.
[10:18:04] <elukey>	 aiko: we don't know if it is http only, this is the main issue
[10:18:45] <elukey>	 if the async thread loop blocks when running model.score, everything else blocks, including http calls
[10:19:29] <elukey>	 and not only the ones related to a single kserve request, all the ones landing on the model server block, since they are processed by the same asyncio thread
[10:19:43] <elukey>	 this is why I am saying that blocking code shouldn't be run on the asyncio thread
[10:20:10] <elukey>	 blocking the http calls may delay their overall execution, causing timeouts, etc..
[10:20:15] <elukey>	 lemme know if it makes sense
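
The blocking effect described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the model server code: `fake_score` is a hypothetical stand-in for a blocking `model.score()` call, and `heartbeat` stands in for other in-flight work (such as HTTP calls) sharing the same event loop. The sketch measures the longest pause the heartbeat sees when the blocking call runs on the loop versus in an executor.

```python
import asyncio
import time


def fake_score():
    # Hypothetical stand-in for a blocking model.score() call.
    time.sleep(0.3)


async def heartbeat(gaps, stop):
    # Stand-in for other work on the loop: record the time between wake-ups.
    last = time.monotonic()
    while not stop.is_set():
        await asyncio.sleep(0.01)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def measure(offload):
    loop = asyncio.get_running_loop()
    gaps, stop = [], asyncio.Event()
    task = asyncio.ensure_future(heartbeat(gaps, stop))
    await asyncio.sleep(0.05)  # let the heartbeat start ticking
    if offload:
        # Runs on a pool thread; the loop keeps serving the heartbeat.
        await loop.run_in_executor(None, fake_score)
    else:
        # Runs on the loop thread; nothing else is scheduled meanwhile.
        fake_score()
    await asyncio.sleep(0.05)  # give the heartbeat a chance to wake up
    stop.set()
    await task
    return max(gaps)


if __name__ == "__main__":
    print("max gap, blocking on the loop:", asyncio.run(measure(offload=False)))
    print("max gap, offloaded:", asyncio.run(measure(offload=True)))
```

Note that `time.sleep` releases the GIL, so a thread pool is enough here. A CPU-bound scoring function would hold the GIL for most of its run, which is the case the process pool discussed in this log is meant to address.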
[10:21:50] <aiko>	 elukey: it makes sense. have you tried running on a thread pool? 
[10:22:46] <elukey>	 aiko: I tried with the run_in_executor() method, but no real change
[10:23:40] <elukey>	 aiko: in my head with all the work that you did for AsyncSession etc.. we should be able to scale to hundreds of HTTP calls without issues (unless something blocks them repeatedly)
[10:25:48] <aiko>	 elukey: yep it works well for the outlink model
[10:25:55] <elukey>	 ok I made a test in our model server code, the process pool effectively works, I see functions being run in separate processes
[10:26:14] <elukey>	 (I printed the process pids via logging)
[10:29:42] <elukey>	 aiko: I fear that even the outlink model may suffer from this, under certain load
[10:29:48] <elukey>	 it is blocking code as well
[10:33:49] <elukey>	 going afk for lunch, ttl!
[10:42:20] <klausman>	 4/8 workers done in eqiad :)
[10:42:27] <klausman>	 also, lunch
[10:44:15] <btullis>	 FYI the cookbook still failed for dse-k8s :(
[10:44:20] <btullis>	 https://www.irccloud.com/pastebin/x4dVvTOU/
[10:45:22] <btullis>	 I think it must be related to LVS, but haven't investigated too deeply yet.
[10:48:47] <wikibugs>	 Machine-Learning-Team, ORES, Phabricator: Investigate usage of word2vec Debian package - https://phabricator.wikimedia.org/T321383 (awight) @hashar +1 I don't think this was ever used, please feel free to delete it.  If I'm wrong, it's easy to recreate from the (now out-of-date) upstream.
[10:49:33] <klausman>	 Hmm. host being still depooled is not something I have seen before
[10:50:08] <klausman>	 Normally, after the daemonset messages, there is a 35s sleep and then downtime is scheduled
[10:54:28] <wikibugs>	 Lift-Wing, Documentation, Machine-Learning-Team (Active Tasks): Improve Lift Wing documentation - https://phabricator.wikimedia.org/T316098 (Miriam) HI @AikoChou this is wonderful wonderful, thank you so much!  Most of my feedback was included already in Isaac's comments! Just a few more suggestions:...
[12:07:14] <klausman>	 eqiad all done
[13:00:18] <chrisalbon>	 happy friday all!
[13:14:24] <elukey>	 o/
[13:14:26] <elukey>	 btullis: weird!
[13:15:07] <elukey>	 8
[13:15:09] <elukey>	 uff
[13:18:46] <elukey>	 klausman: I see all nodes with 5.10.149-1 from cumin, our side should be done
[13:20:01] <elukey>	 btullis: ah yes it should be kubesvc
[13:22:33] <elukey>	 btullis, klausman - https://gerrit.wikimedia.org/r/c/operations/puppet/+/845544
[13:23:57] <btullis>	 elukey: Ah, great. Many thanks.
[13:27:48] <elukey>	 btullis: merged and hosts set with pooled=yes/weight=1, can you retry the cookbook?
[13:28:40] <btullis>	 Running now. Looks good so far.
[13:35:53] <elukey>	 super
[13:37:47] <btullis>	 It's moved onto the second host, so all looking good.
[13:55:52] <elukey>	 ok very interesting thing
[13:56:14] <elukey>	 I created a benthos config to hit the same wiki/rev-id combination for enwiki-goodfaith, like I was doing with wrk for load testing
[13:56:30] <elukey>	 and I can reproduce the good latency behavior
[13:56:40] <elukey>	 no spikes up to now, everything good
[13:57:11] <elukey>	 so *some* rev-ids may trigger the weird behavior that I am investigating (spikes in latency, timeouts, etc..)
[13:57:39] <elukey>	 okok so at least something is starting to make sense :D
[16:37:13] <elukey>	 aiko: I think that we are also hitting a weird istio-proxy corner case, I added some info to the task
[16:37:20] <elukey>	 but the process pool idea still remains :)
[16:37:24] <elukey>	 anyway, weekend time :)
[16:37:29] <elukey>	 have a nice weekend folks!
[17:02:11] <klausman>	 \o
[17:02:40] <aiko>	 elukey: alright, have a nice weekend Luca! :)
[17:05:11] <klausman>	 aiko: don't forget to head into the weekend ;) I'm out. Talk to you on Monday!
[17:11:00] <aiko>	 klausman: enjoy the weekend! 
[17:30:03] <wikibugs>	 Machine-Learning-Team, ORES, Phabricator: Investigate usage of word2vec Debian package - https://phabricator.wikimedia.org/T321383 (hashar) Open→Resolved **Thank you** for the quick reply! ` This object will be destroyed forever:      - R2282 (PhabricatorRepository) R2282 operation/debs/word2...