[07:23:45] Good morning folks!
[07:23:56] This is happening this week https://home.mlops.community/public/events/ai-in-production-2024-02-15
[08:25:41] isaranto: o/
[08:25:41] thank you for sharing the MLOps community event. it runs at night (8pm - 4am) on my end, but I'll try to attend the sessions I can :)
[08:26:21] meanwhile, as we discussed in yesterday's meeting, I have evaluated the runtimes for the `preprocess()` and `predict()` functions of the article-descriptions model-server hosted on LiftWing. The results indicate that `preprocess()` runs in about 0.4s while `predict()` takes >2s to run, as shown here:
[08:26:21] https://phabricator.wikimedia.org/P57453
[08:27:03] considering you reported in https://phabricator.wikimedia.org/T343123#9520331 that the bottleneck is the preprocess step, I hesitated to comment on the phab task, as I didn't want us to contradict each other.
[08:41:45] kevinbazira: o/ The results I reported were for the load tests and not for a single request, so they can't be compared. The specific request (Clandonald) was completed in ~1s with the GPU, so it is definitely not the problem. perhaps some of the other requests are taking a lot of time to complete because the articles are big or something similar
[08:43:48] I think we can do the following steps:
[08:43:48] 1. Run a load test only with a couple of requests we know are fast (Clandonald)
[08:43:48] 2. Run the full load test
[08:43:48] 3. Check load test results and the grafana dashboards for both
[08:43:48] 4. Run the same load tests from localhost. Check the logs to see if preprocess results match those of Lift Wing
[08:43:51] wdyt?
[08:44:44] What we're looking for is discrepancies between local/ml-sandbox requests and Lift Wing, to identify if there is any network bottleneck that is causing the increased latency
[08:48:16] 10Machine-Learning-Team, 10Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing.
- https://phabricator.wikimedia.org/T343123#9562463 (10isarantopoulos) We're looking into the increased preprocessing times that we reported to chec...
[08:58:54] afk for a bit - commuting
[09:10:36] isaranto: sounds good
[09:10:36] the only issue with the current `locust` setup is that running full load tests without the grafana dashboards does not show average runtimes for the `preprocess()` function, which is what Seddon and Isaac are asking about.
[09:10:36] I believe we can compare the runtimes of the `preprocess()` and `predict()` functions on LiftWing and LocalServer (ml-sandbox) by running individual requests using the sample inputs in https://phabricator.wikimedia.org/P54507. This will help us determine if there is any discrepancy between LiftWing and LocalServer.
[09:37:57] we don't get average times, but we do get percentiles, which is better for drawing conclusions. We can also check the actual preprocess times from the logs
[09:53:28] o/
[09:53:34] morning!
[09:55:42] hey Aiko!
[10:19:53] https://github.com/astral-sh/uv
[10:19:53] https://astral.sh/blog/uv
[10:20:08] this is awesome! super fast
[10:21:25] it is a new python package manager and installer, and can act as a drop-in replacement for pip (it doesn't support 100% of pip commands though)
[11:17:39] * isaranto afk lunch
[12:30:36] 10Machine-Learning-Team, 10Goal: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9563246 (10isarantopoulos) We'll need to adapt the original image so that it uses Debian, so that we align with [[ https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Operating_System |...
[12:37:42] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "I think this should work!"
[machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1005486 (https://phabricator.wikimedia.org/T356045) (owner: 10AikoChou)
[12:42:51] isaranto: using the sample inputs provided in P54507, I compared the performance of `preprocess()` and `predict()` on LiftWing vs LocalServer. The results are available at https://phabricator.wikimedia.org/P57453#232415.
[12:42:51] To summarize, the results indicate that `preprocess()` is slower on LiftWing, but `predict()` is the slower of the two steps on both LiftWing and LocalServer, making it the main bottleneck.
[12:43:51] (03CR) 10AikoChou: [C: 03+2] "Thanks for the review!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1005486 (https://phabricator.wikimedia.org/T356045) (owner: 10AikoChou)
[12:44:55] kevinbazira: Nice work! this validates the original assumption that there is a network issue on Lift Wing
[12:45:59] a note regarding the predict step: indeed it is the bottleneck, but at this point we are discussing preprocessing. Predict stops being the bottleneck when utilizing the GPU, as we saw earlier
[12:46:29] can you paste the message on the task instead of the paste so it doesn't get lost?
[12:46:53] now we need to add some logging to see which request takes so much time.
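The logging step mentioned above could be sketched roughly like this. This is a minimal illustration, not the actual model-server code: the `preprocess` stand-in and the 1-second threshold are assumptions; a decorator times each call and logs the payload of any request that exceeds the threshold, so slow articles show up in the logs.

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("article-descriptions")

SLOW_THRESHOLD_S = 1.0  # assumed cutoff; tune to the observed latencies

def log_if_slow(fn):
    """Time each call to fn and log the payload when it runs too long."""
    @wraps(fn)
    def wrapper(payload):
        start = time.perf_counter()
        result = fn(payload)
        elapsed = time.perf_counter() - start
        if elapsed > SLOW_THRESHOLD_S:
            logger.warning("slow %s: %.2fs for payload %s",
                           fn.__name__, elapsed, payload)
        return result
    return wrapper

@log_if_slow
def preprocess(payload):
    # Hypothetical stand-in for the real preprocess step.
    return {"features": payload}
```

Wrapping both `preprocess()` and `predict()` this way would show in the logs which specific requests (e.g. large articles) account for the long tail.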
[12:49:27] (03Merged) 10jenkins-bot: revertrisk-multilingual: reorder requirements.txt [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1005486 (https://phabricator.wikimedia.org/T356045) (owner: 10AikoChou)
[12:56:08] We can see the latencies by backend in grafana https://grafana-rw.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=experimental&var-backend=All&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&from=now-3h&to=now
[12:57:25] I see some spikes related to the rest gateway in p95 and p99 (rest-gateway.discovery.wmnet 200 is the entry I'm referring to)
[13:21:01] 10Machine-Learning-Team, 10Goal: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9563373 (10isarantopoulos) I'm proceeding to adapt the upstream image to use Debian with ROCm, as the original one can't be built without a CUDA runtime
[13:42:10] isaranto: based on the link you shared, if we skip `article-descriptions-predictor-default-00025-private`, `InboundPassthroughClusterIpv4` has the highest latency spike at 4.8s in p0.99.
[13:42:10] following closely behind is `rest-gateway.discovery.wmnet` with a spike of 4.65 seconds.
[13:42:10] lastly, `api-ro.discovery.wmnet` has the lowest spike at 246ms.
[14:15:27] just in: new "open source" LLMs by google https://huggingface.co/models?other=gemma&sort=trending&search=google
[14:15:56] I'm saying "open source" because I see some custom license - haven't looked at it yet
[14:47:03] 10Machine-Learning-Team: Test revertrisk-multilingual with GPU - https://phabricator.wikimedia.org/T356045#9563696 (10achou) The latest changes to requirements.txt still resulted in a failed docker image build. Therefore, the torch version conflict between the knowledge integrity and inference services repo was...
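The p50/p95/p99 quantiles read off the grafana dashboard above are the same percentiles locust reports; given a list of raw per-request latencies they can be computed with only the standard library. A small sketch (the sample numbers are illustrative, not real measurements):

```python
import statistics

def latency_percentiles(samples_ms):
    """Return p50/p95/p99 from a list of per-request latencies in ms."""
    qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

# Illustrative samples: most requests ~0.4s, a few slow outliers >2s.
samples = [420, 450, 380, 2100, 460, 430, 2450, 410, 440, 400]
print(latency_percentiles(samples))
```

With a long-tailed distribution like this, p95/p99 surface the slow outliers that an average would smooth over, which is why percentiles are the better basis for conclusions here.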
[14:58:01] 10Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742#9563739 (10achou) Following a discussion with Ilias, we will keep an eye on the progress of https://github.com/kserve/kserve/pull/3374. Once the PR is merged, we...
[15:02:34] (03CR) 10Atieno: [C: 03+1] Migrate away from wfGetDB() [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1003397 (https://phabricator.wikimedia.org/T330641) (owner: 10Ladsgroup)
[16:50:23] 10Machine-Learning-Team, 10Goal: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9564281 (10isarantopoulos) I have started adding an image in a fork of the [[ https://github.com/isaranto/kserve/blob/kserve-hf-rocm/python/huggingface_server_debian.Dockerfile | kserve r...
[17:15:24] (03PS3) 10Jforrester: Migrate away from wfGetDB() [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1003397 (https://phabricator.wikimedia.org/T330641) (owner: 10Ladsgroup)
[17:16:00] going afk folks o/
[17:16:27] (03CR) 10Jforrester: [C: 03+2] Migrate away from wfGetDB() [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1003397 (https://phabricator.wikimedia.org/T330641) (owner: 10Ladsgroup)
[17:35:14] (03Merged) 10jenkins-bot: Migrate away from wfGetDB() [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1003397 (https://phabricator.wikimedia.org/T330641) (owner: 10Ladsgroup)