[07:23:45] Good morning folks!
[07:23:56] This is happening this week https://home.mlops.community/public/events/ai-in-production-2024-02-15
[08:25:41] isaranto: o/
[08:25:41] thank you for sharing the MLOps community event. it runs at night (8pm - 4am) on my end, but I'll try to attend the sessions I can :)
[08:26:21] meanwhile, as we discussed in yesterday's meeting, I have evaluated the runtimes for the `preprocess()` and `predict()` functions of the article-descriptions model-server hosted on LiftWing. The results indicate that `preprocess()` runs in about 0.4s while `predict()` takes >2s to run, as shown here:
[08:26:21] https://phabricator.wikimedia.org/P57453
[08:27:03] considering you reported in https://phabricator.wikimedia.org/T343123#9520331 that the bottleneck is the preprocess step, I hesitated to comment on the phab task, as I didn't want us to contradict each other.
[08:41:45] kevinbazira: o/ The results I reported were for the load tests and not for a single request, so they can't be compared. The specific request (Clandonald) was completed in ~1s with the GPU, so it is definitely not the problem. perhaps some of the other requests are taking a lot of time to complete because the articles are big or something similar
[08:43:48] I think we can do the following steps:
[08:43:48] 1. Run a load test only with a couple of requests we know are fast (Clandonald)
[08:43:48] 2. Run the full load test
[08:43:48] 3. Check load test results and the grafana dashboards for both
[08:43:48] 4. Run the same load tests from localhost. Check the logs to see if preprocess results match those of Lift Wing
[08:43:51] wdyt?
[08:44:44] What we're looking for is discrepancies between local/ml-sandbox requests and Lift Wing, to identify if there is any network bottleneck that is causing the increased latency
[08:48:16] 10Machine-Learning-Team, 10Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing.
- https://phabricator.wikimedia.org/T343123#9562463 (10isarantopoulos) We're looking into the increased preprocessing times that we reported to chec...
[08:58:54] afk for a bit - commuting
[09:10:36] isaranto: sounds good
[09:10:36] the only issue with the current `locust` setup is that running full load tests without the grafana dashboards does not show average runtimes for the `preprocess()` function, which is what Seddon and Isaac are asking about.
[09:10:36] I believe we can compare the runtimes of the `preprocess()` and `predict()` functions on LiftWing and LocalServer (ml-sandbox) by running individual requests using the sample inputs in https://phabricator.wikimedia.org/P54507. This will help us determine if there is any discrepancy between LiftWing and LocalServer.
[09:37:57] we don't get average times, but we do get percentiles, which is better for drawing conclusions. We can also check the actual preprocess times from the logs
[09:53:28] o/
[09:53:34] morning!
[09:55:42] hey Aiko!
[10:19:53] https://github.com/astral-sh/uv
[10:19:53] https://astral.sh/blog/uv
[10:20:08] this is awesome! super fast
[10:21:25] it is a new python package manager and installer, and can act as a drop-in replacement for pip (it doesn't support 100% of pip commands though)
[11:17:39] * isaranto afk lunch
[12:30:36] 10Machine-Learning-Team, 10Goal: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9563246 (10isarantopoulos) We'll need to adapt the original image so that it uses Debian, so that we align with [[ https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Operating_System |...
[12:37:42] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "I think this should work!"
[machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1005486 (https://phabricator.wikimedia.org/T356045) (owner: 10AikoChou)
[12:42:51] isaranto: using the sample inputs provided in P54507, I compared the performance of `preprocess()` and `predict()` on LiftWing vs LocalServer. The results are available at https://phabricator.wikimedia.org/P57453#232415.
[12:42:51] To summarize, the results indicate that `preprocess()` is slower on LiftWing, but `predict()` is the slower of the two steps on both LiftWing and LocalServer, making it the main bottleneck.
[12:43:51] (03CR) 10AikoChou: [C: 03+2] "Thanks for the review!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1005486 (https://phabricator.wikimedia.org/T356045) (owner: 10AikoChou)
[12:44:55] kevinbazira: Nice work! this validates the original assumption that there is a network issue on Lift Wing
[12:45:59] a note regarding the predict step: indeed it is the bottleneck, but at this point we are discussing preprocessing. Predict stops being the bottleneck when utilizing the GPU, as we saw earlier
[12:46:29] can you paste the message on the task instead of the paste so it doesn't get lost?
[12:46:53] now we need to add some logging to see which request takes so much time.
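The logging step mentioned above could be sketched roughly like this. This is a minimal illustration, not the actual model-server code: the `preprocess` stand-in and the 1-second threshold are assumptions; a decorator times each call and logs the payload of any request that exceeds the threshold, so slow articles show up in the logs.

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("article-descriptions")

SLOW_THRESHOLD_S = 1.0  # assumed cutoff; tune to the observed latencies

def log_if_slow(fn):
    """Time each call to fn and log the payload when it runs too long."""
    @wraps(fn)
    def wrapper(payload):
        start = time.perf_counter()
        result = fn(payload)
        elapsed = time.perf_counter() - start
        if elapsed > SLOW_THRESHOLD_S:
            logger.warning("slow %s: %.2fs for payload %s",
                           fn.__name__, elapsed, payload)
        return result
    return wrapper

@log_if_slow
def preprocess(payload):
    # Hypothetical stand-in for the real preprocess step.
    return {"features": payload}
```

Wrapping both `preprocess()` and `predict()` this way would show in the logs which specific requests (e.g. large articles) account for the long tail.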
[12:49:27] (03Merged) 10jenkins-bot: revertrisk-multilingual: reorder requirements.txt [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1005486 (https://phabricator.wikimedia.org/T356045) (owner: 10AikoChou)
[12:56:08] We can see the latencies by backend in grafana https://grafana-rw.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=experimental&var-backend=All&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&from=now-3h&to=now
[12:57:25] I see some spikes related to the rest gateway in p95 and p99 (rest-gateway.discovery.wmnet 200 is the entry I'm referring to)
[13:21:01] 10Machine-Learning-Team, 10Goal: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9563373 (10isarantopoulos) I'm proceeding to adapt the upstream image to use Debian with ROCm, as the original one can't be built without a CUDA runtime
[13:42:10] isaranto: based on the link you shared, if we skip `article-descriptions-predictor-default-00025-private`, `InboundPassthroughClusterIpv4` has the highest latency spike at 4.8s in p0.99.
[13:42:10] following closely behind is `rest-gateway.discovery.wmnet` with a spike of 4.65 seconds.
[13:42:10] lastly, `api-ro.discovery.wmnet` has the lowest spike at 246ms.
[14:15:27] just in: new "open source" LLMs by google https://huggingface.co/models?other=gemma&sort=trending&search=google
[14:15:56] I'm saying "open source" because I see some custom license - haven't looked at it yet
[14:47:03] 10Machine-Learning-Team: Test revertrisk-multilingual with GPU - https://phabricator.wikimedia.org/T356045#9563696 (10achou) The latest changes to requirements.txt still resulted in a failed docker image build. Therefore, the torch version conflict between the knowledge integrity and inference services repo was...
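The p50/p95/p99 quantiles read off the grafana dashboard above are the same percentiles locust reports; given a list of raw per-request latencies they can be computed with only the standard library. A small sketch (the sample numbers are illustrative, not real measurements):

```python
import statistics

def latency_percentiles(samples_ms):
    """Return p50/p95/p99 from a list of per-request latencies in ms."""
    qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

# Illustrative samples: most requests ~0.4s, a few slow outliers >2s.
samples = [420, 450, 380, 2100, 460, 430, 2450, 410, 440, 400]
print(latency_percentiles(samples))
```

With a long-tailed distribution like this, p95/p99 surface the slow outliers that an average would smooth over, which is why percentiles are the better basis for conclusions here.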
[14:58:01] 10Machine-Learning-Team: Assess runtime performance impact of pydantic data models in the RRLA model-server - https://phabricator.wikimedia.org/T355742#9563739 (10achou) Following a discussion with Ilias, we will keep an eye on the progress of https://github.com/kserve/kserve/pull/3374. Once the PR is merged, we...
[15:02:34] (03CR) 10Atieno: [C: 03+1] Migrate away from wfGetDB() [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1003397 (https://phabricator.wikimedia.org/T330641) (owner: 10Ladsgroup)
[16:50:23] 10Machine-Learning-Team, 10Goal: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9564281 (10isarantopoulos) I have started adding an image in a fork of the [[ https://github.com/isaranto/kserve/blob/kserve-hf-rocm/python/huggingface_server_debian.Dockerfile | kserve r...
[17:15:24] (03PS3) 10Jforrester: Migrate away from wfGetDB() [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1003397 (https://phabricator.wikimedia.org/T330641) (owner: 10Ladsgroup)
[17:16:00] going afk folks o/
[17:16:27] (03CR) 10Jforrester: [C: 03+2] Migrate away from wfGetDB() [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1003397 (https://phabricator.wikimedia.org/T330641) (owner: 10Ladsgroup)
[17:35:14] (03Merged) 10jenkins-bot: Migrate away from wfGetDB() [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1003397 (https://phabricator.wikimedia.org/T330641) (owner: 10Ladsgroup)