[03:22:41] (PS3) Ilias Sarantopoulos: huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246)
[03:25:40] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) (owner: Ilias Sarantopoulos)
[03:34:29] Machine-Learning-Team, Patch-For-Review: Update Pytorch base image to 2.3.0 - https://phabricator.wikimedia.org/T365166#9824152 (isarantopoulos) Open→Resolved
[04:32:41] hello!
[05:24:37] (PS4) Ilias Sarantopoulos: huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246)
[05:30:19] (CR) CI reject: [V:-1] huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) (owner: Ilias Sarantopoulos)
[05:44:53] I'm trying to deploy mistral using the GPU on ml-staging and all pods are being evicted with the following message
[05:44:53] `Warning Evicted 2m18s kubelet The node had condition: [DiskPressure].`
[06:50:25] (PS5) Ilias Sarantopoulos: huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246)
[06:50:44] (CR) CI reject: [V:-1] huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) (owner: Ilias Sarantopoulos)
[06:52:35] (CR) Ilias Sarantopoulos: "✔ The image has been tested locally (with cpu version of torch) and works as expected."
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) (owner: Ilias Sarantopoulos)
[08:05:59] morning!
[08:06:09] isaranto: I think I know what the problem is, trying a fix
[08:07:35] hey Tobias
[08:07:37] ok thanks!
[08:13:19] The root cause is likely that the kubelet partition (/var/lib/kubelet) is too small. Our standard install recipe uses only something like 30G, and a while back we increased all of them to ~110. But since the reinstall with bookworm, the partition on the staging host is too small. I just bumped that and am rebooting the host to make sure it's picked up correctly. Should be back in a couple of
[08:13:21] minutes
[08:16:22] and I clearly do not have enough caffeine in my system: rebooted the wrong host :-/
[08:26:27] isaranto: I think it should work now. It's downloading stuff. One question, I see it downloading llm/Mistral-7B-Instruct-v0.2/model-00001-of-00003.safetensors and llm/Mistral-7B-Instruct-v0.2/pytorch_model-00001-of-00003.bin (and 002/003 of both varieties). Is that necessary?
[08:28:15] klausman: nope! we had mentioned this previously but I never worked on it since we were working on the GPU issue. I'll work on this today too (first test, then leave only what is necessary in the swift bucket)
[08:28:35] ack, thanks!
[08:28:43] thanks for the reminder!
[08:28:52] The pod is still initializing, but the download seems to have worked fine
[08:28:57] ah, it's running now :)
[08:29:01] coolio!
[08:29:36] no GPU issue in the logs. 🎉
[08:30:06] Hooray!
[08:30:47] and I am seeing good log lines for a POST request just now
[08:31:03] 2024-05-23 08:30:18.071 kserve.trace kserve.io.kserve.protocol.rest.v1_endpoints.predict: 1.9031920433044434
[08:31:53] got a response in less than 2 seconds!
[08:32:08] that's great for a 7B param model
[08:33:09] going afk for an hour, will report all my findings later!
[08:33:23] ack!
[09:09:27] back earlier!
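As an aside, the `kserve.trace` log line quoted above carries the per-request latency as a trailing float in seconds. A small sketch (not part of the deployment tooling) of how one might parse such lines to pull out latency figures; the line format is assumed purely from the single example in this log:

```python
import re

# Assumed format, based on the one example above:
# "<date> <time> kserve.trace <dotted.metric.name>: <seconds>"
TRACE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) kserve\.trace (?P<metric>\S+): (?P<seconds>[\d.]+)$"
)

def parse_trace_line(line: str):
    """Return (metric, seconds) for a kserve.trace line, or None if it doesn't match."""
    m = TRACE_RE.match(line.strip())
    if m is None:
        return None
    return m.group("metric"), float(m.group("seconds"))

line = ("2024-05-23 08:30:18.071 kserve.trace "
        "kserve.io.kserve.protocol.rest.v1_endpoints.predict: 1.9031920433044434")
metric, seconds = parse_trace_line(line)
print(metric, round(seconds, 2))  # the ~1.9 s predict latency mentioned in the chat
```

Aggregating these per-metric over a log stream would give a quick latency picture without any extra instrumentation.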
2nd time this year I book a doc appt and the doc takes a sick leave :)
[09:12:27] (PS6) Ilias Sarantopoulos: huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246)
[09:57:50] (CR) Klausman: [C:+1] huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) (owner: Ilias Sarantopoulos)
[10:04:17] * klausman lunch
[10:07:46] klausman: thanks for the review. I'll upgrade kserve to the latest version before I push, since there have been new changes since then (a new release candidate, kserve 0.13-rc1, that fixes some issues and supports more hf models)
[10:32:55] fyi: https://gitlab.wikimedia.org/repos/releng/blubber/-/blob/main/CHANGELOG.md?ref_type=heads#v0230
[10:33:33] the new blubber version uses a virtualenv, and solves the --break-system-packages workaround described here https://phabricator.wikimedia.org/T346090
[10:50:33] Neat re: blubber, and ack regarding kserve version
[11:14:53] * isaranto lunch
[12:34:11] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, MediaWiki-Recent-changes, and 2 others: Enable Revert Risk RecentChanges filter on id.wiki - https://phabricator.wikimedia.org/T365701 (Samwalton9-WMF) NEW
[12:35:01] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, MediaWiki-Recent-changes, and 2 others: Enable Revert Risk RecentChanges filter on id.wiki - https://phabricator.wikimedia.org/T365701#9825377 (Samwalton9-WMF)
[12:35:08] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, Wikipedia-Android-App-Backlog, and 2 others: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#9825378 (Samwalton9-WMF)
[12:54:11] o/
[12:54:19] totally forgot about the kubelet partition, nice fix!
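Earlier in the log, the pod was seen downloading both safetensors and pytorch_model .bin shards for the same model, and the plan was to leave only what is necessary in the swift bucket. A sketch of how one might flag the redundant files before pruning; the filenames mirror the ones in the log, but the helper and its keep-safetensors policy are illustrative, not the team's actual tooling:

```python
# Hypothetical cleanup helper: if safetensors shards are present, the
# duplicate pytorch_model*.bin shards are not needed at serving time.
def redundant_weight_files(files):
    """Return the .bin weight files that duplicate existing safetensors shards."""
    has_safetensors = any(f.endswith(".safetensors") for f in files)
    if not has_safetensors:
        return []  # nothing to drop: .bin files are the only copy
    return [f for f in files if f.endswith(".bin")]

files = [
    "llm/Mistral-7B-Instruct-v0.2/model-00001-of-00003.safetensors",
    "llm/Mistral-7B-Instruct-v0.2/model-00002-of-00003.safetensors",
    "llm/Mistral-7B-Instruct-v0.2/model-00003-of-00003.safetensors",
    "llm/Mistral-7B-Instruct-v0.2/pytorch_model-00001-of-00003.bin",
    "llm/Mistral-7B-Instruct-v0.2/pytorch_model-00002-of-00003.bin",
    "llm/Mistral-7B-Instruct-v0.2/pytorch_model-00003-of-00003.bin",
]
for f in redundant_weight_files(files):
    print("would delete:", f)
```

Dropping the three .bin shards would roughly halve both the bucket footprint and the pod's download volume.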
[12:54:30] o/
[12:54:48] --^ revertrisk on recent changes is happening!
[12:54:50] so IIUC mistral 7b returns a result in ~2s?
[12:55:54] yes!
[12:56:19] super
[12:56:26] something seems to start working :D
[12:56:31] I still haven't updated the task, I'm building and testing the image with the newer versions and the rest of my day is with meetings BUT
[12:56:58] it is a huge improvement, as a previously reported similar request took 30 seconds! https://phabricator.wikimedia.org/T362670#9741128
[12:57:02] I have an idea about how to refactor the pytorch base images in production-images, will try to send a patch later on
[12:58:13] ok! I also want to submit another patch as I'd like to also install pytorch-triton-rocm from the same repo. That way we won't have to do anything with the https://download.pytorch.org/ repo in inference-services
[12:59:28] Machine-Learning-Team, serviceops, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9825487 (elukey)
[13:01:19] Machine-Learning-Team, serviceops, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9825495 (elukey) I had to copy more packages (updated the task's description), but everything worked fine on ml-staging2001. The ML team is unblocked and can...
[13:20:56] Machine-Learning-Team, serviceops, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9825618 (MoritzMuehlenhoff) Dragonfly is an internally built golang package, it would be better if we properly rebuilt it on bookworm with current Go, otherwi...
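The kubelet partition fix referenced above ties back to the morning's DiskPressure evictions: the kubelet signals DiskPressure when available space on its filesystem drops below an eviction threshold (upstream Kubernetes defaults to a hard threshold of nodefs.available < 10%), and then evicts pods. A sketch of that arithmetic; the byte figures are illustrative, not measurements from ml-staging:

```python
# Illustrative only: shows why a ~30G /var/lib/kubelet partition trips the
# kubelet's default nodefs.available < 10% hard eviction threshold once a
# large model download lands, while a ~110G partition leaves headroom.
def disk_pressure(total_bytes: int, free_bytes: int, threshold: float = 0.10) -> bool:
    """True if available space is below the eviction threshold fraction."""
    return free_bytes / total_bytes < threshold

GiB = 1024 ** 3

# Hypothetical numbers: ~28G of images and model shards on a 30G partition.
print(disk_pressure(30 * GiB, 2 * GiB))    # True  -> node reports DiskPressure
# Same absolute usage on the enlarged ~110G partition.
print(disk_pressure(110 * GiB, 82 * GiB))  # False -> pods stay scheduled
```

This is also why every pod on the node was evicted at once: DiskPressure is a node condition, not a per-pod one.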
[13:33:12] oh my, I had been building the image with the wrong base image (the one with the pip files in it)
[13:33:19] and I was wondering why it is that big
[13:34:34] (PS7) Ilias Sarantopoulos: huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246)
[13:36:19] (CR) Klausman: [C:+1] huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) (owner: Ilias Sarantopoulos)
[13:56:16] Machine-Learning-Team, serviceops, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9825766 (elukey) >>! In T365253#9825618, @MoritzMuehlenhoff wrote: > Dragonfly is an internally built golang package, it would be better if we properly rebuil...
[13:56:25] (PS8) Ilias Sarantopoulos: huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246)
[13:56:43] ok, we're there, it is ready, and the compressed size is 3.14GB
[14:37:43] https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1035441 is the refactor that I imagined
[14:38:00] to simplify the current config a little
[14:39:50] on it!
[14:40:24] are you folks ok with me merging this https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1032777?
[14:41:03] all module updates come from the newest version of kserve (huggingfaceserver) and vllm
[14:41:22] (CR) Elukey: [C:+1] huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) (owner: Ilias Sarantopoulos)
[14:41:35] I haven't checked all the package upgrades but it looks good
[14:44:09] Good morning all
[14:45:52] morning Chris!
[14:49:34] elukey: there are also some changes in blubber https://gitlab.wikimedia.org/repos/releng/blubber/-/commit/0ecc6f22465fc989dd09e2e9c44a4f5ab9e9c53a that will allow us to remove the --break-system-packages directive. From now on, everything in python is handled in a virtualenv in blubber, so it would work. Not saying we do it now, just as an fyi. I'll open a task and we'll do it later on when we upgrade to blubber v0.23.0
[14:49:39] in order to test it etc.
[14:49:52] just mentioning it since you had worked on it
[14:50:42] (CR) Ilias Sarantopoulos: [C:+2] huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) (owner: Ilias Sarantopoulos)
[14:50:56] 🤞
[14:51:26] (Merged) jenkins-bot: huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) (owner: Ilias Sarantopoulos)
[14:52:26] nice!
[16:08:43] Machine-Learning-Team, Patch-For-Review: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) - https://phabricator.wikimedia.org/T365246#9826503 (isarantopoulos) Currently getting a CrashLoopBackoff in the pod with the updated image. However there is something I missed during the update:...
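For context on the POST requests being tested against the model: the trace line earlier in the log points at kserve's v1 REST protocol (`rest.v1_endpoints.predict`), which wraps inputs in an `"instances"` list sent to `/v1/models/<name>:predict`. A minimal sketch of building such a request; the host, model name, and instance shape are assumptions for illustration, not the actual Lift Wing staging configuration:

```python
import json

MODEL = "mistral-7b-instruct"  # hypothetical model name
URL = f"https://inference-staging.example/v1/models/{MODEL}:predict"  # hypothetical host

def build_predict_payload(prompt: str) -> str:
    # kserve's v1 protocol wraps inputs in an "instances" list; the per-instance
    # fields depend on the server, so {"prompt": ...} here is an assumption.
    return json.dumps({"instances": [{"prompt": prompt}]})

payload = build_predict_payload("Write a haiku about wikis.")
print(payload)

# Actually sending it needs network access and the right routing headers,
# so the call itself is left commented out:
# import urllib.request
# req = urllib.request.Request(URL, data=payload.encode(),
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```

The ~1.9 s figure in the log would be the server-side time for one such predict call.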
[16:30:54] Machine-Learning-Team, Patch-For-Review: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) - https://phabricator.wikimedia.org/T365246#9826605 (isarantopoulos) Currently investigating the issue to see if MI 100 (gfx908) is supported by vllm after all. Although documentation mentioned ab...
[16:30:58] going afk folks, have a nice evening!
[17:33:19] o/
[18:37:19] (CR) Jdlrobson: [C:+2] i18n: Replace mw: interwiki with url to mediawiki.org [extensions/ORES] - https://gerrit.wikimedia.org/r/1032071 (owner: Umherirrender)
[18:53:02] Lift-Wing: Test LiftWing API/Predictions from Hadoop - https://phabricator.wikimedia.org/T304425#9827229 (fkaelin) Open→Resolved a: fkaelin Closing this as resolved. After more discussion and some experimentation, it was decided that doing batch inference within the distributed jobs (e.g. by b...
[19:03:13] (Merged) jenkins-bot: i18n: Replace mw: interwiki with url to mediawiki.org [extensions/ORES] - https://gerrit.wikimedia.org/r/1032071 (owner: Umherirrender)
[19:59:07] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144#9827458 (KStoller-WMF) Thanks for the detailed explanation! Ideally we want communities to be able to easily disable and enable this t...