[06:51:00] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 10MoveComms-Support, 07Chinese-Sites: Support languages whose add-a-link models were not published - https://phabricator.wikimedia.org/T309263#10640336 (10Aklapper) [Removing outdated 2023 project tag]
[07:34:18] (03PS1) 10Kevin Bazira: article-country: include empty country links in tfidf_sum for normalization [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128323 (https://phabricator.wikimedia.org/T385970)
[08:14:29] good morning!
[08:40:57] good morning all
[08:48:53] o/ George
[09:14:39] \o
[09:54:26] I will try to use https://docker-registry.wikimedia.org/amd-pytorch25/tags/ for edit-check on gpu
[09:59:25] ack! for cross-reference, check the huggingfaceserve and llm blubber files if anything else needs to be added to the image
[10:00:00] iirc the pythonpath needs to be altered (as in the llm image: PYTHONPATH: /opt/lib/python/site-packages:/srv/app:/opt/lib/venv/lib/python3.11/site-packages) as the newer blubber versions don't extend the pythonpath (they just replace it)
[10:00:16] ping me if you face any issue and we can take a look together
[10:02:04] isaranto: how can I test it?
[10:03:07] try it locally first. you can still build the image, and since no gpu is found it will fall back to cpu
[10:04:13] ok ty
[10:04:33] it should work I mean, again if you face any issue lemme know
[10:56:48] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM! Thanks Kevin for working on this!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128323 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira)
[10:58:42] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review, Ilias!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128323 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira)
[10:59:27] (03Merged) 10jenkins-bot: article-country: include empty country links in tfidf_sum for normalization [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128323 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira)
[11:07:27] * isaranto afk lunch
[12:21:46] there's been a brief spike of lw_inference_reference_need_cluster errors which went beyond the paging threshold (https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&viewPanel=57&var-datasource=eqiad%20prometheus%2Fk8s&var-instance=api-gateway&from=now-1h&to=now&refresh=30s) it went down again, was that caused by some deployment or similar?
[12:26:06] moritzm o/ thanks for the ping. there was no new deployment but this has been ongoing for a week. We are working to issue an improvement/fix that will make the service more viable. We'll have sth probably today or tomorrow
[12:30:01] ok, thanks!
[12:30:37] is there any alert that fired that we are missing?
[12:36:49] the alert came via the API gateway, which alerts on elevated 5xxs for all the services behind it
[12:36:51] "GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad"
[12:37:41] but it's okay, if the underlying issue is being worked on, I'll simply drop a note to the US timezone on-call folks so that they are aware
[13:03:35] oh okay, thanks!
[13:24:04] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10641633 (10kevinbazira) >>! In T385970#10636566, @Isaac wrote: > as we discussed separately, let's update the code to still inc...
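[Editor's note: the patch title above, "include empty country links in tfidf_sum for normalization", can be illustrated with a minimal sketch. This is NOT the actual inference-services code; the function and variable names (`normalized_scores`, `empty_link_tfidf`) are hypothetical, and only the general idea — counting country-less links in the normalization denominator — comes from the patch title.]

```python
# Hypothetical sketch: when normalizing per-country tf-idf scores for an
# article's wikilinks, include the tf-idf mass of links that resolve to no
# country ("empty" country links) in the denominator, so that articles whose
# links mostly map to no country get proportionally lower country scores.
# Not the actual machinelearning/liftwing/inference-services code.

def normalized_scores(link_tfidf: dict[str, float], empty_link_tfidf: float) -> dict[str, float]:
    """Normalize country tf-idf scores, counting empty country links in tfidf_sum."""
    tfidf_sum = sum(link_tfidf.values()) + empty_link_tfidf
    if tfidf_sum == 0:
        return {country: 0.0 for country in link_tfidf}
    return {country: score / tfidf_sum for country, score in link_tfidf.items()}

scores = normalized_scores({"Kenya": 3.0, "Japan": 1.0}, empty_link_tfidf=4.0)
print(scores)  # → {'Kenya': 0.375, 'Japan': 0.125}
```

Without the `empty_link_tfidf` term the same two countries would score 0.75 and 0.25, overstating how strongly the article is tied to any country.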
[13:43:40] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10641713 (10Isaac) Looks great -- thanks!!
[13:52:51] (03PS1) 10Ilias Sarantopoulos: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019)
[13:53:20] (03PS2) 10Ilias Sarantopoulos: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019)
[13:54:56] (03PS3) 10Ilias Sarantopoulos: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019)
[13:59:42] (03PS4) 10Ilias Sarantopoulos: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019)
[14:00:33] folks I need a review on the above, for the reference-need issues. happy to jump on a call in 1h (after a meeting) if anyone has clarifying questions
[14:03:39] i'm going to share some local load testing on the phabricator task in a bit, and then I plan to run more load testing on ml-staging
[14:05:38] I'm on it
[14:05:49] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10641810 (10kevinbazira) >>! In T385970#10641713, @Isaac wrote: > Looks great -- thanks!! super! article-country wikilink-relate...
[14:05:59] the article-country model-server with wikilink-related predictions is now live in prod --^
[14:11:39] \o/
[14:21:28] (03CR) 10Gkyziridis: "Thank you for working on this Ilias." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[14:22:23] (03CR) 10Gkyziridis: [C:03+1] reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[14:43:41] Folks I am having some issues building this image: https://docker-registry.wikimedia.org/amd-pytorch25/tags/
[14:44:12] I am getting: ```- Requested platform "linux/arm64" does not match result platform "linux/amd64"```
[14:45:01] and when the container runs I am getting:
[14:45:01] ```E: List directory /var/lib/apt/lists/partial is missing. - Acquire (13: Permission denied)```
[14:45:44] how are you building the image? Docker-pkg?
[14:45:50] I tried to build it in multiple ways: local blubber, pulling it locally and then building it, and via a Dockerfile
[14:46:06] I am on a mac m4
[14:46:16] yep yep I figured
[14:46:39] but have you used docker-pkg to build it?
[14:47:31] I have the docker daemon installed and I just did: ```docker build -t edit_check:gpu .```
[14:47:43] not sure if I answered :P
[14:49:08] the strange thing is that the selected builder in docker engine is `docker-linux`, which says it supports: arm64, amd64, amd64/v2, riscv64, ppc64le, s390x, 386
[14:53:05] georgekyz: wait I am a bit confused, didn't you say that you were building amd-pytorch25?
[14:53:16] the one in the production-images repo I mean
[14:53:28] that needs a tool called docker-pkg
[14:53:36] you can pip install it in a python venv
[14:54:40] I have docker desktop on mac. And I followed the same workflow as when I was building the bullseye image.
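[Editor's note: the "Requested platform" error above is the usual symptom of building an amd64-only base image (like amd-pytorch25) on Apple Silicon, where the host is arm64; the fix discussed below is to pass `--platform=linux/amd64`. A small sketch of that host/target check; the helper name is hypothetical, not part of any tooling mentioned in the log:]

```python
# Hypothetical helper: decide whether a docker build on this host needs an
# explicit --platform flag to match an amd64-only base image. On a mac m4,
# platform.machine() reports "arm64", so the flag is required.

def docker_platform_flag(host_machine: str, target_arch: str = "amd64") -> str:
    """Return the --platform flag to pass when host arch differs from the target."""
    host_arch = "arm64" if host_machine.lower() in ("arm64", "aarch64") else "amd64"
    if host_arch != target_arch:
        return f"--platform=linux/{target_arch}"
    return ""  # host already matches the target, no flag needed

print(docker_platform_flag("arm64"))   # → --platform=linux/amd64
print(docker_platform_flag("x86_64"))  # → (empty, no flag needed)
```

In practice you would feed `platform.machine()` from the stdlib `platform` module into the helper.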
[14:54:51] thnx for your time
[14:54:54] georgekyz: o/ you don't have to build the pytorch image since you are just going to use it in the blubber image you are building
[14:55:19] yeah but it still fails
[14:55:31] I am trying to test it locally
[14:55:57] you'd need to run sth like
[14:55:57] docker build --target production -f .pipeline/edit-check/blubber.yaml --platform=linux/amd64 -t editcheck:gpu .
[14:56:16] (don't copy-paste, maybe some paths are wrong above)
[14:56:49] oh that was my bad, I was running the above but without specifying the --platform
[14:57:57] still you also need the rest of the arguments to use with blubber (since we're not using a dockerfile directly)
[14:58:04] unless I am totally missing sth
[14:59:51] nope, you're not missing anything, that was my fault.
[15:09:39] 06Machine-Learning-Team, 13Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10642161 (10JMeybohm)
[15:09:45] (03PS1) 10Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
[15:10:00] (03PS2) 10Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
[15:11:50] 06Machine-Learning-Team, 13Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10642179 (10elukey)
[15:13:21] 06Machine-Learning-Team, 13Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10642196 (10klausman)
[15:13:26] cool cool
[15:20:49] (03PS5) 10Ilias Sarantopoulos: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019)
[15:49:26] (03PS6) 10Ilias Sarantopoulos: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019)
[15:52:18] (03CR) 10Ilias Sarantopoulos: "@gkyziridis@wikimedia.org Apologies but I refactored the code to use the functionality we already had under process_utils.py that handled " [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[15:55:55] georgekyz: sorry for the above refactoring but I figured it would be best to reuse the existing code
[15:57:01] that's awesome! Thnx for refactoring, it is better indeed!
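[Editor's note: the "reference-need: multiprocessing in predict" patch is only named here, not shown. As a hedged sketch of the general technique (a standard process pool inside predict, as the discussion below describes), assuming a hypothetical CPU-bound per-sentence scorer; the real code reuses existing helpers in process_utils.py and differs:]

```python
from concurrent.futures import ProcessPoolExecutor


def score_sentence(sentence: str) -> float:
    # Hypothetical stand-in for the CPU-bound reference-need scoring of one
    # sentence; the actual model code is different.
    return min(1.0, len(sentence) / 100)


def predict(sentences: list[str], workers: int = 4) -> list[float]:
    """Fan the CPU-bound per-sentence scoring out over a pool of worker
    processes, so one slow request uses multiple cores instead of blocking
    a single one."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_sentence, sentences))


if __name__ == "__main__":
    print(predict(["a" * 50, "b" * 200], workers=2))  # → [0.5, 1.0]
```

A process pool (rather than threads) matters here because CPU-bound scoring in Python is otherwise serialized by the GIL.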
[15:57:44] (03CR) 10Gkyziridis: [C:03+1] reference-need: multiprocessing in predict (032 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[15:57:45] I started with a completely different approach using torch.multiprocessing, but since I ended up with a standard process pool in the end, I realized this would be best
[15:58:47] nice, I +1'd it
[15:59:36] btw this is interesting and included in kserve 0.14 https://kserve.github.io/website/latest/blog/articles/2024-12-13-KServe-0.14-release/#introducing-model-cache
[16:00:18] it is a model cache using a pvc to avoid fetching the model every time, thus minimizing storage-initializer delay (designed especially for LLMs)
[16:00:59] thanks for the review
[16:06:18] So it uses the `LocalModelCache` by default in the newest release? ~~~~~^
[16:06:36] or should we change anything in the architecture?
[16:08:06] from what I read the model cache is disabled by default so you can configure which models you'll use it for
[16:08:53] (03CR) 10Gkyziridis: inference-services: edit-check service on GPU. (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[16:08:56] if and when we decide to use it we'll need to set up some volumes for sure. I will check what is needed
[16:09:33] it would be nice to test it especially on big models
[16:09:44] let's see
[17:03:24] (03CR) 10Ilias Sarantopoulos: [C:03+1] inference-services: edit-check service on GPU. (032 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[17:03:39] going afk folks, have a nice evening/rest of day!
[17:19:36] (03PS3) 10Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
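[Editor's note: earlier in the log it is mentioned that the locally built edit-check image "will fall back to cpu" when no GPU is found. A hedged sketch of that kind of device-selection fallback; this is not the actual edit-check service code, and it is written so the torch import itself is optional:]

```python
def select_device() -> str:
    """Prefer a GPU device when torch reports one, otherwise fall back to CPU.

    ROCm builds of torch (as in the amd-pytorch25 image) expose the same
    torch.cuda API, so the "cuda" device string covers AMD GPUs too.
    """
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass  # torch not installed at all, e.g. in a slim local test environment
    return "cpu"


print(f"running on: {select_device()}")
```

This mirrors why the image can be tested on a laptop: with no GPU visible, the same code path simply runs on CPU.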