[06:51:00] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 10MoveComms-Support, 07Chinese-Sites: Support languages whose add-a-link models were not published - https://phabricator.wikimedia.org/T309263#10640336 (10Aklapper) [Removing outdated 2023 project tag]
[07:34:18] (03PS1) 10Kevin Bazira: article-country: include empty country links in tfidf_sum for normalization [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128323 (https://phabricator.wikimedia.org/T385970)
[08:14:29] good morning!
[08:40:57] good morning all
[08:48:53] o/ George
[09:14:39] \o
[09:54:26] I will try to use https://docker-registry.wikimedia.org/amd-pytorch25/tags/ for edit-check on gpu
[09:59:25] ack! for cross-reference, check the huggingfaceserve and llm blubber files if anything else needs to be added to the image
[10:00:00] iirc the pythonpath needs to be altered (as in the llm image: PYTHONPATH: /opt/lib/python/site-packages:/srv/app:/opt/lib/venv/lib/python3.11/site-packages) as the newer blubber versions don't extend the pythonpath (they just replace it)
[10:00:16] ping me if you face any issue and we can take a look together
[10:02:04] isaranto: how can I test it?
[10:03:07] try it locally first. you can still build the image, and since no gpu is found it will fall back to cpu
[10:04:13] ok ty
[10:04:33] it should work I mean, again if you face any issue lemme know
[10:56:48] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM! Thanks Kevin for working on this!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128323 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira)
[10:58:42] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review, Ilias!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128323 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira)
[10:59:27] (03Merged) 10jenkins-bot: article-country: include empty country links in tfidf_sum for normalization [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128323 (https://phabricator.wikimedia.org/T385970) (owner: 10Kevin Bazira)
[11:07:27] * isaranto afk lunch
[12:21:46] there's been a brief spike of lw_inference_reference_need_cluster errors which went beyond the paging threshold (https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&viewPanel=57&var-datasource=eqiad%20prometheus%2Fk8s&var-instance=api-gateway&from=now-1h&to=now&refresh=30s) it went down again, was that caused by some deployment or similar?
[12:26:06] moritzm o/ thanks for the ping. there was no new deployment but this has been ongoing for a week. We are working to issue an improvement/fix that will make the service more viable. We'll have sth probably today or tomorrow
[12:30:01] ok, thanks!
[12:30:37] is there any alert that fired that we are missing?
[12:36:49] the alert came via the API gateway, which alerts on elevated 5xxs for all the services behind it
[12:36:51] "GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad"
[12:37:41] but it's okay, if the underlying issue is being worked on, I'll simply drop a note to the US timezone on-call folks so that they are aware
[13:03:35] oh okay, thanks!
[13:24:04] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10641633 (10kevinbazira) >>! In T385970#10636566, @Isaac wrote: > as we discussed separately, let's update the code to still inc...
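[Editor's note: the patch title above, "include empty country links in tfidf_sum for normalization", can be illustrated with a minimal sketch. This is NOT the actual inference-services code; the function and variable names (`normalized_scores`, `empty_link_tfidf`) are hypothetical, and only the general idea — counting country-less links in the normalization denominator — comes from the patch title.]

```python
# Hypothetical sketch: when normalizing per-country tf-idf scores for an
# article's wikilinks, include the tf-idf mass of links that resolve to no
# country ("empty" country links) in the denominator, so that articles whose
# links mostly map to no country get proportionally lower country scores.
# Not the actual machinelearning/liftwing/inference-services code.

def normalized_scores(link_tfidf: dict[str, float], empty_link_tfidf: float) -> dict[str, float]:
    """Normalize country tf-idf scores, counting empty country links in tfidf_sum."""
    tfidf_sum = sum(link_tfidf.values()) + empty_link_tfidf
    if tfidf_sum == 0:
        return {country: 0.0 for country in link_tfidf}
    return {country: score / tfidf_sum for country, score in link_tfidf.items()}

scores = normalized_scores({"Kenya": 3.0, "Japan": 1.0}, empty_link_tfidf=4.0)
print(scores)  # → {'Kenya': 0.375, 'Japan': 0.125}
```

Without the `empty_link_tfidf` term the same two countries would score 0.75 and 0.25, overstating how strongly the article is tied to any country.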
[13:43:40] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10641713 (10Isaac) Looks great -- thanks!!
[13:52:51] (03PS1) 10Ilias Sarantopoulos: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019)
[13:53:20] (03PS2) 10Ilias Sarantopoulos: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019)
[13:54:56] (03PS3) 10Ilias Sarantopoulos: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019)
[13:59:42] (03PS4) 10Ilias Sarantopoulos: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019)
[14:00:33] folks I need a review on the above, for the reference-need issues. happy to jump on a call in 1h (after a meeting) if anyone has clarifying questions
[14:03:39] i'm going to share some local load testing on the phabricator task in a bit, and then I plan to run more load testing on ml-staging
[14:05:38] I'm on it
[14:05:49] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10641810 (10kevinbazira) >>! In T385970#10641713, @Isaac wrote: > Looks great -- thanks!! super! article-country wikilink-relate...
[14:05:59] the article-country model-server with wikilink-related predictions is now live in prod --^
[14:11:39] \o/
[14:21:28] (03CR) 10Gkyziridis: "Thank you for working on this Ilias." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[14:22:23] (03CR) 10Gkyziridis: [C:03+1] reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[14:43:41] Folks I am having some issues building this image: https://docker-registry.wikimedia.org/amd-pytorch25/tags/
[14:44:12] I am getting: ```- Requested platform "linux/arm64" does not match result platform "linux/amd64"```
[14:45:01] and when the container runs I am getting:
[14:45:01] ```E: List directory /var/lib/apt/lists/partial is missing. - Acquire (13: Permission denied)```
[14:45:44] how are you building the image? Docker-pkg?
[14:45:50] I tried to build it in multiple ways: local blubber, pulling it locally and then building it, and via a Dockerfile
[14:46:06] I am on a mac m4
[14:46:16] yep yep I figured
[14:46:39] but have you used docker-pkg to build it?
[14:47:31] I have the docker daemon installed and I just did: ```docker build -t edit_check:gpu .```
[14:47:43] not sure if I answered :P
[14:49:08] the strange thing is that the selected builder in docker engine is `docker-linux`, which says it supports: arm64, amd64, amd64/v2, riscv64, ppc64le, s390x, 386
[14:53:05] georgekyz: wait I am a bit confused, didn't you say that you were building amd-pytorch25?
[14:53:16] the one in the production-images repo I mean
[14:53:28] that needs a tool called docker-pkg
[14:53:36] you can pip install it in a python venv
[14:54:40] I have docker desktop on mac. And I followed the same workflow as when I was building the bullseye image.
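[Editor's note: the "Requested platform" error above is the usual symptom of building an amd64-only base image (like amd-pytorch25) on Apple Silicon, where the host is arm64; the fix discussed below is to pass `--platform=linux/amd64`. A small sketch of that host/target check; the helper name is hypothetical, not part of any tooling mentioned in the log:]

```python
# Hypothetical helper: decide whether a docker build on this host needs an
# explicit --platform flag to match an amd64-only base image. On a mac m4,
# platform.machine() reports "arm64", so the flag is required.

def docker_platform_flag(host_machine: str, target_arch: str = "amd64") -> str:
    """Return the --platform flag to pass when host arch differs from the target."""
    host_arch = "arm64" if host_machine.lower() in ("arm64", "aarch64") else "amd64"
    if host_arch != target_arch:
        return f"--platform=linux/{target_arch}"
    return ""  # host already matches the target, no flag needed

print(docker_platform_flag("arm64"))   # → --platform=linux/amd64
print(docker_platform_flag("x86_64"))  # → (empty, no flag needed)
```

In practice you would feed `platform.machine()` from the stdlib `platform` module into the helper.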
[14:54:51] thnx for your time
[14:54:54] georgekyz: o/ you don't have to build the pytorch image since you are just going to use it in the blubber image you are building
[14:55:19] yeah but it still fails
[14:55:31] I am trying to test it locally
[14:55:57] you'd need to run sth like
[14:55:57] docker build --target production -f .pipeline/edit-check/blubber.yaml --platform=linux/amd64 -t editcheck:gpu .
[14:56:16] (don't copy-paste, maybe some paths are wrong above)
[14:56:49] oh that was my bad, I was running the above but without specifying the --platform
[14:57:57] still you also need the rest of the arguments to use with blubber (since we're not using a dockerfile directly)
[14:58:04] unless I am totally missing sth
[14:59:51] nope, you're not missing anything, that was my fault.
[15:09:39] 06Machine-Learning-Team, 13Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10642161 (10JMeybohm)
[15:09:45] (03PS1) 10Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
[15:10:00] (03PS2) 10Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
[15:11:50] 06Machine-Learning-Team, 13Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10642179 (10elukey)
[15:13:21] 06Machine-Learning-Team, 13Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10642196 (10klausman)
[15:13:26] cool cool
[15:20:49] (03PS5) 10Ilias Sarantopoulos: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019)
[15:49:26] (03PS6) 10Ilias Sarantopoulos: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019)
[15:52:18] (03CR) 10Ilias Sarantopoulos: "@gkyziridis@wikimedia.org Apologies but I refactored the code to use the functionality we already had under process_utils.py that handled " [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[15:55:55] georgekyz: sorry for the above refactoring but I figured it would be best to reuse the existing code
[15:57:01] that's awesome! Thnx for refactoring, it is better indeed!
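[Editor's note: the "reference-need: multiprocessing in predict" patch is only named here, not shown. As a hedged sketch of the general technique (a standard process pool inside predict, as the discussion below describes), assuming a hypothetical CPU-bound per-sentence scorer; the real code reuses existing helpers in process_utils.py and differs:]

```python
from concurrent.futures import ProcessPoolExecutor


def score_sentence(sentence: str) -> float:
    # Hypothetical stand-in for the CPU-bound reference-need scoring of one
    # sentence; the actual model code is different.
    return min(1.0, len(sentence) / 100)


def predict(sentences: list[str], workers: int = 4) -> list[float]:
    """Fan the CPU-bound per-sentence scoring out over a pool of worker
    processes, so one slow request uses multiple cores instead of blocking
    a single one."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_sentence, sentences))


if __name__ == "__main__":
    print(predict(["a" * 50, "b" * 200], workers=2))  # → [0.5, 1.0]
```

A process pool (rather than threads) matters here because CPU-bound scoring in Python is otherwise serialized by the GIL.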
[15:57:44] (03CR) 10Gkyziridis: [C:03+1] reference-need: multiprocessing in predict (032 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos)
[15:57:45] I started with a completely different approach using torch.multiprocessing, but since I ended up with a standard process pool in the end, I realized this would be best
[15:58:47] nice, I +1'd it
[15:59:36] btw this is interesting and included in kserve 0.14 https://kserve.github.io/website/latest/blog/articles/2024-12-13-KServe-0.14-release/#introducing-model-cache
[16:00:18] it is a model cache using a pvc to avoid fetching the model every time, thus minimizing storage-initializer delay (designed especially for LLMs)
[16:00:59] thanks for the review
[16:06:18] So it uses the `LocalModelCache` by default in the newest release? ~~~~~^
[16:06:36] or should we change anything in the architecture?
[16:08:06] from what I read the model cache is disabled by default so you can configure which models you'll use it for
[16:08:53] (03CR) 10Gkyziridis: inference-services: edit-check service on GPU. (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[16:08:56] if and when we decide to use it we'll need to set up some volumes for sure. I will check what is needed
[16:09:33] it would be nice to test it especially on big models
[16:09:44] let's see
[17:03:24] (03CR) 10Ilias Sarantopoulos: [C:03+1] inference-services: edit-check service on GPU. (032 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[17:03:39] going afk folks, have a nice evening/rest of day!
[17:19:36] (03PS3) 10Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
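[Editor's note: earlier in the log it is mentioned that the locally built edit-check image "will fall back to cpu" when no GPU is found. A hedged sketch of that kind of device-selection fallback; this is not the actual edit-check service code, and it is written so the torch import itself is optional:]

```python
def select_device() -> str:
    """Prefer a GPU device when torch reports one, otherwise fall back to CPU.

    ROCm builds of torch (as in the amd-pytorch25 image) expose the same
    torch.cuda API, so the "cuda" device string covers AMD GPUs too.
    """
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass  # torch not installed at all, e.g. in a slim local test environment
    return "cpu"


print(f"running on: {select_device()}")
```

This mirrors why the image can be tested on a laptop: with no GPU visible, the same code path simply runs on CPU.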