[06:43:37] Good day folks!
[07:22:46] morning!
[07:25:49] kevinbazira: turns out /mnt/models/ doesn't exist at all in that directory.
[07:25:59] s/directory/image/
[07:27:10] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9728785 (10klausman) >>! In T362749#9727438, @kevinbazira wrote: > PermissionError: [Errno 13] Permission denied: '/mnt/models/logo_max_all.keras' I've fetched the im...
[07:28:24] morning Tobias!
[07:28:30] that is weird
[07:28:32] klausman: o/ thank you for checking. we don't usually create this dir manually. I wonder what has changed
[07:29:01] isaranto: o/
[07:29:05] Do the other Blubber-made images maybe use a different MODEL_DIR?
[07:29:13] o/ kevinbazira :D
[07:30:06] If I read the error correctly, the code _tries_ to create it (it's Perm denied, not dir does not exist), so if the models normally live in a place that the final image leaves writable to the model server, it would probably work
[07:30:48] ah, I see other images use the same dir
[07:32:57] yes all images use the same dir. the storage-initializer container takes care of downloading the model from swift to /mnt/models/. The logs from this container look ok btw https://phabricator.wikimedia.org/P61000
[07:33:05] The main difference I see is that the logo-detection blubberfile is the only one using Bookworm
[07:33:30] everything else is Bullseye or amd-pytorch21/22
[07:34:15] good point!
[07:34:36] pytorch images are based on the python3-bookworm image though
[07:34:46] But it's a bit wild that a Debian distro change would fail like this
[07:37:33] could it just be a file permission thingy?
[07:37:51] What do you mean?
[07:38:47] Oh you mean the model is there but not readable?
[07:39:04] yes. I am looking at the keras documentation
[07:47:17] So as I understand it, the model binary at /mnt/models is put there by the storage initializer, fetched from S3.
[07:52:20] that's correct
[07:52:34] So if its permissions are wrong/too strict, maybe it happens at that point?
[07:52:44] I've compared the permissions of both the revertrisk and logo-detection models and they share the same access control: `mlserve:prod: FULL_CONTROL`
[07:52:44] https://phabricator.wikimedia.org/P61002
[07:53:37] Yeah, I think the S3 side is fine, permission-wise, otherwise the storage-init would fail before the kserve container is even launched
[07:54:35] ah. I have a suspicion: between Bullseye and Bookworm, the default umask of users changed, so unless you specify something else, newly created files have stricter permissions in the new image than in the old one.
[08:02:48] Hrm, no, that's probably not it. The umask in the container (which would be the same as the Bookworm default) is 0022, which is permissive enough
[08:04:41] kevinbazira: we could tweak the entrypoint.sh script to do an ls -l on the model file, so we can at least see what the permissions are when the container starts. Just to verify they're too strict and it's not something else going on
[08:05:20] (though I don't know where that file comes from/which repo it lives in)
[08:06:35] the model file or blubberfile?
[08:06:38] model on Swift: s3://wmf-ml-models/logo-detection/20240417132942/logo_max_all.keras
[08:06:38] blubberfile in repo: https://github.com/wikimedia/machinelearning-liftwing-inference-services/blob/main/.pipeline/logo_detection/blubber.yaml
[08:10:03] entrypoint :)
[08:10:07] but I found it
[08:11:10] I'm trying sth on the current deployment, is that ok?
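As an aside on the `ls -l`-in-the-entrypoint idea above: a minimal Python sketch of the same check, assuming only that the model server can read `/mnt/models` at startup. Whether this belongs in the entrypoint script or a temporary debug hook is left open; it simply prints what `ls -la` would show.

```
import os
import stat

MODEL_DIR = "/mnt/models"  # directory populated by the storage-initializer, per the discussion above

def show(path: str) -> None:
    # Print mode, owner uid:gid, size and path, roughly one `ls -la` row.
    st = os.stat(path)
    print(f"{stat.filemode(st.st_mode)} {st.st_uid}:{st.st_gid} {st.st_size:>12} {path}")

show(MODEL_DIR)
for name in sorted(os.listdir(MODEL_DIR)):
    show(os.path.join(MODEL_DIR, name))
```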
[08:11:11] So I'd add something like `ls -la /mnt/models` to model_server_entrypoint.sh, build the logo image, bump the deployment chart to use the new image and see what the output is
[08:11:22] isaranto: fine by me
[08:11:27] will be over in ~5'
[08:11:35] isaranto: okok
[08:17:18] no luck :(
[08:17:27] :/
[08:18:49] I noticed that the file I downloaded from https://analytics.wikimedia.org/published/wmf-ml-models/logo-detection/ had some extra attributes, so I thought I'd try removing them, but that didn't work
[08:19:16] what kind of attributes?
[08:19:42] I noticed these permissions `-rw-r--r--@` and these attributes
[08:19:42] ```
[08:19:42] com.apple.metadata:kMDItemWhereFroms
[08:19:42] com.apple.quarantine
[08:19:42] ```
[08:20:22] yeah, those are macOS-specific
[08:20:51] if I download the file, I get just regular stuff (0644, no specials)
[08:21:07] ok then it could be me
[08:21:14] but in any case this wasn't it :(
[08:21:40] No, the two attrs you have are normal on macOS for downloaded files (quarantine is about malware protection, for example)
[08:22:02] so it's not just you :)
[08:22:45] I found it weird as they don't exist in all the other models I have, but it could be due to the file extension
[08:24:08] the file is a zip file, I suspect the browser detects that and is extra careful
[08:29:08] (03PS1) 10Kevin Bazira: logo-detection: downgrade bookworm to bullseye [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021397 (https://phabricator.wikimedia.org/T362749)
[08:29:46] (03CR) 10Klausman: [C:03+1] logo-detection: downgrade bookworm to bullseye [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021397 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira)
[08:30:27] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021397 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira)
[08:35:36] (03Merged) 10jenkins-bot: logo-detection: downgrade bookworm to bullseye [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021397 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira)
[08:47:14] Looks like it's still crashlooping, so bullseye vs. bookworm wasn't it.
[08:50:45] :/
[08:53:27] klausman: I checked `helmfile -e ml-staging-codfw diff` and it shows there's a new image
[08:53:35] has this been deployed yet?
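Since the `.keras` artifact is just a zip archive (which is why the browser treats the download cautiously), one low-effort sanity check on a local copy is to make sure it opens cleanly as one. A rough sketch only; the filename is whatever was downloaded locally:

```
import zipfile

path = "logo_max_all.keras"  # local copy of the downloaded model

# Keras v3 .keras files are zip archives; a truncated or corrupt download
# would show up here, before any permission questions come into play.
if zipfile.is_zipfile(path):
    with zipfile.ZipFile(path) as zf:
        print(zf.namelist())   # usually config.json, metadata.json and the weights file
        print(zf.testzip())    # None means no corrupt members
else:
    print(f"{path} does not look like a zip archive")
```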
[08:53:49] ah, I thought you had :)
[08:53:55] not yet
[08:54:01] lemme do it now
[08:54:31] ack
[08:54:46] done
[08:55:14] status: PodInitializing
[08:56:08] ah, still the same error with the new image
[08:56:16] yep: `CrashLoopBackOff`
[08:56:34] and the logs show the permission error as before
[08:57:12] yes yes: `PermissionError: [Errno 13] Permission denied: '/mnt/models/logo_max_all.keras'`
[08:59:45] The most recent update to the storage initializer was March 1st; if that had broken matters, we would have noticed earlier
[09:17:16] (03PS4) 10AikoChou: revertrisk: add support for base model's payloads in batch model [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1020835 (https://phabricator.wikimedia.org/T358744)
[09:17:45] (03PS5) 10AikoChou: revertrisk: add support for base model's payloads in batch model [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1020835 (https://phabricator.wikimedia.org/T358744)
[09:19:40] so I modified the entrypoint so I could attach a shell to the running pod and check the file permissions
[09:19:40] https://phabricator.wikimedia.org/T362749#9729015
[09:19:45] they seem ok
[09:19:56] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9729015 (10isarantopoulos) The directory is being created by the storage-initializer container when the pod initializes. I attached a shell to the running pod and chec...
[09:20:08] (03CR) 10AikoChou: [C:03+2] "Done" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1020835 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou)
[09:20:49] the other difference is that this image uses blubber buildkit:v0.22.0 while others use v0.21.0
[09:20:49] the buildkit changelog says nothing about permissions: https://gitlab.wikimedia.org/repos/releng/blubber/-/blob/main/CHANGELOG.md?ref_type=heads
[09:27:12] (03Merged) 10jenkins-bot: revertrisk: add support for base model's payloads in batch model [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1020835 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou)
[09:31:27] isaranto: can you also check the perms of the two intermediary directories, like `ls -ld /mnt /mnt/models`?
[09:32:04] on it
[09:34:31] they're the same https://phabricator.wikimedia.org/T362749#9729015
[09:35:01] I notice that the open call is done with "r+b". That's a problem
[09:35:32] `+` means the file is also opened for writing, which won't work since the file is owned by another user and `somebody` (the user we run as) doesn't have write permissions.
[09:35:49] Why keras would need to open the file with write access enabled I don't know
[09:36:15] you're right but that is hardcoded in keras https://github.com/keras-team/keras/blob/v3.0.4/keras/saving/saving_lib.py#L139
[09:37:08] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9729074 (10klausman) This is a bug in keras: it tries to open the file with mode `r+b` (read, append, binary), but since the file is owned by another user (`nobody` vs...
[09:37:09] Then it's a keras bug
[09:37:40] For most users this won't matter, since the owner of the file and the user running it are the same (or the latter is root).
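To make the failure mode concrete: with the file owned by another user and mode 0644 (as in the paste above), a read-only open succeeds while any update mode needs the write bit. A minimal reproduction, using the path from the traceback and assuming we run as the unprivileged `somebody` user:

```
# Assumes /mnt/models/logo_max_all.keras is owned by another user with mode 0644.
path = "/mnt/models/logo_max_all.keras"

with open(path, "rb") as f:    # read-only: only needs the r bit, so this works
    f.read(4)

with open(path, "r+b") as f:   # read/write ("update"): needs the w bit
    f.read(4)                  # never reached; open() raises PermissionError: [Errno 13]
```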
[09:38:10] Still, if keras doesn't really _need_ to be able to write to the file, the mode should just be "rb"
[09:39:36] I don't see an "r+b" reference in newer versions https://github.com/keras-team/keras/blob/master/keras/src/saving/saving_lib.py
[09:41:36] "saving" implies to me that write would be needed, or that they re-used a function for serving
[09:42:48] https://github.com/keras-team/keras/blob/04891e89da87c2a433beb12ff7dad59403c71671/keras/src/saving/saving_lib.py#L138 This function looks more reasonable for loading
[09:43:04] So yeah, maybe we need to bump the version of keras we use?
[09:50:25] I got to run an errand, bbiab
[09:50:29] (also lunch)
[09:55:55] TIL .keras is the new format for tensorflow models and SavedModel is legacy https://www.tensorflow.org/tutorials/keras/save_and_load
[10:03:53] cause I was thinking of switching to another format, but in keras v3 ONLY .keras works
[10:09:18] kevinbazira: I'm thinking of trying the latest keras version. wdyt?
[10:11:05] isaranto: thanks for the suggestion. I am testing `keras==3.2.0` and `keras-cv==0.8.2` since they are closest to what we've been using and don't use `r+b` but rather `rb`.
[10:11:05] (03PS1) 10Ilias Sarantopoulos: logo-detection: bump keras to 3.2.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021883 (https://phabricator.wikimedia.org/T362749)
[10:15:16] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9729134 (10isarantopoulos) Filed a patch to test latest keras version (3.2.1) which opens the file in "rb" mode [[ https://github.com/keras-team/keras/blob/master/ker...
[10:16:03] ah ok, lemme know what works. I managed to successfully build the image with 3.2.1. I think that we've found the issue
[10:26:17] on my local machine, I've tested both keras 3.2.0 and 3.2.1 with keras-cv==0.8.2 and the model-server ran without any issues.
[10:28:54] (03CR) 10Kevin Bazira: [C:03+1] "on my local machine, I've tested both keras 3.2.0 and 3.2.1 with keras-cv 0.8.2 and 0.8.1. the model-server run without any issues." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021883 (https://phabricator.wikimedia.org/T362749) (owner: 10Ilias Sarantopoulos)
[10:29:22] (03PS2) 10Ilias Sarantopoulos: logo-detection: bump keras to 3.2.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021883 (https://phabricator.wikimedia.org/T362749)
[10:31:35] (03CR) 10Ilias Sarantopoulos: [C:03+2] logo-detection: bump keras to 3.2.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021883 (https://phabricator.wikimedia.org/T362749) (owner: 10Ilias Sarantopoulos)
[10:31:45] ok, giving it a try!
[10:32:21] (03Merged) 10jenkins-bot: logo-detection: bump keras to 3.2.1 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021883 (https://phabricator.wikimedia.org/T362749) (owner: 10Ilias Sarantopoulos)
[10:43:34] great. the new image has been built: `2024-04-19-103235-publish`
[10:43:46] yes I am currently testing it on experimental
[10:43:52] okok
[10:45:38] it is running \o/
[10:47:41] nice work!
[10:48:37] nice! \o/
[10:48:38] thanks for pointing it out Tobias!
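For reference, the local smoke test described above boils down to loading the artifact with the newer keras and checking that it deserializes. This is only a sketch: the path is a local copy, and the logo-detection model may need `keras_cv` imported (or custom objects registered) before loading.

```
import keras
# import keras_cv  # may be needed so keras-cv layers/classes resolve

print(keras.__version__)  # expect >= 3.2.0, which opens .keras files with "rb"

# Load read-only and without compiling; enough to confirm the open-mode fix.
model = keras.saving.load_model("logo_max_all.keras", compile=False)
model.summary()
```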
it slipped right through when I was checking the code
[10:49:41] however now logo detection won't be able to download images as it has no connectivity to the "outside world"
[10:52:01] \o/
[10:56:24] thank you for helping troubleshoot the perms issue klausman and isaranto
[10:57:21] <3
[10:57:29] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9729256 (10isarantopoulos) Upgrading to keras==3.2.1 resolved the above issue. Nice catch @klausman! Now in order for requests to work we'd need to give the model serv...
[10:58:05] the model-server is now running in staging but can not access images :/
[11:00:28] yes this is what I wrote above --^
[11:04:13] thanks. I am going to ask Marco to provide sample URLs from the commons upload stash so that we can enable the logo-detection model-server to access them from LW/k8s
[11:06:55] Nice!
[11:07:00] * isaranto lunch
[11:17:19] * aiko lunch!
[11:35:14] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9729293 (10kevinbazira) Thank you for your help in troubleshooting and resolving the permissions issue, @klausman and @isarantopoulos!
[11:36:13] 06Machine-Learning-Team, 13Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9729294 (10kevinbazira) @mfossati, when a model-server is deployed within the WMF k8s infrastructure it has to be configured to enable it to access external resources...
[12:16:38] as I previously manually patched the isvc to test the new image I sent a new patch to update it https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1021908
[12:17:06] this one is a no-op as it is already deployed like that
[12:31:06] hello folks
[12:31:57] 'ello
[12:32:48] kevinbazira, isaranto: now that the permission issue is fixed, should we give bookworm another try?
[12:33:04] o/ Luca
[12:33:17] yes! doesn't make sense to keep bullseye
[12:33:32] yep
[12:33:56] pushing the patch in a bit
[12:34:06] Ack!
[12:42:13] (03PS1) 10Kevin Bazira: logo-detection: upgrade bullseye to bookworm [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021399 (https://phabricator.wikimedia.org/T362749)
[12:42:51] (03CR) 10Ilias Sarantopoulos: [C:03+1] logo-detection: upgrade bullseye to bookworm [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021399 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira)
[12:43:30] (03CR) 10Kevin Bazira: [C:03+2] logo-detection: upgrade bullseye to bookworm [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021399 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira)
[12:44:26] (03PS1) 10Elukey: revscoring_model: fix typo in error msg [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021913
[12:45:44] isaranto: I filed a patch to fix the typo, after that wdyt about adding the json payload logging to all revscoring isvcs?
[12:46:24] definitely!
[12:46:32] (03Merged) 10jenkins-bot: logo-detection: upgrade bullseye to bookworm [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021399 (https://phabricator.wikimedia.org/T362749) (owner: 10Kevin Bazira)
[12:46:58] (03CR) 10Ilias Sarantopoulos: [C:03+1] revscoring_model: fix typo in error msg [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021913 (owner: 10Elukey)
[12:47:44] I'll create the patch for deployment-charts
[12:47:55] I have one to go, don't worry
[12:48:08] we can use the istio mw api endpoint upgrade to roll them out
[12:48:22] we'll have to deploy anyway
[12:48:37] ok!
[12:48:37] (03PS2) 10Elukey: revscoring_model: fix typo in error msg [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021913
[12:53:17] isaranto: when you have a moment could you please also check https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021506?
[12:54:06] ah yes!
[12:55:37] nice catch! I forgot to remove these when we removed the deployments from staging
[12:56:34] thanks!
[12:57:23] at the beginning I was worried that I had broken some model server with the istio changes in staging
[12:57:43] likewise!
[12:57:50] (goes to the thanks)
[13:01:11] I noticed I messed up and updated the wrong image https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1021789
[13:04:18] you are sabotaging yourself :D
[13:11:15] yes, I tried to quickly merge the previous one so that I could then rebase Kevin's patch
[13:11:42] (03CR) 10Elukey: [C:03+2] revscoring_model: fix typo in error msg [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021913 (owner: 10Elukey)
[13:12:29] (03Merged) 10jenkins-bot: revscoring_model: fix typo in error msg [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1021913 (owner: 10Elukey)
[13:12:39] aiko: o/
[13:12:51] anything against me messing with RR-LA in staging?
[13:13:39] fine by me!
[13:29:33] ok folks, with the new istio config we are able to keep the current "proxied" setup for MW (namely having to specify the WIKI_URL) but also to just run without it (using the FORCE_HTTP flag, so we get metrics about HTTP return codes etc..)
[13:30:02] in theory, if we move to this new scheme, any future change to the backend will just be a single change in the istio config
[13:30:11] rather than a cascade of deploys
[13:32:05] we could in theory also add a workaround for the redirects like yue -> zh-yue
[13:32:24] but we'd probably need some tweaking of aiohttp, since by default it follows redirects
[13:32:55] we should have a helper function to check the redirect, and use HTTP instead of HTTPS
[13:33:04] elukey: o/
[13:33:04] this would allow us to avoid having to keep the map
[13:33:06] klausman, isaranto: the new logo-detection image with bookworm and keras 3.2.1 is up and running:
[13:33:06] ```
[13:33:06] NAME READY STATUS RESTARTS AGE
[13:33:06] logo-detection-predictor-00014-deployment-67f7499fbf-9zzp8 3/3 Running 0 108s
[13:33:06] ```
[13:33:14] kevinbazira: o/
[13:33:36] nice!
[13:35:32] neat!
[13:42:45] elukey: ack on the istio changes. I'll follow up on the task/patches and ask questions
[13:43:25] I am planning to do a little presentation, it is a big change, hopefully for the better
[13:59:32] elukey: no problem
[14:01:33] ah no, ok, I need a little more config to make both cases work
[14:06:31] aaand it seems that I am able to avoid the proxy also for the revscoring isvcs, without changes!
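On the redirect point raised above: a rough sketch of what such a helper could look like with aiohttp, i.e. disabling automatic redirect following and reading the Location header instead of keeping a hard-coded language map. The URL is only an illustrative example, and none of the mesh/proxy specifics are represented here.

```
import asyncio
import aiohttp

async def resolve_redirect(url: str) -> str:
    """Return the redirect target if the wiki answers with a redirect,
    otherwise the original URL (e.g. yue.wikipedia.org -> zh-yue.wikipedia.org)."""
    async with aiohttp.ClientSession() as session:
        async with session.get(url, allow_redirects=False) as resp:
            if resp.status in (301, 302, 303, 307, 308):
                return resp.headers.get("Location", url)
            return url

# Example only; a real helper would live next to the model server's HTTP code.
print(asyncio.run(resolve_redirect("https://yue.wikipedia.org/w/api.php")))
```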
[14:06:57] niiice
[14:06:57] * elukey dances
[14:07:11] breakthrough friday!
[14:13:21] 10Lift-Wing, 06Machine-Learning-Team: GPU errors in ml-staging - https://phabricator.wikimedia.org/T362984 (10isarantopoulos) 03NEW
[14:13:57] klausman: I created a task so that we can track the GPU issues that we had
[14:14:03] aye
[14:16:54] 10Lift-Wing, 06Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9729637 (10isarantopoulos)
[14:18:07] I added software/driver versions. I'd like to see if I can run something with pytorch+rocm5.4.2, but it can't be the same thing we want to run, as kserve restricts it (kserve needs at least torch 2.1.2 and there is no python wheel for 2.1.2 + rocm5.4.2)
[14:22:18] isaranto: I don't think the problem is related to the rocm version of the k8s node, it smells like something related to ROCm itself (maybe 5.5 is not enough for MI100, or there were some issues)
[14:22:33] ack
[14:24:39] we could try with 5.6 as it is available for torch 2.1.2 https://download.pytorch.org/whl/rocm5.6/torch/
[14:25:52] I am trying to create a venv on ml-staging2001 with torch + 5.5, to see if it works
[14:26:52] ah, not enough space on the root partition
[14:26:53] sigh
[14:27:16] of course we install the rocm drivers..
[14:28:30] also `somebody` isn't in the video group when I ran `groups` on the running pod. Unfortunately I cannot check the same for the nllb pod as I don't have perms
[14:30:15] I remember Tobias asking about it but I didn't check at the time
[14:32:11] (03Abandoned) 10Ilias Sarantopoulos: revscoring: customize kserve logs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/964568 (https://phabricator.wikimedia.org/T333804) (owner: 10Ilias Sarantopoulos)
[14:33:31] elukey: I deleted another gig or so of stuff from my homedir to make room
[14:39:27] klausman: no problem thanks, I don't think it fits with all the GBs needed :(
[14:42:02] isaranto: in theory the video group shouldn't matter since we allow the device to others
[14:44:39] the other thing that I am wondering is where the error comes from
[14:44:52] in theory, from the error msg, it seems to come from libdrm
[14:45:36] https://github.com/grate-driver/libdrm/blob/master/amdgpu/amdgpu_device.c#L80
[14:45:49] Agreed. Strace sorta supports that. What I find a bit odd is that it first tries access(), which fails with EPERM, but it then continues to try the ioctl anyway, getting EACCES
[14:45:59] and https://packages.debian.org/search?searchon=names&keywords=libdrm-amdgpu1 shows that the bullseye version is different from the bookworm one, maybe something new?
[14:47:24] Interesting..
[14:51:29] * isaranto afk. be back in an hour
[15:15:59] 06Machine-Learning-Team, 13Patch-For-Review: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing - https://phabricator.wikimedia.org/T353622#9729728 (10elukey) After some tests I was able to find an istio configuration to support both transparent and non-trans...
[16:11:59] * klausman heading out now, have a nice weekend, everybody!
[16:16:26] I'm back for a short while
[16:16:35] Have a nice weekend Tobias!
[16:21:08] Going afk as well, have a nice long weekend folks! Remember that Monday is a holiday
[16:23:09] have a nice weekend! :)
[16:36:06] ciao Luca o/
[16:56:18] going afk for the weekend. cu folks <3
[17:02:24] logging off for the weekend as well o/
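A small follow-up note on the GPU debugging thread above: once a venv with a matching torch/ROCm wheel exists (e.g. from the rocm5.6 index linked earlier), a quick check like the sketch below should exercise the same device-initialization path that is failing in the pod. This assumes a ROCm build of torch is installed; none of the hostnames or versions here are prescriptive.

```
import torch

print("torch:", torch.__version__)
print("hip:", torch.version.hip)                    # set on ROCm builds, None otherwise
print("gpu available:", torch.cuda.is_available())  # triggers /dev/kfd + /dev/dri access
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```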