[08:46:16] 10Machine-Learning-Team, 10Gerrit, 10Release-Engineering-Team (Seen): gerrit: scoring/ores/editquality takes a long time to git gc - https://phabricator.wikimedia.org/T237807 (10hashar) 05Open→03Resolved @thcipriani script is now in Puppet and available via `/usr/local/bin/gerrit-git-gc-timing` I have d...
[09:06:26] hello folks
[09:13:52] \o
[09:14:32] Currently working on an issue with the deployed AWS stuff with Santhosh. Something's going on with normal-width vs full-width commas in Chinese output.
[09:15:07] I doubt it's something I broke with our setup vs. Meta's, but rather it's a difference in their model. We'll see.
[09:16:54] ack yes, I am currently trying to finish the new docker image
[09:17:07] I hoped they hadn't deployed, but we'll do another one in the coming days
[09:18:42] I don't think they deployed for real yet
[09:20:03] klausman: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865063
[09:20:09] it was deployed this morning
[09:20:28] Oh
[09:20:31] Welp.
[10:02:41] klausman: the new Docker image seems to be working fine (I've also added an extra-strict chmod 550 to the files used by the model server)
[10:02:46] running deploy.py now
[10:02:52] ack.
[10:03:32] As for image size, I did a check yesterday asking apt which packages are the largest. The top 15 or so are all CUDA, then there's the JDK with 150M, and then libc
[10:04:25] the last build is 9.18GB
[10:05:22] https://pastebin.mozilla.org/TsHmOuyq
[10:05:28] the numbers are in kB
[10:06:00] We could probably remove a few bits, but I doubt it'll make more than a 100M difference
[10:06:13] if that
[10:06:13] I see
[10:06:14] 3.3G usr
[10:06:14] 5.4G opt
[10:06:31] Yeah, all that Python stuff is in /opt. /usr is mostly CUDA
[10:06:45] and in site-packages
[10:06:45] 1.4G nvidia
[10:06:45] 3.3G torch
[10:06:50] so yeah, not much we can do
[10:07:05] I recommend ncdu for explorations like this :)
[10:07:18] sure
[10:07:27] But it's awkward in Docker images
[10:08:07] my main goal was to avoid -dev deps etc. in the final image and get a more secure final setup; this one seems better than the current one
[10:08:27] are you planning to add a dedicated repo for the dockerfile etc. in GitLab?
[10:08:41] yes
[10:08:42] probably another one for fairseq as well
[10:08:47] (both)
[10:09:14] ack, perfect. Can you update T321781 with next steps etc. when you have time?
[10:09:46] I think the language team matter from this morning is settled for now, so I'll get to the GitLab stuff once I've merged your change (and fixed a tiny bug in the lambda), and once I've read some GitLab-at-WMF docs
[10:10:05] ack
[10:10:23] Re: image size, there are still some question marks
[10:10:36] E.g. there is both /opt/lib/python/site-packages/nvidia/cudnn/lib/libcudnn_cnn_infer.so.8 and /opt/lib/python/site-packages/torch/lib/libcudnn_cnn_infer.so.8
[10:10:43] But they are different sizes
[10:11:00] one is 438M, the other 774M.
[10:11:11] If we don't need both, that would be quite a chunk saved
[10:11:24] (and indicative that there might be more savings)
[10:12:44] Wouldn't it be "hilarious" if we don't actually need that fat Ubuntu image and could just switch to something like Alpine, and pip etc. would handle all the deps just fine?
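(Editor's note: the "avoid -dev deps in the final image" goal mentioned at 10:08:07 is the usual multi-stage build pattern. The sketch below is a minimal, hypothetical illustration only — the base image, package names and paths are placeholders, not the actual nllb200 Dockerfile.)

```
# Build stage: compilers and -dev packages live only here (illustrative).
FROM ubuntu:22.04 AS build
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential python3-dev python3-pip
COPY requirements.txt .
RUN pip3 install --prefix=/opt/python -r requirements.txt

# Runtime stage: only the built artifacts are copied over, no toolchain.
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=build /opt/python /opt/python
ENV PYTHONPATH=/opt/python/lib/python3.10/site-packages
USER nobody
```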
[10:13:11] I am not really sure, those shlibs are probably tied to some dependencies stated in the files, I wouldn't really mess with them at this stage
[10:13:24] or rather, I wouldn't know where to start trimming, to be honest
[10:14:06] the image is indeed 3.2G, maybe we could use something different
[10:14:22] but with CUDA stuff it will get big anyway
[10:14:53] Alpine is not used at the WMF so I would not try it, but ideally in the future one of our internal base images should work as well
[10:15:00] Aye. Something to explore for a rainy day
[10:15:09] in the ideal world of Nvidia drivers compatible with our open source standards
[10:15:16] Is your GH PR up-to-date?
[10:18:25] 10Machine-Learning-Team: Fix translatewiki-reverted and frwikisource-articlequality isvcs - https://phabricator.wikimedia.org/T324567 (10achou) The problem is their hosts are not set correctly. **translatewiki-reverted** According to a commit that added translatewiki-reverted model to the editquality repo htt...
[10:20:24] elukey: ^^^ re: GH PR
[10:27:31] klausman: yep, it should be
[10:27:52] Alright, thanks!
[10:36:32] elukey: I think the aws/image and fairseq stuff on GL should live under an "ML team" kind of group (I don't think it exists yet).
[10:37:31] If we create a new group, what should we name it? ml-team?
[10:40:48] klausman: maybe 'machine-learning'? What do you think?
[10:41:04] yeah, having read the GL naming/grouping policy, that is probably better
[10:56:51] 10Machine-Learning-Team: Fix translatewiki-reverted and frwikisource-articlequality isvcs - https://phabricator.wikimedia.org/T324567 (10elukey) Great summary, thanks for working on this! >>! In T324567#8462862, @achou wrote: > The problem is their hosts are not set correctly. > > **translatewiki-reverted** >...
[10:58:25] klausman: given the speed of my connection, it will take some hours to upload the new docker image to ECR - if you want to speed things up I can drop it so you can run deploy.py to update staging
[10:58:52] Yeah, we can do that
[10:59:08] ok, stopped my deploy.py run
[10:59:22] to confirm: deploy to staging from the state at the tip of my GH branch?
[11:00:39] +1
[11:01:16] Ok, starting build + deploy now
[11:28:45] Janis finished the first prototype of the new k8s 1.23 setup; IIUC we may have one of serviceops' staging clusters updated this week
[11:28:49] that is amazing
[11:28:56] oh, very nice
[11:29:58] final layer of the image is currently uploading
[11:36:04] Ok, Sagemaker endpoint for staging is updating
[11:36:11] This usually takes a few minutes
[11:42:10] going afk for lunch, ttyl!
[11:42:45] ttyl
[11:58:47] elukey: found a bug in the staging image: "/opt/ml/model/" needs to be copied to the non-build image, I think. Testing that hypothesis now
[12:20:41] Something weird is going on, can't yet tell what
[12:22:51] brb
[12:45:17] So the new staging setup is not configured properly in "some" way
[12:45:24] On startup, the _working_ one says:
[12:45:33] will load model /opt/ml/model/wikipedia-distillated archive.wikipedia-distillated-20221116-115059.mar
[12:46:04] (03PS1) 10AikoChou: revertrisk: update knowledge_integrity and set publish image tag [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/867575 (https://phabricator.wikimedia.org/T321594)
[12:46:06] The staging one seems to not have that model name set at all
[12:46:38] cf https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fsagemaker$252FEndpoints$252Fnllb200-staging/log-events/wikipedia-distillated-20221213-125751$252Fi-02fbfc45abcc98826
[12:47:30] + torchserve --start --foreground --ts-config /home/model-server/config.properties --model-store '' --models ''
[12:47:51] The ENV directives at the end of the Dockerfile seem not to work correctly.
[12:47:58] Will investigate after lunch, bbiab
[12:51:35] (03CR) 10CI reject: [V: 04-1] revertrisk: update knowledge_integrity and set publish image tag [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/867575 (https://phabricator.wikimedia.org/T321594) (owner: 10AikoChou)
[13:14:16] (03CR) 10AikoChou: "Verified failed since pipeline 'revscoring' is not defined in project's '.pipeline/config.yaml'" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/867575 (https://phabricator.wikimedia.org/T321594) (owner: 10AikoChou)
[13:44:38] `java.io.IOException: Unable to create directory /tmp/models` Of *course*
[13:47:12] elukey: so it's not forgotten: ARG statements are scoped, so if you have FROM xxx AS build and the ARG statements, and then another FROM (like we do), you need to re-do the ARG statements.
[14:03:55] 10Machine-Learning-Team, 10serviceops, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10JMeybohm) 05Open→03Resolved a:03JMeybohm
[14:09:27] klausman: I am back :)
[14:10:00] \o
[14:10:11] so one ARG is missing? I thought they were needed only at build time
[14:10:27] So /tmp is not writable for Java. I checked the image locally and it's 1777 as it should be. I suspect there may be a tmpfs mounted by AWS
[14:10:38] the model_name is needed, for example
[14:10:42] ahhh ENV MODEL_STORE=${model_store}
[14:10:49] Since torchserve uses that to download the tgz from S3
[14:10:58] model_store too, yes
[14:11:01] the above probably doesn't work, sigh
[14:11:24] the above?
[14:12:01] the ENV statement works, but since the second FROM clears the ARGs, it's now empty
[14:12:04] ENV MODEL_STORE, I meant
[14:12:08] yes yes
[14:12:16] it is empty
[14:12:29] same for model_name
[14:12:40] I've fixed that in my checkout, but now I am dealing with Java being unable to write to /tmp. Not quite sure yet what is going on there
[14:12:57] what is the error that you see?
[14:13:08] In my local images, if I inspect them with run -it and bash, I see the perms as 1777
[14:14:37] https://pastebin.mozilla.org/pQ9MoQqv
[14:15:01] The key info being `java.io.IOException: Unable to create directory /tmp/models`
[14:15:02] just to understand the actual status - you added ENV vars to the production image and republished to staging?
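(Editor's note: the ARG scoping rule described at 13:47:12 is standard Dockerfile behavior — an ARG declared before or in one stage is not visible after the next FROM unless it is re-declared there. A hedged sketch with illustrative variable names, not the exact nllb200 Dockerfile:)

```
ARG model_name
ARG model_store

FROM ubuntu:22.04 AS build
# Re-declaring picks up the value passed with --build-arg for this stage.
ARG model_name
RUN echo "building image for ${model_name}"

FROM ubuntu:22.04
# Without these re-declarations the expansions below would be empty,
# which is how MODEL_STORE/MODEL_NAME could end up as '' at runtime.
ARG model_name
ARG model_store
ENV MODEL_NAME=${model_name} \
    MODEL_STORE=${model_store}
```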
[14:15:25] I added ARG statements after the second FROM just like after the first one, and republished to staging
[14:16:20] and how can I check the full logs? Is there a quick way?
[14:16:43] will send link as DM
[14:20:16] I see, ok, it is trying to unzip the mar archive into /tmp
[14:20:55] yes, and for some reason /tmp is not writable. Currently uploading a docker image with some extra ls/echo statements to see what's going on with /tmp
[14:21:20] and I guess there is no easy way to have something like nsenter for AWS
[14:21:22] lovely
[14:21:33] If there is, I couldn't find it
[14:21:56] Maybe for EC2 or the like, but Sagemaker is a different beast
[14:24:18] yeah, I can imagine.. and can we easily check the bootstrap logs for the production instance? To see the TMP logs etc..
[14:24:30] I am checking my change to see if I missed something obvious
[14:24:52] The log link I sent is all we have. The messages from the entry point are at the very top
[14:25:15] (you may need to "load more" a couple of times, because AWS)
[14:26:42] Ok, everything uploaded, waiting for endpoint restart
[14:27:23] yes yes, I am checking in the prod instances, to see if I can see the bootstrap logs in there (to spot differences), but so far it seems that I can't get them
[14:28:53] navigating AWS logs is a pain
[14:29:22] especially since it rotates them willy-nilly, so you have to guess which file contains the startup messages from the currently-running instance
[14:30:49] ok, got new startup msgs
[14:30:55] https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fsagemaker$252FEndpoints$252Fnllb200-staging/log-events/wikipedia-distillated-20221213-150824$252Fi-05f3381e3650e993f
[14:31:05] --- State of /tmp: ---
[14:31:08] total 4
[14:31:10] drwxr-xr-x 2 root root 4096 Dec 13 14:30 .
[14:31:12] yeah, that won't work
[14:31:41] so it was working before since it ran as root
[14:31:48] yep
[14:32:07] I dunno _why_ /tmp has changed permissions between local run vs. AWS
[14:32:20] maybe AWS trying to be secure or sth?
[14:32:31] We could just pre-make the dir and set it to 777
[14:33:19] we could also think about using something like /home/model-server/tmp
[14:33:36] *If* Java respects TMPDIR or somesuch
[14:33:48] Or did you mean a symlink?
[14:34:01] nono, I meant another dir, I think it is set by torchserve
[14:34:25] the /tmp perms are very weird though
[14:34:26] https://github.com/pytorch/serve/issues/654
[14:34:48] perfect
[14:35:19] I'll update the dockerfile to make "/home/model-server/tmp/" and point TEMP to it
[14:35:59] btw, total build-and-upload time for my setup is ~18m
[14:36:15] Much better than the hours upon hours it would take you
[14:37:32] nice :)
[14:37:42] also, running as nobody unveils some weird things
[14:37:52] idea for a Christmas present for Luca: better interwebz
[14:40:31] Hmm. Now the build process is using that dir too, since TEMP is a pretty universally-used var. Maybe we could move setting it into the entrypoint script
[14:43:59] klausman: the TMP dir needs to be added only to the production image, IIUC
[14:44:09] not to the build one
[14:44:34] That's what I did
[14:44:51] what part of the build process uses TMP then?
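(Editor's note: the workaround converging around 14:32–14:35 — create a temp dir owned by the service user and point torchserve's TEMP at it — would look roughly like the sketch below in the runtime stage. The user/group and surrounding instructions are assumptions for illustration.)

```
FROM ubuntu:22.04
# /tmp is not writable for the unprivileged user on the SageMaker host,
# so give torchserve a scratch dir it owns and point TEMP at it
# (per the discussion of https://github.com/pytorch/serve/issues/654).
RUN mkdir -p /home/model-server/tmp \
    && chown -R nobody:nogroup /home/model-server
ENV TEMP=/home/model-server/tmp
USER nobody
```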
[14:46:15] also, I just saw ENV TEMP=/home/model-server/tmp in the build image
[14:46:30] and the related mkdir -p /home/model-server/tmp \
[14:46:41] I think that both need to be moved to "production"
[14:46:42] and that's it
[14:46:51] klausman: --^
[14:47:01] Huh
[14:47:20] So that's what actually broke: we missed moving those to the lower half
[14:47:38] yeah
[14:48:00] maybe let's also add comments in the dockerfile about their purpose
[14:48:02] so we'll remember
[14:49:57] yep, doing that
[14:50:33] hopefully this is the last one
[14:52:39] don't jinx it ;)
[14:53:06] 10Machine-Learning-Team, 10ContentTranslation, 10Wikimedia Enterprise: Cleanup NLLB200 docker image - https://phabricator.wikimedia.org/T324464 (10elukey) p:05Triage→03High a:03elukey
[14:54:22] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (10kevinbazira) Model evaluation for models whose training pipeline run successfully has been completed and below are the backtesting results: | | Precis...
[14:56:11] elukey: btw, one low-priority thing that I want to change is to make deploy.py a bit more user friendly (logging, error messages, build vs. upload modes).
[14:56:16] But that's for the new year
[14:56:44] I feel like the less hateful the build process is, the more likely we'll find someone to take it off our hands.
[14:56:57] definitely, especially once we decide the ownership of the service (spoiler alert - it will be ours)
[14:58:59] lalala can't hear you
[15:18:50] 10Machine-Learning-Team: Remove hack from ML's blubber files - https://phabricator.wikimedia.org/T324658 (10elukey) a:03achou
[15:26:36] 10Machine-Learning-Team, 10Patch-For-Review: Fix translatewiki-reverted and frwikisource-articlequality isvcs - https://phabricator.wikimedia.org/T324567 (10elukey) a:03achou
[15:56:02] 10Lift-Wing, 10Machine-Learning-Team: No healthy upstream and upstream connect error in Lift Wing - https://phabricator.wikimedia.org/T322196 (10elukey) Trying again to reproduce the problem and it is not as easy anymore. We have restarted a lot of pods recently, so maybe stale/wrong istio settings went away a...
[15:58:37] 10Machine-Learning-Team, 10Patch-For-Review: Fix translatewiki-reverted and frwikisource-articlequality isvcs - https://phabricator.wikimedia.org/T324567 (10elukey) @calbon We have discussed this during the team meeting, and we'd like to remove the above models from Lift Wing. One is not supported by ORES as w...
[16:15:27] Found another instance of /code/ in the handler :-/
[16:15:39] So we'll make another roundtrip
[16:18:13] ah weird :( where?
[16:19:09] loading the dict*.txt files, it uses the data member of the arg_overrides dict (line 148 ish)
[16:20:04] ah snap
[16:39:22] It works \o/
[16:39:50] wowww
[16:40:28] when you have a moment, can you update the dockerfile?
[16:40:35] (so I can check the diffs, curious)
[16:40:49] at this point we can ask CT to test staging again
[16:41:21] Writing a commit msg as we speak :)
[16:42:05] Changes committed to my GH branch
[16:43:08] https://pastebin.mozilla.org/smDOWe0H/raw
[16:43:20] The "????" is because the Mozilla pastebin does not like ZH ideograms
[16:43:54] ast=Asturian is closely related to Spanish, and that bit LGTM
[16:46:24] nice :)
[16:46:46] the only doubt I have is whether we need the env variable inside the docker entrypoint (it should be picked up from the Dockerfile in theory)
[16:49:44] 10Machine-Learning-Team, 10ContentTranslation, 10Wikimedia Enterprise: Cleanup NLLB200 docker image - https://phabricator.wikimedia.org/T324464 (10klausman) We did some more refactoring/improving of the Docker image today, and have done basic tests. The staging endpoint now uses the new image, and it looks l...
[16:50:13] elukey: Oh, yeah, good point, I forgot to drop it from the script. Eh, "defense in depth" :)
[16:50:28] I've also updated the bug (see above)
[16:51:06] I'm heading out now. See y'all tomorrow!
[16:51:10] o/
[17:25:59] going afk as well, have a good rest of the day folks
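(Editor's note: on the 16:46:46 doubt — ENV values set in the Dockerfile are stored in the image config and inherited by the container's process environment, so the ENTRYPOINT sees them without the entrypoint script re-exporting anything. A tiny hedged sketch with placeholder values:)

```
FROM ubuntu:22.04
# Inherited by every process started in the container, ENTRYPOINT included.
ENV MODEL_NAME=wikipedia-distillated
# The shell prints the value above; no export needed in an entrypoint script.
ENTRYPOINT ["/bin/sh", "-c", "echo MODEL_NAME=${MODEL_NAME}"]
```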