[01:22:42] artificial-intelligence, MediaWiki-extension-requests, Stewards-and-global-tools, Outreachy (Round-15), User-Tgr: Automatically detect spambot registration using machine learning (like invisible reCAPTCHA) - https://phabricator.wikimedia.org/T158909 (Tgr)
[06:24:41] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (kevinbazira)
[06:40:46] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (kevinbazira) 21/22 models were trained successfully in the 8th round of wikis. The Western Frisian Wikipedia (fywiki) returned the error in the screen...
[08:03:04] (CR) Ilias Sarantopoulos: blubber: create universal revscoring image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[08:24:29] hello folks :)
[08:26:50] isaranto: o/ I agree on having a more consolidated set of rules to follow for our Python code, but let's start with a proposal in a task so we can discuss something concrete. The CI / git-hook is a good one; we already use "black" to have a minimal set of rules to follow, but we could add more (for other libraries like spicerack we have a ton more, see
[08:26:55] https://gerrit.wikimedia.org/g/operations/software/spicerack/+/refs/heads/master)
[08:27:25] for the design patterns, same thing: let's start with a proposal about code refactoring etc. so we can have an example and work on it
[08:27:51] otherwise I feel that we get together in a meeting, we agree on doing something "better", but we don't really take any decision
[08:41:43] (CR) Elukey: "Added some notes, great start :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[08:50:47] Machine-Learning-Team, Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (isarantopoulos) Results for MP for drafttopic with the increased resources (4GB memory instead of 2) - they don't seem to be any better | | model | quantile | no_r...
[08:55:26] Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Cleanup NLLB200 docker image - https://phabricator.wikimedia.org/T324464 (elukey)
[09:10:59] Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Cleanup NLLB200 docker image - https://phabricator.wikimedia.org/T324464 (elukey) The new image looks better in terms of size: ` wikipedia-distillated-20221207-105531 latest...
[10:16:36] wow, uploading a new image from my laptop to ECR takes ages
[10:17:24] the upload seems heavily throttled (I expected some, but not this much)
[10:21:27] Machine-Learning-Team, Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (elukey) One thing to check - In `kubectl describe pod etc..` of staging pods I noticed, last week before the increased memory size change, that some OOMs were registered. Let's...
[10:29:06] elukey: what kinda bw do you see?
[10:29:11] also \o :)
[10:30:43] it seems some KB/s; I don't have a lot of upload bw, but this seems way slower than expected
[10:31:00] Yeah, that sounds broken. I don't remember uploads taking that long.
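[Editor's note] The CI / git-hook idea discussed above (black plus extra lint rules, enforced before commit) could be expressed as a pre-commit configuration. This is only a sketch: the repo names are the upstream black/flake8 hook repos, and the `rev` pins are illustrative placeholders, not choices from the conversation.

```yaml
# .pre-commit-config.yaml -- hypothetical sketch; revs are placeholders
repos:
  - repo: https://github.com/psf/black
    rev: 22.12.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
```

With this in place, `pre-commit install` wires the checks into the local git hook, and the same config can be run in CI.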
[10:34:46] (CR) Hashar: "recheck after https://gerrit.wikimedia.org/r/c/integration/config/+/866570" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[10:36:01] I opened https://github.com/klausman/stopes/pull/1 in the meantime
[10:36:10] we can merge if/when staging works
[10:36:23] ack
[10:40:31] (CR) Klausman: blubber: create universal revscoring image (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[10:56:18] elukey: I am also investigating whether we actually still need Java in the image. I somehow doubt it.
[10:58:19] klausman: I am 100% sure we don't; it was probably convenient for testing
[10:58:37] everything runs as Python and I don't see a trace of a JVM running
[10:58:41] I also vaguely remember it being a Java wrapper before it became torchserve
[10:59:04] Also:
[10:59:06] nobody@9a392431c414:/home/model-server$ find / -iname \*.java 2>/dev/null
[10:59:08] nobody@9a392431c414:/home/model-server$
[10:59:31] so we can ditch openjdk from the apt install list as well
[11:02:13] yep yep, it is currently not a big deal since it is only installed in the build image (that is discarded at the end)
[11:03:23] I never got the image to work properly locally ('/opt/ml/model/wikipedia-distillated/*' being empty), did you manage to do so?
[11:03:53] nope, same thing
[11:04:12] I guessed that the dir is mounted when running the image on AWS
[11:04:21] but I didn't investigate deploy.py
[11:04:47] Yeah, I have the same theory
[11:33:10] * elukey lunch!
[11:54:08] That sounds like a splendid idea
[14:41:14] elukey: I played around with PR/1 and it seems torchserve is not showing up in the image. Whereas the older images have that /opt/ml/model problem, the image that I just built fails with a different error:
[14:41:16] /usr/local/bin/dockerd-entrypoint.sh: line 19: torchserve: command not found
[14:41:18] klausman: o/ so deploy.py almost finished (it is uploading tarballs to S3, same bw issue, but the staging endpoint is up)
[14:41:29] Talk about timing :)
[14:41:55] that is strange, staging seems to work
[14:41:56] might also be a PATH issue
[14:42:01] I tested the endpoint just now
[14:42:19] I might also have broken my local checkout, so if it works on staging...
[14:42:35] weird, lemme recheck
[14:42:41] in theory ENV PATH="/opt/lib/python/site-packages/bin:${PATH}" PYTHONPATH="/opt/lib/python/site-packages" should take care of everything
[14:43:07] the AWS console says the staging endpoint was last updated Nov 24
[14:43:32] And the endpoint configs are from the 24th as well
[14:44:11] ("Sagemaker > Inference > Endpoints" and "...> Endpoint Configurations")
[14:44:26] ahhh, so deploy.py does its magic after uploading the tarballs?
[14:44:35] Yes
[14:44:53] It sorta has to, since it updates the endpoint config which points at the new tarballs and images
[14:45:02] By default it only touches staging
[14:45:11] all right, then it may be an issue
[14:45:20] I don't see /opt/lib/python/site-packages/bin
[14:45:24] in the image, indeed
[14:45:33] which is where I think torchserve should be
[14:46:17] Yeah, there isn't even /opt/lib
[14:46:40] But it wasn't there in the older images either
[14:46:54] root@555748de1740:/home/model-server/code# which torchserve
[14:46:56] /usr/local/bin/torchserve
[14:47:04] (this is wikipedia-distillated-20221124-163631:latest)
[14:47:06] nono, /opt/lib/python/site-packages is there
[14:47:13] ah, right
[14:47:22] It wasn't there in the old image
[14:47:34] yes, it is more or less what blubber does
[14:47:39] I presume this is due to moving to a --user install?
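[Editor's note] The layout being debugged above — a build stage that pip-installs into a self-contained prefix, and a production stage that copies only `/opt` and wires up `PATH`/`PYTHONPATH` — could be sketched roughly as the following Dockerfile fragment. The prefix paths and the `ENV` line are quoted from the conversation; the base image tags and the apt bootstrap are illustrative assumptions, not the actual file.

```Dockerfile
# build stage: pip installs everything under one self-contained prefix
# (base image tag is a placeholder; pip must be bootstrapped first, elided here)
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 AS build
RUN pip3 install --target /opt/lib/python/site-packages torchserve

# production stage: copy only the prefix, leaving build tools behind
FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04
COPY --chown=65534:65534 --from=build \
    /opt/lib/python/site-packages /opt/lib/python/site-packages
# make both console scripts and modules from the prefix visible
ENV PATH="/opt/lib/python/site-packages/bin:${PATH}" \
    PYTHONPATH="/opt/lib/python/site-packages"
```

The point of `--target` over `--user` is that a single directory tree holds everything to copy forward, which keeps the runtime image free of compilers and headers.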
[14:47:52] I used --target, but yes, this is the idea
[14:47:57] so we can copy only /opt
[14:48:00] and leave the rest out
[14:48:07] Aaah, now I get it :)
[14:49:21] I think it is only a matter of copying the torchserve file to the production image
[14:51:23] I am wondering where the script comes from though
[14:53:23] Wouldn't it come in with the pip install of torchserve?
[14:53:46] yeah, but under /usr/local/bin? never seen something like that
[14:54:39] well, let's try; going to rebuild the image and see
[14:54:43] talk to you in 20 mins :D
[14:55:11] 20m is very long. With your changes, my laptop builds the whole thing in ~7m (no uploads)
[14:55:26] Do you have a v6 route? maybe something is b0rk there.
[14:56:37] I never really timed it, but it takes a ton of time when downloading the gigantic PyPI packages (some of them are ~2G in size)
[14:56:51] so yeah, if you want to give it a go
[14:57:22] the PR should be updated
[14:57:36] for example
[14:57:37] Collecting torch==1.12.1+cu113 Downloading https://download.pytorch.org/whl/cu113/torch-1.12.1%2Bcu113-cp38-cp38-linux_x86_64.whl (1837.7 MB)
[14:57:43] this is sooo long
[14:58:57] Mh. I may be privileged with gigabit Internet
[15:00:08] Or it's another symptom of whatever is making the uploads slow
[15:00:20] yeah, I don't have gigabit :D
[15:00:43] Do we have a datacenter machine we could be doing this on?
[15:02:07] ideally build2001, but this is not really open-source/production material
[15:02:18] Aye
[15:04:18] Machine-Learning-Team: Upgrade ML clusters to Kubernetes 1.23 - https://phabricator.wikimedia.org/T324542 (elukey)
[15:08:15] Save you some time:
[15:08:18] DEBUG:build.docker:COPY failed: stat usr/local/bin/torchserve: file does not exist
[15:08:26] elukey: ^^^
[15:09:29] lovely
[15:15:02] I am separating out the torchserve install and will run it with --verbose, maybe there will be some useful output
[15:15:53] !vi
[15:15:55] oops :)
[15:16:07] yeah, I also see #!/usr/bin/python in those files, so we need the alias for python -> python3 as well
[15:16:56] I think that's the python-is-python3 package
[15:17:29] yeah, could be an option, lemme add it
[15:25:31] it should be https://github.com/pytorch/serve/blob/master/setup.py#L183
[15:27:04] DEBUG:build.docker: changing mode of /home/model-server/tmp/pip-target-hcg8u5zq/bin/torchserve to 755
[15:27:30] so it is created and +x'd during the install, but I'm not sure yet if and when it gets ignored or removed
[15:28:52] Hm, no context
[15:31:40] https://stackoverflow.com/questions/26476607/how-do-you-specify-bin-directory-for-pip-install-with-target-option-enabled
[15:31:48] so py3.9 does the right thing
[15:31:56] but we are on 3.8, of course
[15:32:43] should we just try with py3.9?
[15:32:49] Absolutely
[15:32:56] I was about to suggest that :)
[15:34:17] running with 3.9 now
[15:34:53] updated the PR
[15:34:55] as well
[15:40:43] root@ecf542b33df6:/# ls /opt/lib/python/site-packages/bin/
[15:40:43] futurize pasteurize torchserve wheel
[15:40:56] klausman: --^ (tried manually with pip install etc..)
[15:41:06] so py3.9 should do the right thing
[15:41:28] Neat. I got another failure, but I was messing with images, so maybe I caused it. lemme rerun
[15:41:39] Error was:
[15:41:42] DEBUG:build.docker: fairseq/clib/libbleu/module.cpp:9:10: fatal error: Python.h: No such file or directory
[15:41:44] DEBUG:build.docker: 9 | #include <Python.h>
[15:41:46] DEBUG:build.docker: | ^~~~~~~~~~
[15:41:48] DEBUG:build.docker: compilation terminated.
[15:41:50] DEBUG:build.docker: error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[15:42:04] yeah, that is probably python3.9-dev missing
[15:42:25] I have updated the PR if you want to test that
[15:42:36] yeah, pulled your changes and running them now
[15:46:28] Fairseq build still failed, same error :-/
[15:47:40] ah, this is from libpython-dev
[15:48:09] Or more precisely, libpython3.9-dev
[15:48:45] I guess that used to be part of python3.x-dev for x<9
[15:49:43] could be, yes!
[15:51:04] running with libpython3.9-dev in the package list now
[15:54:06] nope, that wasn't it.
[15:55:12] I think fairseq is hardcoding 3.8 paths
[15:58:55] there is nothing that mentions it in https://github.com/facebookresearch/fairseq/blob/nllb/setup.py
[15:58:58] weird
[15:59:06] I think this may need sudo update-alternatives --set python /usr/bin/python3.9
[15:59:21] I suspect 3.8 is still installed and the default
[16:00:56] so I have installed python3.9 and python3.9-dev on a test docker image
[16:00:57] root@8932fc01b77b:/# dpkg -S Python.h
[16:00:57] libpython3.9-dev:amd64: /usr/include/python3.9/Python.h
[16:01:46] the failing include is for Python.h, so in theory update-alternatives shouldn't really be concerned here
[16:01:52] Yes, the file is there, but I still see mentions of python3.8 in the torchserve build log
[16:02:33] fg
[16:02:39] gah :)
[16:03:05] interesting, in my container if I run python or python3 it doesn't really run
[16:03:25] there is only python3.9
[16:03:28] I don't see 3.8 installed
[16:03:49] I'm not sure yet if that's it. Waiting for this run to get a c&p of the 3.8 mentions
[16:04:39] ok, so https://github.com/klausman/fairseq/blob/wmfnllb/setup.py#L201 makes more sense
[16:04:52] aaah
[16:05:19] Love that it doesn't mention 3.9 or 3.10, but also doesn't restrict away from them
[16:05:40] AIUI the classifiers are metadata only, not actually informing dependency resolution
[16:05:49] but https://github.com/facebookresearch/fairseq#requirements-and-installation doesn't seem to suggest that 3.9+ won't work
[16:06:26] aha, python3-pip pulls in 3.8 (or rather: uninstalling *python*3.8* removed pip3...)
[16:07:53] I don't think there is a py39-using pip3 package (apt) for this version of Ubuntu
[16:08:09] uffffff
[16:08:10] yeah
[16:08:18] Let me try a horrible hack
[16:09:01] the solution is probably to use get-pip.py
[16:11:10] Trying that now
[16:11:39] What's the idiomatic way of checking the checksum of a wget'd file?
[16:11:45] fg
[16:11:52] gah, window focus hates me today
[16:13:20] there is also 11.8.0-cudnn8-runtime-ubuntu22.04
[16:13:38] what do you mean?
[16:13:56] it is a more up-to-date base image version
[16:14:08] I am testing it, pretty sure it runs a more decent version of python by default
[16:14:51] it ships with 3.10 afaics
[16:14:56] ack, I'll try get-pip in parallel
[16:18:01] let's try get-pip.py first; if it works we can probably use it for the moment
[16:18:20] it is less invasive than changing the base image, even if in theory it should only be better
[16:18:40] so python3.9 get-pip.py first, that should do the right thing
[16:19:06] I can test 11.8.0-cudnn8-runtime-ubuntu22.04 too
[16:21:05] 22.04 seems comparable to Debian bookworm
[16:22:11] and in https://hub.docker.com/r/nvidia/cuda/tags?page=1&name=cudnn8-runtime-ubuntu2 I can see only 20.x and 22.x
[16:23:56] Something weird is going on
[16:24:12] get-pip should only need a py39 base install, right?
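[Editor's note] On the checksum question asked above: a common pattern is to pin the expected digest as a constant and let `sha256sum -c` verify the download before it is ever executed. A minimal sketch — the download is simulated with a local stand-in file, and in a real Dockerfile `EXPECTED` would be the digest published for get-pip.py, not computed on the spot:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
# stand-in for: wget https://bootstrap.pypa.io/get-pip.py
printf 'print("get-pip stand-in")\n' > get-pip.py
# in real use, EXPECTED is a hard-coded, pinned constant, not derived from the file
EXPECTED="$(sha256sum get-pip.py | cut -d' ' -f1)"
# sha256sum -c exits non-zero (and set -e aborts) if the digest does not match
echo "${EXPECTED}  get-pip.py" | sha256sum -c -
```

Because the script runs under `set -e`, a tampered or truncated download fails the build instead of being fed to the interpreter.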
[16:24:47] in theory, yes
[16:24:54] It keeps bombing with:
[16:24:57] ModuleNotFoundError: No module named 'distutils.cmd'
[16:25:17] (I think. Currently narrowing down the actual failing command)
[16:25:43] fg
[16:25:48] grrr
[16:25:58] ah yes, I think you need python3.9-distutils
[16:26:17] klausman: --^
[16:26:25] ack
[16:28:25] (CR) AikoChou: blubber: create universal revscoring image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[16:28:56] ok, making progress
[16:38:18] trying to build with ubuntu 22.04 in the meantime
[16:39:37] Oh, build complete, let's see what running it does
[16:40:01] /usr/local/bin/dockerd-entrypoint.sh: line 19: torchserve: command not found
[16:40:04] oh noooo
[16:41:07] in theory there should be a bin dir under /opt/lib/...
[16:41:11] can you check?
[16:41:58] there is only /opt/lib/python
[16:42:05] (and then site-packages)
[16:42:14] yeah, under site-packages I meant
[16:42:35] /opt/lib/python/site-packages/bin
[16:42:57] yes!
[16:43:06] but only torchrun there, no torchserve
[16:44:40] very weird
[16:47:58] the build with ubuntu 22.04 fails for some deps, sigh
[16:48:01] Um. I have a very bad feeling about this.
[16:48:21] about what?
[16:48:23] https://pytorch.org/serve/torchserve_on_win_native.html mentions Java
[16:48:45] I know it's bizarre, but.
[16:49:14] but that is Windows, though
[16:49:31] I'm gonna try adding the jdk back, just to be sure
[16:49:58] maybe the presence of the JDK triggers the build of more components or sth
[16:50:17] I doubt it, but worth checking
[16:50:28] If only to put my mind at ease :)
[16:50:29] in my test earlier on, torchserve was created under the bin dir
[16:50:53] maybe it is the pip3 install --upgrade for fairseq
[16:53:58] But how would that install/upgrade *remove* torchserve?
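[Editor's note] The `distutils.cmd` failure above is the py3.9-on-Ubuntu-20.04 gotcha: the interpreter package alone does not ship distutils, which get-pip.py needs. The fixes identified in this part of the session might be collected in one build-stage fragment like the following — a sketch under the assumption of an Ubuntu 20.04 CUDA base image; the package list mirrors what the conversation converged on:

```Dockerfile
# py3.9 toolchain on an Ubuntu 20.04 base (sketch):
#   python3.9-dev       -> provides Python.h for C extension builds (fairseq)
#   python3.9-distutils -> required for get-pip.py to run at all
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.9 python3.9-dev python3.9-distutils wget ca-certificates \
    && wget https://bootstrap.pypa.io/get-pip.py \
    && python3.9 get-pip.py
```

Bootstrapping pip via get-pip.py sidesteps apt's python3-pip, which would drag python3.8 back in as a dependency.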
[16:54:13] otoh, that's not much more bizarre than my jdk hypothesis
[16:56:16] Yeah, the JDK bit wasn't it
[16:56:21] I suspect that --upgrade may play with the bin dir
[16:57:31] if I install torchserve manually following the command in the Dockerfile, I get the bin dir populated
[16:57:32] I'm doing an ls /opt/lib/python/site-packages/bin/ between the rest of the commands and the fairseq install now
[16:57:38] (also SRE mtg in 3m)
[17:07:49] elukey: you were right. Something in the fairseq build breaks the bin dir
[17:08:21] https://phabricator.wikimedia.org/P42674
[17:09:07] klausman: have you tried without --upgrade, by any chance?
[17:10:35] not yet
[17:11:34] Having Google Meet open also slows down the build massively. I should've done this on my main workstation
[17:17:50] Ok, dropping --upgrade helps with the bin clobbering, but we'll have to see if something else breaks
[17:18:29] At worst, we could re-do the torchserve install after the fairseq build?
[17:19:52] DEBUG:build.docker:Step 23/33 : COPY --chown=65534:65534 --from=build ["/opt/lib/python/site-packages", "/opt/lib/python/site-packages"]
[17:19:53] DEBUG:build.docker:invalid from flag value build: pull access denied for build, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
[17:19:55] ????
[17:20:53] never seen it, seems a local issue?
[17:21:59] yeah, rerunning it
[17:30:22] nope, reproducible. I think it's a permission issue somewhere below "/opt/lib/python/site-packages"
[17:31:45] I am trying to build the image as well, to see if I get the same error
[17:32:24] it is a little strange; I don't think that removing --upgrade created this
[17:44:57] DEBUG:build.docker:WARNING: Target directory /opt/lib/python/site-packages/bin already exists. Specify --upgrade to force replacement.
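[Editor's note] The warning quoted above explains the clobbering: when `pip install --target` finds that the target's `bin/` already exists, passing `--upgrade` makes pip replace that directory wholesale, discarding scripts (like `torchserve`) left there by earlier installs; without `--upgrade`, pip warns and leaves the existing directory in place. A sketch of the install order that avoids the problem, using the prefix and packages from the discussion (the `./fairseq` path is illustrative):

```Dockerfile
# first install populates /opt/lib/python/site-packages/bin (torchserve etc.);
# the later fairseq install must NOT pass --upgrade, or pip replaces the whole
# target bin/ directory and the torchserve entry point disappears
RUN pip3 install --target /opt/lib/python/site-packages torchserve \
    && pip3 install --target /opt/lib/python/site-packages ./fairseq
```

The alternative the conversation floats — re-running the torchserve install after the fairseq build — would also work, at the cost of an extra layer.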
[17:45:10] ^^^ I think that is the culprit for the missing torchserve stuff
[17:45:23] (this message appears without --upgrade)
[17:46:07] yeah, definitely!
[17:46:50] Could you verify that dropping --upgrade works fine for you? I can reproduce the "pull access denied" failure if I do that
[17:51:54] yep, it is running now
[17:59:40] klausman: works for me
[18:00:06] yeah, nfc what broke there. my run completed as well now
[18:00:15] + torchserve --start --foreground --ts-config /home/model-server/config.properties --model-store '' --models ''
[18:00:18] java not found, please make sure JAVA_HOME is set properly.
[18:00:21] ahahahahahah
[18:00:21] * elukey cries in a corner
[18:00:36] * elukey adds java
[18:01:37] but torchserve works :)
[18:01:51] doing a build with java and no --upgrade now
[18:02:02] last try before I'm done for today :)
[18:05:13] I added openjdk-11-jre-headless to the production image's apt installs
[18:05:38] and updated the PR
[18:06:43] ah nice, docker build is only redoing production
[18:06:47] I should get the results in a bit
[18:08:47] there are some perms issues to sort out
[18:08:48] Could not create directory /home/model-server/logs
[18:09:05] weird.
[18:09:14] like && mkdir -p /home/model-server/logs \
[18:09:19] I just added openjdk back, but on run -it, I still get a "java not found"
[18:09:47] did you add it to the production bit? (not the build one)
[18:09:54] oops :)
[18:16:29] also, python-is-python3 brings in python3.8
[18:16:51] I'll add update-alternatives --install /usr/bin/python python /usr/bin/python3.9 1
[18:17:07] ack
[18:21:20] ok, now it works; I can see the model server running
[18:21:42] sadly, the image is around 9G now
[18:21:49] (I have updated the PR)
[18:22:14] yay, I have a running image. boo re: 9G
[18:22:21] I'll do the last checks tomorrow morning, then we can retry staging :)
[18:22:41] aye, capitano. have a nice evening!
[18:23:11] you too!
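[Editor's note] The final set of production-image tweaks from this part of the session — the headless JRE that torchserve's frontend needs at runtime, a `python` → `python3.9` alias that avoids `python-is-python3` dragging python3.8 back in, and the writable logs directory — might be gathered as one fragment. This is a sketch, not the actual Dockerfile; the package name, alternatives command, and `mkdir` are all quoted from the conversation, while the `chown` uid/gid is assumed from the `COPY --chown=65534:65534` seen earlier:

```Dockerfile
# production-image tweaks (sketch):
RUN apt-get update && apt-get install -y --no-install-recommends \
        openjdk-11-jre-headless \
    # alias python -> python3.9 without pulling in 3.8 via python-is-python3
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.9 1 \
    # torchserve writes logs here; must exist and be writable by the run user
    && mkdir -p /home/model-server/logs \
    && chown 65534:65534 /home/model-server/logs
```

Keeping these in the production stage (not the build stage) matters: as the log shows, adding openjdk only to the build image still leaves `java not found` at runtime.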
[18:23:15] * elukey afk
[18:23:25] have a nice rest of the day folks
[18:24:08] I'm making another build with ncdu, to see where all that disk space went
[18:24:15] cya tomorrow
[18:25:24] java is surely a big one (sad me)
[18:33:33] not really. CUDA is
[18:35:00] the jdk is 164M