[06:27:57] mooorning
[06:43:16] Machine-Learning-Team: Run load tests for the rec-api-ng and update production resources to meet expected load - https://phabricator.wikimedia.org/T365554 (kevinbazira) NEW
[06:45:57] Machine-Learning-Team, Language-Team, Epic: Migrate Content Translation Recommendation API to Lift Wing - https://phabricator.wikimedia.org/T308164#9819799 (kevinbazira) Thank you for sharing the estimates @Pginer-WMF, we are going to {T365554}
[08:48:49] good morning!
[08:54:47] hey Aiko!
[08:58:44] * isaranto afk - doc appt and lunch
[09:55:07] * klausman early lunch
[10:28:24] klausman: o/ could you take care of publishing this image when you're back? https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1032725
[10:28:57] or sometime later anyway
[10:56:14] isaranto: do you want to wait for Luca's reply on the open comment thread?
[10:57:13] ok!
[10:58:26] I'll do the other bits (local download of your patch and a test build) now, so it should be quick once that's decided
[11:01:01] I'll also give upx another shot on the big .so's you listed
[11:01:17] ack, thanks!
[11:37:25] Machine-Learning-Team, Automoderator, Moderator-Tools-Team: Use multilingual revert risk model in Automoderator on supported wikis - https://phabricator.wikimedia.org/T365581 (Samwalton9-WMF) NEW
[13:10:03] (CR) Ilias Sarantopoulos: [C:+1] logo-detection: process image objects instead of image URLs (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031590 (https://phabricator.wikimedia.org/T363506) (owner: Kevin Bazira)
[13:10:44] the patch --^ is a merged one. just resolved a comment so it would clear from my gerrit dashboard
[13:14:45] elukey: o/ is it ok if we proceed with https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1032725 ? Tobias can publish the image if you are ok with it
[13:17:26] yep!
[13:18:04] Alright, I shall proceed
[13:21:12] hello
[13:21:13] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#9821016 (Jhancock.wm) Open→Resolved no new errors.
[13:21:39] hi Chris!
[13:21:43] thanks both!
[14:28:50] folks I think that we are ready to reimage ml-staging2001
[14:31:26] the packages that we need should be in bookworm-wikimedia
[14:37:16] shall we upgrade?
[14:37:56] klausman, isaranto --^
[14:48:40] No objections from me
[14:50:32] the only doubt that I have is about the draining, since we don't have capacity on the other node
[14:50:42] +1 from me as well
[14:51:16] trying to drain, let's see if it works
[14:56:13] done!
[14:58:12] aaand started
[14:58:58] Machine-Learning-Team, Patch-For-Review: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191#9821836 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ml-staging2001.codfw.wmnet with OS bookworm
[14:59:40] I am having an odd issue with the rocm/torch23 image. It built fine, but it has been hanging at "publishing" for about 1.5h now. The process is still sending data around, but I don't know if it will ever complete
[15:00:14] ah yes it takes ages
[15:00:27] Alright, I'll keep staring at the logfile :)
[15:00:31] leave it running for a while, I think we are throttled
[15:10:00] Mh, I think this is a recurring failure.
from `journalctl -xeu docker` on the build host:
[15:10:07] May 22 14:41:47 build2001 dockerd[624]: time="2024-05-22T14:41:47.466163845Z" level=error msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error"
[15:11:39] you can then check in the nginx logs on the registry nodes
[15:11:48] if it failed to push, you'll find a note in there
[15:11:49] yeah, in the process of doing so
[15:12:21] ENOSPC, the image is too large for the tmpfs
[15:12:41] 2024/05/22 15:03:07 [crit] 27529#27529: *247649 pwrite() "/var/lib/nginx/body/0000003174" failed (28: No space left on device), client: 10.192.32.77, server: , request: "PATCH /v2/amd-pytorch23/blobs/ ...
[15:13:25] I'll ctrl-c the build, this won't ever complete
[15:13:39] something is off though, we uploaded the llm image with rocm 6.0
[15:14:15] I straced the uploading process and it mentioned a total size of 15GB, but of course I don't know if that's compressed or not
[15:14:48] nono it is not, the layers are gzipped in transit
[15:15:22] yeah, it almost completed, too, so I figured it must be only a bit over 4G in the relevant dimension
[15:16:30] I'll make another local build, see if I can find a few dozen MB or sth we can save on
[15:17:27] ouch...
[15:17:56] I am still very puzzled, the last llm image on the docker registry has rocm 6
[15:18:04] and we install more python packages
[15:18:11] so either we have some difference in there
[15:18:17] or something extra has been added
[15:18:30] I'll do some spelunking with ncdu etc
[15:19:02] yes it is weird indeed. on my side I'll look at the 2 images (llm vs pytorch base)
[15:27:28] What is ~/.cache/pip used for? It's 2.1G in the image. Is that the equivalent of a venv and we need to keep it?
[15:29:37] We might want to add --no-cache-dir to all pip calls
[15:31:58] this is a good point, almost surely blubber does --no-cache-dir or similar
[15:32:13] trying a build with that now
[15:33:53] that should be it (or at least one of the issues). the llm image is ~1GB smaller
[15:37:43] I totally missed that...
[15:41:39] we all missed that :)
[15:41:46] even in previous images I think
[15:41:50] I'm rebuilding it now to check it and I'm going to open a new patch. shall I use the same version or a new one in the changelog?
[15:42:26] Machine-Learning-Team: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191#9822127 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ml-staging2001.codfw.wmnet with OS bookworm completed: - ml-staging2001 (**PASS**)...
[15:42:54] ml-staging2001 is ready, new pods are spinning up..
[15:42:56] fingers crossed
[15:43:57] 🥁
[15:44:43] did a build with the flag mentioned and the .cache dir is nowhere near the top users anymore
[15:44:58] (it's actually gone entirely)
[15:45:10] nice find Tobias!
[15:45:10] du / now says 13246348
[15:47:13] isaranto: do you want to make the patch for the prodimage repo or should I?
[15:47:22] (for the flag addition, that is)
[15:47:30] I can do it!
[15:47:46] roger! will review as soon as I see it
[15:48:10] shall I just use the same version so that we never push the previous version anyway?
[15:48:16] I mean use the same changelog
[15:49:16] I dunno if that will confuse the buildhost
[15:49:36] isaranto: nllb is up and doesn't show the amd init errors
[15:49:39] \o/
[15:49:43] oh my! docker image is 13.5GB and gzipped one is 2.5GB (vs 4.8)
[15:49:50] \o/\o/\o/\o/\o/\o/\o/\o/\o/\o/
[15:52:14] yayyyy \o/
[15:52:15] btw, ml-staging is also running with the new config, namely no rocm packages installed
[15:52:19] klausman: --^
[15:52:42] Roger!
[15:52:43] so if nllb works fine, we only need to worry about rocm for training nodes
[15:53:00] Nice work, both you and Ilias :)
[15:53:21] Machine-Learning-Team: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191#9822160 (elukey) Everything seems to work as expected, the ROCm packages are not needed!
[15:53:32] great work figuring out that gnarly bug
[15:53:52] yeah, eBPF messing with permissions is subtle
[15:58:10] here it is https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1034975
[15:59:36] on it
[16:01:11] Machine-Learning-Team, Patch-For-Review: Update Pytorch base image to 2.3.0 - https://phabricator.wikimedia.org/T365166#9822198 (isarantopoulos) We had forgotten the `.pip` dir inside the docker image which increased its size by more than 2GB (the size of the packages since torch compressed is really big...
[16:02:51] elukey: is it ok if I remove the gpu from nllb deployment (and the nllb deployment since we don't need it) so I can start messing around with mistral?
[16:03:34] actually I'll do it in the morning as I'm wrapping up for today
[16:03:45] but I'm super excited folks 🎉
[16:03:58] yes! great stuff happening today :)
[16:04:06] have a nice evening, Ilias
[16:04:47] elukey: unless you have any objections, I'll approve&merge the new patch (see above) and run the build+publish
[16:06:00] go ahead :)
[16:06:07] isaranto: +1 yes
[16:06:15] maybe check if it works first
[16:06:16] elukey: did you change anything else when you deployed nllb or should everything just work straightforwardly now? I'm gonna start deploying stuff in the morning so wanted to ask
[16:06:18] so we have a datapoint
[16:06:28] nope changed nothing
[16:06:32] super
[16:06:41] thanks Luca <3
[16:08:19] tomorrow I'll wake up with so much energy for work! logging off o/
[16:09:23] that's great :D bye Ilias!
[16:21:19] upload progress now also shows a total that's in line with the savings: 13G and a bit
[16:43:20] Build and upload done \o/
[16:48:17] heading out now, have a nice evening, everyone
[16:50:35] Machine-Learning-Team, Patch-For-Review: Update Pytorch base image to 2.3.0 - https://phabricator.wikimedia.org/T365166#9822542 (klausman) `lang=plain # build-production-images --select '*pytorch23*' == Step 0: scanning /srv/images/production-images/images == Will build the following images: * docker-reg...
[17:06:20] o/
[17:24:01] (PS1) AikoChou: revertrisk: modify the response to dict type in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1035012 (https://phabricator.wikimedia.org/T358744)
[18:30:35] logging off today o/
[18:39:34] night aiko!
[20:59:58] (PS1) Rockingpenny4: Adds article topic model to ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132)
[21:01:11] (CR) CI reject: [V:-1] Adds article topic model to ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132) (owner: Rockingpenny4)
[21:12:13] (PS2) Rockingpenny4: Adds article topic model to ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132)
[21:14:11] (CR) CI reject: [V:-1] Adds article topic model to ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/1035044 (https://phabricator.wikimedia.org/T218132) (owner: Rockingpenny4)