[07:50:21] Machine-Learning-Team, ORES, GitLab (Project Migration): Migrate ORES/Revscoring/etc. repos to Gitlab or Gerrit - https://phabricator.wikimedia.org/T264651 (elukey) Open→Declined Setting this to Declined for the moment, please re-open if needed :)
[07:54:35] morning folks
[08:00:58] moooorning!
[09:05:27] klausman: o/ I have created a first attempt at a multi-stage build in https://github.com/elukey/stopes/commit/4f7f30fdd5650efa26b27b97ab3a00863b9d6c02
[09:05:50] what is the best way to test it? Can I use deploy.py? I am wondering if it is ok to push to staging etc..
[09:06:00] (back in a bit)
[09:10:51] I think deploy.py currently only pushes to staging unless you use --prod
[09:11:32] Will take a look at the change in a moment
[09:15:33] Added one comment. Overall, LGTM
[09:34:30] klausman: thanks! So do you think that we should keep that check in place?
[09:35:09] Maybe for the first few tries, just in case. But I think once the new scheme of building the image is proven to work, we can ditch it
[09:35:27] ack, updating it
[10:15:52] klausman: I am wondering one thing - should we just use blubber's syntax to generate the Dockerfile? I was trying to port to the image the same standards that we use (there are some tricky things about nobody like nonexistent home etc..)
[10:16:06] but then I realized that we could generate the Dockerfile when needed
[10:16:11] and use the same standards as prod
[10:16:15] (if it works
[10:16:18] does it sound ok?
[10:16:40] I have no objections at all. I'm just a little more versed in naked Dockerfiles than Blubber, but that should not stop us
[10:17:07] the good thing about blubber is that it forces us to pick up the same conventions that we have in prod
[10:17:19] that we use in the deployment pipeline I mean
[10:18:54] I am all for consistency :)
[10:26:26] hey, it seems that resources could not be increased with my prev patch as there is a 3Gi max constraint per container so pods were never created. I created a new patch to match that constraint. let me know if it is ok https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865587
[10:27:19] isaranto: we can check where the 3Gi limit is and lift it, so you can keep testing
[10:27:59] it is in common.yaml (shared by all clusters, so we can't touch it) in admin_ng
[10:28:03] limits: etc..
[10:28:04] sure, I found out it is universal
[10:28:08] err limitranges
[10:28:17] yy unless we override it for staging (?)
[10:28:40] nono in that case we need to use ml-serve-yaml in admin_ng
[10:28:50] it uses a separate helmfile hierarchy basically
[10:29:02] in there you'll see that we already override limitranges for min cpu etc..
[10:29:12] you can add the extra limit for memory
[10:33:42] Machine-Learning-Team: Remove hack from ML's blubber files - https://phabricator.wikimedia.org/T324658 (elukey)
[10:38:53] elukey: like this? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865589
[10:40:44] isaranto: yeah exactly, let's see what CI says
[10:41:53] need to run an errand, back in a few!
[11:16:25] isaranto: while I was driving I realized that your initial suggestion about the override for staging was the right one, we can apply the override only to ml-staging-codfw/values.yaml in admin_ng
[11:16:40] could you update the patch please? Sorry :(
[11:16:44] otherwise I can do it
[11:16:48] hehe
[11:17:04] I will! I was laughing about the "while driving" part
[11:17:42] there was also some swearing involved after the realization of course
[11:19:54] no need man!
[11:20:01] but I get u
[11:20:23] anyway I sent the new patch, let's hope CI agrees 🤞
[11:27:00] seems like it works
[11:27:37] yep, I am going to deploy it in a second
[11:32:20] isaranto: the pods should now be created, in theory
[11:34:50] elukey: didn't trigger anything and manual sync doesn't do anything since there is no change. any other way I can trigger pod creation?
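For reference, the reason the pods were never created can be sketched as a plain comparison: a container's memory limit is checked against the LimitRange `max`, and anything above it is rejected. The snippet below is only an illustration under stated assumptions, not the cluster's actual admission logic; `parse_memory` and `exceeds_max` are hypothetical helper names, and only a subset of Kubernetes quantity suffixes is handled.

```python
import re

# Subset of Kubernetes quantity suffixes (assumption: enough for memory values).
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}

_QUANTITY_RE = re.compile(r"^(\d+(?:\.\d+)?)([A-Za-z]*)$")


def parse_memory(quantity: str) -> int:
    """Convert a quantity string such as '3Gi' into a number of bytes."""
    match = _QUANTITY_RE.match(quantity.strip())
    if match is None or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(float(match.group(1)) * _SUFFIXES[match.group(2)])


def exceeds_max(container_limit: str, limitrange_max: str) -> bool:
    """Return True if the container limit is above the LimitRange max."""
    return parse_memory(container_limit) > parse_memory(limitrange_max)


if __name__ == "__main__":
    # A 4Gi limit against the 3Gi max mentioned in the log would be rejected.
    print(exceeds_max("4Gi", "3Gi"))
    print(exceeds_max("3Gi", "3Gi"))
```

Running a check like this against the values file before sending a patch would flag a request above the per-container max early, instead of finding out from missing pods.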
[11:36:58] very interesting (not entirely related)
[11:36:59] Last State: Terminated
[11:36:59] Reason: OOMKilled
[11:37:03] for enwiki drafttopic
[11:37:49] isaranto: I think that we have to kill them manually, doing so with enwiki as test
[11:38:07] yeah that does it
[11:39:18] mmm it doesn't seem so
[11:39:23] which ones did you modify?
[11:40:02] I see https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865104/2/helmfile.d/ml-services/revscoring-drafttopic/values-ml-staging-codfw.yaml
[11:40:05] mmmm
[11:40:28] hmm yeah it is showing prev resources..
[11:42:28] the isvc resource for enwiki-drafttopic seems to have everything set up correctly
[11:42:58] now it seems to be working
[11:43:24] yep
[11:43:27] the new pod is oj
[11:43:29] *ok
[11:47:15] great! did u do anything else other than delete it?
[11:47:26] * isaranto afk lunch
[11:48:07] I was about to ask you the same :D
[11:48:24] it is probably kubernetes taking a bit to reconcile
[11:55:51] <- lunch & doc
[11:56:19] weird, the nllb docker build spits out "error: option --home not recognized" related to fairseq
[12:03:37] anyway, lunch :)
[14:18:49] (seems a pip bug of course https://github.com/serverless/serverless-python-requirements/issues/240#issuecomment-421433584)
[14:44:04] I have a question about a base image we use (`docker-registry.wikimedia.org/buster`) as I understand it is a debian based image, correct? Is it ok if I just use the latest one? I couldn't find a repo where this image is declared so I don't know what the actual changes introduced each time are
[14:44:21] I noticed our images use a different version of buster as a base image
[14:45:52] elukey: I think that may be this bug: https://github.com/pypa/pip/issues/4390
[14:45:53] yes yes let's use the last one
[14:46:05] pip is braindead about ordering of flags
[14:46:23] yeah I think that the -e parameter is not needed IIUC, I am trying without it
[14:46:32] ack
[14:47:09] isaranto: most of the base images and "core" ones are declared in the production-images repo in gerrit
[14:47:23] for example, in our case: knative-serving, kserve, istio, etc..
[14:47:43] we don't pull from external registries, so we have to rebuild every time
[14:47:46] elukey: thank u, that was what I was looking for to understand the differences
[14:48:16] e.g. I see we use `python-buster` and `buster` image so want to dig a bit to understand
[14:52:11] yeah they have different apt packages installed basically
[14:54:31] klausman: has it ever happened to you "Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'"
[14:54:34] ?
[14:55:29] it does make sense since I don't see any requirements.txt copied over to the image, but I am wondering if it is being created by pip with -e or similar?
[14:57:30] mmm no -e is handy for developing but it doesn't seem to be a problem in our case
[14:57:55] (PS1) Ilias Sarantopoulos: blubber: create universal revscoring image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586)
[14:59:38] ahhh no ok I think it is a problem with the workdir
[14:59:39] okok
[15:01:47] google signed me off, coming to the meeting in a bit sorry
[15:45:21] klausman: now that I see the usage of pip -e in the Dockerfile I feel even more sad
[15:45:42] maybe it was a leftover from testing?
[15:54:56] Quite likely
[15:55:10] Especially given the other test/debug remnants
[15:55:56] Also, there used to be the requirement for doing the sub-checkout by hand (and switching to the right branch), so I suspect it's related to that
[16:14:10] Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Cleanup NLLB200 docker image - https://phabricator.wikimedia.org/T324464 (elukey) Started to write some changes in https://github.com/elukey/stopes/blob/aws_publish/misc/aws_publish/dockerfiles/Dockerfile My goals are two: * Try to sep...
[16:32:24] klausman: ahh ok now I see
[16:32:24] WARNING: Target directory /opt/lib/python/site-packages/psutil already exists. Specify --upgrade to force replacement
[16:32:36] so the -e is a trick for the override
[16:32:36] sigh
[16:38:48] I don't think there's any aspect of the setup that I am thrilled about
[16:42:30] it will be a little painful but moving to blubber is the long term solution in my opinion.. maybe for the moment we can go ahead with a regular multi-stage build etc..
[16:42:40] and then we can move to blubber
[16:43:20] Yes, I agree that the bits that are more urgent to fix (root) than others (general structure) can be done by making a better Docker file
[17:06:10] weird, I see failures for
[17:06:10] https://github.com/klausman/fairseq/blob/wmfnllb/setup.py#L255
[17:06:47] The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
[17:06:53] maybe a new one just released?
[17:08:06] anyway, logging off, talk with you next week folks!
[17:15:04] cu Luca!
[17:15:27] indeed sklearn was supposed to be deprecated as the standard one has always been scikit-learn
[17:22:49] (PS1) AikoChou: revertrisk: fix mwapi session host headers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865713 (https://phabricator.wikimedia.org/T323023)
[17:42:15] (PS2) AikoChou: revertrisk: fix mwapi session host headers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865713 (https://phabricator.wikimedia.org/T323023)
[21:05:16] (Abandoned) Ladsgroup: Make redis connections for cache slightly healthier [services/ores/deploy] - https://gerrit.wikimedia.org/r/633578 (https://phabricator.wikimedia.org/T263910) (owner: Ladsgroup)
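On the 'sklearn' failure reported at 17:06: the error message itself names the fix, which is to depend on 'scikit-learn' instead. A minimal sketch of that rename applied to a list of requirement strings follows. It assumes simple `name<specifier>` entries; `fix_requirement` and `RENAMES` are hypothetical names for illustration, and this is not the change that was actually made to the fairseq fork's setup.py.

```python
import re

# Deprecated alias -> canonical PyPI name, taken from the error message in the log.
RENAMES = {"sklearn": "scikit-learn"}

# Matches the leading distribution name of a simple requirement string.
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def fix_requirement(req: str) -> str:
    """Replace a deprecated package name, keeping any version specifier."""
    match = _NAME_RE.match(req)
    if match is None:
        return req
    canonical = RENAMES.get(match.group(1).lower())
    if canonical is None:
        return req
    return req[:match.start(1)] + canonical + req[match.end(1):]


if __name__ == "__main__":
    install_requires = ["numpy", "sklearn>=1.0", "scikit-learn"]
    print([fix_requirement(r) for r in install_requires])
```

Only an exact match on the name is rewritten, so entries that already say `scikit-learn`, or unrelated packages whose names merely start with `sklearn`, are left alone.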