[07:50:21] <wikibugs>	 Machine-Learning-Team, ORES, GitLab (Project Migration): Migrate ORES/Revscoring/etc. repos to Gitlab or Gerrit - https://phabricator.wikimedia.org/T264651 (elukey) Open→Declined Setting this to Declined for the moment, please re-open if needed :)
[07:54:35] <elukey>	 morning folks
[08:00:58] <isaranto>	 moooorning!
[09:05:27] <elukey>	 klausman: o/ I have created a first attempt of multi-stage build in https://github.com/elukey/stopes/commit/4f7f30fdd5650efa26b27b97ab3a00863b9d6c02
[09:05:50] <elukey>	 what is the best way to test it? Can I use deploy.py? I am wondering if it is ok to push to staging etc..
[09:06:00] <elukey>	 (back in a bit)
[09:10:51] <klausman>	 I think deploy.py currently only pushes to staging unless you use --prod
[09:11:32] <klausman>	 Will take a look at the change in a moment
[09:15:33] <klausman>	 Added one comment. Overall, LGTM
[09:34:30] <elukey>	 klausman: thanks! So do you think that we should keep that check in place?
[09:35:09] <klausman>	 Maybe for the first few tries, just in case. But I think once the new scheme of building the image is proven to work, we can ditch it
[09:35:27] <elukey>	 ack, updating it
[10:15:52] <elukey>	 klausman: I am wondering one thing - should we just use blubber's syntax to generate the Dockerfile? I was trying to port to the image the same standards that we use (there are some tricky things about nobody like nonexistent home etc..)
[10:16:06] <elukey>	 but then I realized that we could generate the Dockerfile when needed
[10:16:11] <elukey>	 and use the same standards as prod
[10:16:15] <elukey>	 (if it works)
[10:16:18] <elukey>	 does it sound ok?
[10:16:40] <klausman>	 I have no objections at all. I'm just a little more versed in naked Dockerfiles than Blubber, but that should not stop us
[10:17:07] <elukey>	 the good thing about blubber is that it forces us to pick up the same conventions that we have in prod
[10:17:19] <elukey>	 that we use in the deployment pipeline I mean
[10:18:54] <klausman>	 I am all for consistency :)
[10:26:26] <isaranto>	 hey, it seems that resources could not be increased with my prev patch as there is a 3Gi max constraint per container so pods were never created. I created a new patch to match that constraint. let me know if it is ok https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865587
[10:27:19] <elukey>	 isaranto: we can check where the 3Gi limit is and lift it, so you can keep testing
[10:27:59] <elukey>	 it is in common.yaml (shared by all clusters, so we can't touch it) in admin_ng
[10:28:03] <elukey>	 limits: etc..
[10:28:04] <isaranto>	 sure, I found out it is universal
[10:28:08] <elukey>	 err limitranges
[10:28:17] <isaranto>	 yy unless we override it for staging (?)
[10:28:40] <elukey>	 nono in that case we need to use ml-serve-yaml in admin_ng
[10:28:50] <elukey>	 it uses a separate helmfile hierarchy basically
[10:29:02] <elukey>	 in there you'll see that we already override limitranges for min cpu etc..
[10:29:12] <elukey>	 you can add the extra limit for memory
[10:33:42] <wikibugs>	 Machine-Learning-Team: Remove hack from ML's blubber files - https://phabricator.wikimedia.org/T324658 (elukey)
[10:38:53] <isaranto>	 elukey: like this? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865589
[10:40:44] <elukey>	 isaranto: yeah exactly, let's see what CI says
[10:41:53] <elukey>	 need to run an errand, back in a few!
[11:16:25] <elukey>	 isaranto: while I was driving I realized that your initial suggestion about the override for staging was the right one, we can apply the override only to ml-staging-codfw/values.yaml in admin_ng
[11:16:40] <elukey>	 could you update the patch please? Sorry :(
[11:16:44] <elukey>	 otherwise I can do it
[11:16:48] <isaranto>	 hehe
[11:17:04] <isaranto>	 I will! I was laughing about the "while driving" part
[11:17:42] <elukey>	 there was also some swearing involved after the realization of course
[11:19:54] <isaranto>	 no need man!
[11:20:01] <isaranto>	 but I get u
[11:20:23] <isaranto>	 anyway I sent the new patch, let's hope CI agrees 🤞
[11:27:00] <isaranto>	 seems like it works
[11:27:37] <elukey>	 yep, I am going to deploy it in a second
[11:32:20] <elukey>	 isaranto: the pods should now be created, in theory
[11:34:50] <isaranto>	 elukey: didn't trigger anything and manual sync doesn't do anything since there is no change. any other way I can trigger pod creation?
[11:36:58] <elukey>	 very interesting (not entirely related)
[11:36:59] <elukey>	     Last State:    Terminated
[11:36:59] <elukey>	       Reason:      OOMKilled
[11:37:03] <elukey>	 for enwiki drafttopic
[11:37:49] <elukey>	 isaranto: I think that we have to kill them manually, doing so with enwiki as test
[11:38:07] <isaranto>	 yeah that does it
[11:39:18] <elukey>	 mmm it doesn't seem so
[11:39:23] <elukey>	 which ones did you modify?
[11:40:02] <elukey>	 I see https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865104/2/helmfile.d/ml-services/revscoring-drafttopic/values-ml-staging-codfw.yaml
[11:40:05] <elukey>	 mmmm
[11:40:28] <isaranto>	 hmm yeah it is showing prev resources..
[11:42:28] <elukey>	 the isvc resource for enwiki-drafttopic seems to have everything set up correctly
[11:42:58] <elukey>	 now it seems working
[11:43:24] <elukey>	 yep
[11:43:27] <elukey>	 the new pod is oj
[11:43:29] <elukey>	 *ok
[11:47:15] <isaranto>	 great! did u do anything else other than delete it?
[11:47:26] * isaranto afk lunch
[11:48:07] <elukey>	 I was about to ask you the same :D
[11:48:24] <elukey>	 it is probably kubernetes taking a bit to reconcile
[11:55:51] <klausman>	 <- lunch & doc
[11:56:19] <elukey>	 weird, the nllb docker build spits out "error: option --home not recognized" related to fairseq
[12:03:37] <elukey>	 anyway, lunch :)
[14:18:49] <elukey>	 (seems a pip bug of course https://github.com/serverless/serverless-python-requirements/issues/240#issuecomment-421433584)
[14:44:04] <isaranto>	 I have a question about a base image we use (`docker-registry.wikimedia.org/buster`): as I understand it, it is a debian based image, correct? Is it ok if I just use the latest one? I couldn't find a repo where this image is declared so I don't know what the actual changes introduced each time are
[14:44:21] <isaranto>	 I noticed our images use a different version of buster as a base image
[14:45:52] <klausman>	 elukey: I think that may be this bug: https://github.com/pypa/pip/issues/4390
[14:45:53] <elukey>	 yes yes let's use the last one
[14:46:05] <klausman>	 pip is braindead about ordering of flags
[14:46:23] <elukey>	 yeah I think that the -e parameter is not needed IIUC, I am trying without it
[14:46:32] <klausman>	 ack
[14:47:09] <elukey>	 isaranto: most of the base images and "core" ones are declared in the production-images repo in gerrit
[14:47:23] <elukey>	 for example, in our case: knative-serving, kserve, istio, etc..
[14:47:43] <elukey>	 we don't pull from external registries, so we have to rebuild every time 
[14:47:46] <isaranto>	 elukey: thank u , that was what I was looking for to understand the differences
[14:48:16] <isaranto>	 e.g. I see we use `python-buster` and `buster` image so want to dig a bit to understand
[14:52:11] <elukey>	 yeah they have different apt packages installed basically
[14:54:31] <elukey>	 klausman: has it ever happened to you "Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'" 
[14:54:34] <elukey>	 ?
[14:55:29] <elukey>	 it does make sense since I don't see any requirements.txt copied over to the image, but I am wondering if it is being created by pip with -e or similar?
[14:57:30] <elukey>	 mmm no -e is handy for developing but it doesn't seem to be a problem in our case
[14:57:55] <wikibugs>	 (PS1) Ilias Sarantopoulos: blubber: create universal revscoring image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586)
[14:59:38] <elukey>	 ahhh no ok I think it is a problem with the workdir
[14:59:39] <elukey>	 okok
[15:01:47] <elukey>	 google signed me off, coming to the meeting in a bit sorry
[15:45:21] <elukey>	 klausman: now that I see the usage of pip -e in the Dockerfile I feel even more sad
[15:45:42] <elukey>	 maybe it was a leftover from testing?
[15:54:56] <klausman>	 Quite likely
[15:55:10] <klausman>	 Especially given the other test/debug remnants
[15:55:56] <klausman>	 Also, there used to be the requirement for doing the sub-checkout by hand (and switching to the right branch), so I suspect it's related to that
[16:14:10] <wikibugs>	 Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Cleanup NLLB200 docker image - https://phabricator.wikimedia.org/T324464 (elukey) Started to write some changes in https://github.com/elukey/stopes/blob/aws_publish/misc/aws_publish/dockerfiles/Dockerfile  My goals are two: * Try to sep...
[16:32:24] <elukey>	 klausman: ahh ok now I see
[16:32:24] <elukey>	 WARNING: Target directory /opt/lib/python/site-packages/psutil already exists. Specify --upgrade to force replacement
[16:32:36] <elukey>	 so the -e is a trick for the override
[16:32:36] <elukey>	 sigh
[16:38:48] <klausman>	 I don't think there's any aspect of the setup that I am thrilled about
[16:42:30] <elukey>	 it will be a little painful but moving to blubber is the long term solution in my opinion.. maybe for the moment we can go ahead with a regular multi-stage build etc..
[16:42:40] <elukey>	 and then we can move to blubber
[16:43:20] <klausman>	 Yes, I agree that the bits that are more urgent to fix (root) than others (general structure) can be done by making a better Docker file
[17:06:10] <elukey>	 weird, I see failures for 
[17:06:10] <elukey>	 https://github.com/klausman/fairseq/blob/wmfnllb/setup.py#L255
[17:06:47] <elukey>	 The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
[17:06:53] <elukey>	 maybe a new one just released?
[17:08:06] <elukey>	 anyway, logging off, talk with you next week folks!
[17:15:04] <isaranto>	 cu Luca!
[17:15:27] <isaranto>	 indeed sklearn was supposed to be deprecated as the standard one has always been scikit-learn
[17:22:49] <wikibugs>	 (PS1) AikoChou: revertrisk: fix mwapi session host headers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865713 (https://phabricator.wikimedia.org/T323023)
[17:42:15] <wikibugs>	 (PS2) AikoChou: revertrisk: fix mwapi session host headers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865713 (https://phabricator.wikimedia.org/T323023)
[21:05:16] <wikibugs>	 (Abandoned) Ladsgroup: Make redis connections for cache slightly healthier [services/ores/deploy] - https://gerrit.wikimedia.org/r/633578 (https://phabricator.wikimedia.org/T263910) (owner: Ladsgroup)