[08:59:48] hello folks
[09:06:12] Mornin
[09:13:35] klausman: o/ I am still not able to push the docker image :(
[09:14:04] I am retrying with
[09:14:06] python3 deploy.py -f wikipedia-distillated/ -d wikipedia-distillated-20221123-153939
[09:14:17] What error do you get?
[09:14:17] but I keep getting the nope 400 response
[09:14:29] is the region environment var set?
[09:14:51] yep I have AWS_REGION=us-east-1
[09:15:09] weird
[09:15:31] maybe try an entirely fresh build (no -d wiki...)
[09:17:21] I'm not sure whether the region is somehow embedded in the data associated with the already-built image.
[09:17:37] mmm so it fails doing the docker login
[09:17:56] You have to re-do the credentials, since they expire after 20 hours or so
[09:18:35] and yes, the get_temp... script is not robust: if the KEY vars are already set it falls over since the aws cli is... not great
[09:18:44] I did yes
[09:19:05] Let me give it a try
[09:19:32] klausman: from what I can see I get the nope.etc.. as ecr_registry
[09:19:50] Which shouldn't happen as it _should_ use the REGION var
[09:20:51] deploy_config.py is where the "nope" would come from, I guess
[09:20:59] oh hang on, did you set "AWS_ACCOUNT"?
[09:21:13] The get_temp script doesn't do that since it's committed to public GH
[09:21:31] ah no
[09:21:36] is it mentioned in the doc?
[09:22:00] That it needs setting, yes, and the script warns about it. But the value is not committed anywhere
[09:22:26] maybe it should be louder about it
[09:23:08] what value should I use? The account id?
[09:23:12] yes
[09:23:28] the number starting with 618
[09:25:21] nope is gone, but I still get the 400
[09:26:20] let me give it a try, just to see if it isn't something I messed up somewhere
[09:26:50] maybe I don't have access to push to it or similar
[09:27:05] yeah, while my try runs, I will do some rummaging in IAM
[09:27:42] also the script doesn't seem to warn about the absence of AWS_ACCOUNT afaics
[09:27:52] at least with -d $docker-image-etc..
[09:28:17] The get_temp one does, but yeah, deploy should as well
[09:29:39] https://us-east-1.console.aws.amazon.com/iam/home#/users/elukey says you should have full access (AmazonEC2ContainerRegistryFullAccess policy/group)
[09:31:42] The groups we both are in are identical, as are the add'l policies attached directly. I don't think this is a permission issue
[09:31:57] okok super
[09:33:29] Sent you a screenshot on Slack. Works fine for me, so it's not a quota issue, either
[09:34:09] ah now I see
[09:34:09] Note that the AWS_ACCOUNT env var is empty.
[09:34:10] Running e.g. deploy.py will not work without it.
[09:34:20] maybe let's make it more prominent, I totally missed it
[09:34:25] yes, will do
[09:34:42] But the var was set in your latest attempt, right?
[09:34:49] yes yes
[09:36:17] klausman: does it work if you try with -d $docker-image as well?
[09:36:36] (I didn't get if you did the full rebuild or not)
[09:36:40] and it's the docker push that fails, not the later push of the model, right? (log message INFO:build:Archiving the model wikipedia-distillated-20221124-102631 ...)
[09:36:57] I am currently running a full build
[09:37:13] I dunno what would happen if that succeeds and if I tried to repush it. We'll see
[09:37:37] klausman: it is the docker login that fails for me
[09:38:53] line 160 of deploy.py
[09:39:01] does your .docker/config.json contain valid-looking data?
[09:39:18] It's created by the script.
[09:40:27] where is it created?
[09:41:25] in your homedir
[09:41:55] mmm no I don't have that file
[09:42:43] hmmm. Did *I* create that?
[09:42:49] my notes suck :(
[09:43:32] so the script mentions the need of a "setup_config.json"
[09:43:35] does it ring a bell?
[09:44:11] that's ./dockerfiles/setup_config.json
[09:44:24] it only contains entrypoint, extra file includes and stuff like that
[09:44:28] No login info
[09:44:43] okok I see
[09:45:30] I just found a docker login cmdline in my bash_history.
So it's not auto created
[09:46:17] * elukey bbiab
[09:46:24] But that cmdline is... wrong, it does not contain a valid endpoint (still using 'nope')
[09:47:35] It's puzzling that I could just copy the GH checkout to another unrelated machine and pushing from there just works.
[09:48:27] klausman: I can try a full build, I lost the info if you retried with -d image option
[09:50:07] For some reason -d has stopped working for me?!
[09:50:15] It seems to always build the image now?
[09:50:46] gah, c&p error.
[09:53:09] aaah, I get why I can't easily retry: the script - if successful - deletes the local images it created, so me using -d doesn't work since I don't have the image anymore
[09:54:52] as for the docker login
[09:55:04] See deploy.py:154
[09:55:59] note ecr_registry; the actual login should be happening on line 160
[09:56:17] https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html it's basically the command line from this page
[09:56:31] aws ecr get-login-password --region region | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com
[09:56:54] with region and account ID set (and deploy.py gets the password from the API instead of the commandline tool)
[09:57:29] My recommendation: add some print/log statements to see what's actually happening at login time
[10:04:14] yes yes the docker login is printed, user AWS and password a super long string
[10:04:25] That sounds correct
[10:04:25] it looks legit, what do you have as user?
[10:04:28] okok
[10:04:29] AWS
[10:05:08] try creating the ~/.docker dir yourself, I don't know if the docker CLI handles that correctly.
[10:05:54] it is create
[10:05:58] *created
[10:06:07] I can try to remove what I have in there
[10:06:13] wait
[10:06:19] is there a config.json in there?
[10:07:09] nope
[10:07:28] I have no idea how/what to add to it, I asked above :)
[10:07:46] sec
[10:07:47] you were wondering if you created it
[10:07:55] I doubt I did.
[10:08:55] $ cat .docker/config.json
[10:08:57] {
[10:08:59]   "auths": {
[10:09:01]     "[ACCOUNT ID].dkr.ecr.us-east-1.amazonaws.com": {
[10:09:03]       "auth": "LOOOOONG string"
[10:09:05]     }
[10:09:07]   }
[10:09:09] }
[10:13:10] same error
[10:13:58] do you have any other files in ~/.docker?
[10:14:39] nope
[10:15:07] what happens if you try to run the command line from the doc I linked?
[10:15:56] (careful, the region shows up twice in the cmdline: once for the aws command, but also in the URL for docker login)
[10:17:22] aws ecr get-login-password --region us-east-1 gives me the pass
[10:18:13] and piping that into docker login --username ... --password-stdin [account id ...] should create/modify the config.json file
[10:20:52] Sent another screenshot on Slack
[10:23:36] klausman: my soul is in pain
[10:23:39] It works now
[10:23:47] What was the issue?
[10:24:04] the aws account id needs to be exported without the dashes
[10:24:18] ooooh
[10:24:19] * elukey cries in a corner
[10:24:28] yeah, not obvious
[10:24:46] please add it as a note in the docs :D
[10:24:47] Does it work with dashes on the AWS web ui login page?
[10:25:05] I copied from the AWS console :D
[10:26:00] I will
[10:28:26] <- meeting, bbiab
[11:16:45] aiko, isaranto - if you test ml-staging please keep in mind that I am testing the rate limit settings, and atm they are not working 100%
[11:17:49] elukey: ok. I was running load tests on ml-staging. I will continue with deploying to prod so staging is all yours.
[11:20:29] for prod I have to deploy both to eqiad and codfw right?
[11:20:44] isaranto: ah snap sorry! I should've asked! I can remove the rate limit and let you test
[11:20:47] no problem
[11:21:00] yep exactly, both codfw and eqiad
[11:21:04] elukey: go ahead I don't need staging for now
[11:24:05] ack!
[11:24:09] going out for lunch
[11:26:43] deploying revscoring-editquality-goodfaith to eqiad and codfw.
I am going to wait for these deployments to roll out nicely and continue with the rest
[11:35:06] ack.
[11:36:52] Lift-Wing, Machine-Learning-Team: Test MultilingualRevertRiskModel inference service locally with docker - https://phabricator.wikimedia.org/T323613 (achou) I was able to run the model server locally with docker yesterday, but there are two issues worth noting. 1. memory usage - I used `docker stats` t...
[12:11:38] <- lunch & errands
[14:55:56] Lift-Wing, Machine-Learning-Team: Test batch prediction for revert-risk model - https://phabricator.wikimedia.org/T323023 (achou) The way we test batch prediction is to have a spark UDF like: ` @udf def getPrediction(body): inference_url = 'https://inference-staging.svc.codfw.wmnet:30443/v1/models/re...
[16:49:20] Lift-Wing, Machine-Learning-Team: Explore ingress filtering for Lift Wing - https://phabricator.wikimedia.org/T300259 (elukey) Very interesting explanation: https://learncloudnative.com/blog/2022-09-08-ratelimit-istio I tested the local rate limit in staging briefly, and it seems working nicely. The mai...
[17:05:45] tests for the local rate limit went good! I am thinking that it is the best compromise for us at the moment
[17:05:53] all info in the task, I'll test metrics tomorrow :)
[17:06:02] have a nice evening folks!
[17:06:07] * elukey afk
[17:09:42] \o
[21:53:48] (PS1) Umherirrender: tests: Replace assertEmpty with assertSame [extensions/ORES] - https://gerrit.wikimedia.org/r/860655
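
Note on the docker login failure discussed between 09:13 and 10:26: the cause turned out to be an AWS_ACCOUNT value exported with dashes, as copied from the AWS console, and the thread also asks for the deploy script to complain louder when the variable is missing. A minimal sketch of a check that would catch both cases early is below. The function name ecr_registry is hypothetical and this is not the code in deploy.py or deploy_config.py; it only assumes that AWS account IDs are 12 digits and that the registry hostname has the form shown in the 09:56:31 command line.

```python
import re


def ecr_registry(account: str, region: str) -> str:
    """Build an ECR registry hostname, failing loudly on bad input.

    Hypothetical helper for illustration; not taken from deploy.py.
    """
    # The AWS console may display the ID with dashes (1234-5678-9012);
    # the registry hostname needs the bare 12 digits.
    digits = account.strip().replace("-", "")
    if not re.fullmatch(r"\d{12}", digits):
        raise ValueError(
            f"AWS_ACCOUNT must be a 12-digit account ID, got {account!r}"
        )
    if not region.strip():
        raise ValueError("AWS_REGION must be set, e.g. us-east-1")
    return f"{digits}.dkr.ecr.{region.strip()}.amazonaws.com"
```

Called with os.environ.get("AWS_ACCOUNT", "") and os.environ.get("AWS_REGION", ""), a check like this would raise before any docker login is attempted, instead of producing a registry name containing a placeholder and a 400 response later on.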