[08:10:15] morning folks
[08:10:19] repooled ores2009
[08:19:14] (commuting to a co-working, bbiab
[08:19:16] )
[09:33:24] back!
[09:37:15] isaranto: o/ if you need me lemme know
[09:38:58] elukey: o/ I need to patch the pods. Let me check if a patch would work, otherwise we can edit them directly
[09:42:39] isaranto: yes the patch will work, but two things need to be verified:
[09:43:00] 1) the change is applied only to staging (there is a specific staging values.yaml to apply overrides)
[09:43:34] 2) the templates for kserve-inference support adding the limit/resource values to the predictors. In theory the InferenceService CRD supports it
[09:43:42] but I don't recall if we can add it to the predictor
[09:46:20] see https://github.com/kserve/kserve/issues/1442
[09:46:40] we render "predictor" as we set it in values.yaml, so it should work
[09:50:44] elukey: the patch for enabling MP would work, but the resources are not applied as it seems now. Lemme check first and will get back to you
[10:13:52] isaranto: I left some comments on the code review
[10:15:44] I think that if you add a few bits the change should work
[10:25:53] elukey: thanks, will make the changes in a bit. Trying to fix my local setup with helmfile at the moment as I can't run `helmfile template` to render the end result (like on the deployment server)
[10:27:06] isaranto: I never really made it work, I suggest using something like `helm3 template -f fixture.yaml chart/...`
[10:27:58] --
[10:28:09] I am reviewing k8s LIST latency alerts, and I noticed this in the logs
[10:28:10] knative.dev/pkg/controller/controller.go:618: Failed to list *v1.Service: Timeout: Too large resource version: 827305727, current: 827305318
[10:29:49] at least for eqiad
[10:29:58] thanks for the hint cause I was hitting a wall there...
[10:32:06] isaranto: I have to ask for mercy since those templates are very complicated :(
[10:32:19] maybe we can improve the docs
[10:32:46] don't worry about me, I don't judge!
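[editor's note] The staging-only override discussed above would look roughly like the sketch below. The key layout (`inference_services`, the service name, the `predictor` nesting) is a hypothetical assumption about the kserve-inference chart; what is certain is that the InferenceService CRD accepts container `resources` requests/limits for the predictor.

```yaml
# Hypothetical staging values.yaml override — key names are illustrative,
# the real layout depends on the kserve-inference chart templates.
inference_services:
  enwiki-goodfaith:
    predictor:
      # Rendered into the predictor container's resources; the
      # InferenceService CRD supports this (see kserve/kserve#1442).
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
```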
:D
[10:34:56] elukey: I was thinking that if we use kustomize like here --> https://github.com/roboll/helmfile#helmfile--kustomize we could deploy all deployments at once (or group them) -- of course keeping in mind the limitation of firing up many new pods at once you have mentioned
[10:35:10] not something for now but just leaving it out there..
[10:37:10] yeah we can work on it, afaics there is no support for kustomize atm so we'll need to chat with SREs in case
[10:38:12] trying to delete the knative controller pods on ml-serve-eqiad
[10:55:15] after deleting the networking-istio controllers all latencies got back to normal
[10:55:33] elukey: \o/ managed to get helmfile to work locally
[10:55:52] after some quick search I think that these issues are a weird combination of old knative + old k8s, hopefully going away with the k8s upgrade
[10:55:55] nice!
[10:56:29] added the chartmuseum repo to helm (`helm repo add wmf-stable https://helm-charts.wikimedia.org/stable/`) and then `helmfile -e ml-staging-codfw template` works
[10:56:54] super
[10:58:09] however as I understand this wouldn't reflect changes made locally to the charts as it fetches them from the chart repo. To do that you need to manually supply the chart directory
[10:58:52] isaranto: the alternative for deployment-charts is to use `rake run_locally['default']`
[10:58:57] that will essentially run CI
[10:59:03] and you get diffs etc..
[10:59:09] but it is slower of course
[11:03:50] Good news! I have just successfully queried the AWS service from my workstation. No ACLs or anything atm, but at least the query works end-to-end
[11:04:18] klausman: \o/
[11:04:49] isaranto: if I send you a cmdline as a DM, could you try it for me? Just to be sure no AWS credentials on my machine play a role
[11:05:08] sure!
[11:06:33] working without credentials confirmed \o/
[11:09:10] nice :)
[11:21:41] going afk for lunch, ttl!
[11:32:23] * isaranto afk lunch
[13:43:02] Morning all!
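[editor's note] The local-rendering workflow above, collected in one place. The first two commands are quoted from the conversation; the last line is the suggested workaround for rendering *local* chart edits — the chart path shown is illustrative.

```
# One-time: add the WMF chart repo, then render an environment locally
helm repo add wmf-stable https://helm-charts.wikimedia.org/stable/
helmfile -e ml-staging-codfw template

# helmfile fetches published charts, so local chart edits are not picked up.
# To render those, point helm at the chart directory directly (path illustrative):
helm3 template -f fixture.yaml charts/kserve-inference/
```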
[13:52:28] o/
[14:00:03] \o
[14:00:14] chrisalbon: AWS querying from outside works \o/
[14:00:29] Yeesssss!!!
[14:00:41] Sent Santhosh and Pau some testing code while I go back to figuring out the WMF API GW stuff :)
[14:03:24] o/
[15:13:22] isaranto: o/
[15:13:51] let's chat in here, it's quicker - what model servers do you want to change? zhwiki only or all the edit quality ones in staging?
[15:15:07] elukey: just fixed the patch (waiting for CI). I want to change all of them, so I can add the changes in the same patch and we can just merge this, run the tests, and then I can remove them (unless resources are an issue). wdyt?
[15:15:56] isaranto: no problem for me to test all, but in the last patch you changed only zhwiki
[15:16:02] this is why I am asking
[15:17:17] lemme fix them and I'll ping you when it is final so that we can merge it
[15:17:36] I can't run `rake run_locally['default']`, any hints?
[15:17:51] do you get a specific error?
[15:17:58] `zsh: no matches found: run_locally[default]`
[15:18:13] first time I use rake
[15:18:20] do you have rake installed?
[15:18:32] yes..
[15:18:44] mmmm
[15:19:41] I usually run it in bash, that is the only main difference, can you try?
[15:22:09] https://www.irccloud.com/pastebin/BFQoSvmC/
[15:22:32] sorry for the spam. This is the trace I get, it has a problem with git
[15:23:01] what's the website we use for pasting snippets? (I can't recall)
[15:23:16] phabricator's phaste :)
[15:23:28] aa yes!
[15:23:38] ah ok, so the output now makes sense, you'd need to install the ruby gems I believe
[15:24:07] it has been a while so I don't recall exactly
[15:24:47] isaranto: ah ok I see in the README, there is a step for it
[15:24:54] with the command to install the git gem
[15:25:08] aaa yes nevermind, I'll figure it out. It has been a while for me and gems
[15:27:27] elukey: is it ok if I do only the en ones?
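[editor's note] The `zsh: no matches found: run_locally[default]` error above is zsh globbing, not rake: zsh treats `[...]` as a filename pattern and aborts when it matches nothing, while bash passes an unmatched glob through literally (which is why the task "just works" in bash). Quoting the task name, or `noglob`, fixes it in zsh:

```shell
# bash/sh pass an unmatched glob through untouched, so the argument
# reaches the command literally:
echo run_locally[default]

# In zsh the equivalent fixes are quoting or noglob (shown as comments,
# since this snippet runs under sh/bash):
#   rake 'run_locally[default]'
#   noglob rake run_locally[default]
```

This is also why switching shells alone made the command work in the conversation above.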
[15:28:12] isaranto: sure, as you prefer, but we should limit the changes to a single wiki (the resource/limits will be applied to all)
[15:28:28] on it!
[15:38:37] elukey: the gerrit patch should be ok now. I increased resources only for en-wikizzz
[15:45:27] isaranto: there are still a few bits that are missing.. the commit msg needs to be updated (and at this point I'd add more words after the first line to explain what we are doing)
[15:45:42] ack
[15:45:43] in articlequality it seems that all the wikis are changed though, intended?
[15:46:47] nope, me sloppy...
[15:52:24] isaranto: left another two comments, then it should be good to go
[15:56:16] isaranto: I think you missed one of my prev comments, you are setting "true" but it will not work :(
[15:56:19] we need "True"
[15:59:32] yes, I missed that. tbh this should be in line with yaml and be read as a boolean and not a String. I'll keep it in mind
[16:00:21] elukey: I made all the changes. Thank you for your time and apologies for my sloppiness. It'll get better :D
[16:04:47] isaranto: no problem at all! This is yaml madness so it takes a while to get used to it :)
[16:04:59] The change looks good but I have one last doubt
[16:05:16] if you check the CI's output, the "True" is still rendered as "true" afaics
[16:05:31] I am wondering if it is due to the absence of "" around the True value
[16:07:24] that should be it, cause in yaml true==True afaik
[16:07:46] ack, let's change it then
[16:08:36] done
[16:09:02] super
[16:12:27] klausman: https://phabricator.wikimedia.org/T324200 FYI
[16:12:51] :+1:
[16:15:32] isaranto: merged :)
[16:15:51] 🙌
[16:45:05] https://www.irccloud.com/pastebin/sGEJOEcl/
[16:47:59] ah interesting!
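[editor's note] The `true` vs `True` confusion above is standard YAML scalar resolution: unquoted `true`, `True`, and `TRUE` all resolve to the same boolean, so a template round-trip re-emits the canonical `true`. Only quoting keeps the value a string. A minimal illustration (key names invented):

```yaml
# All three parse as the boolean true, so a parse-and-render cycle
# prints them identically as "true":
a: true
b: True
c: TRUE
# Quoting forces a string, so the literal text "True" survives rendering:
d: "True"
```

This matches what happened in the CI output: the unquoted `True` was parsed as a boolean and rendered back as `true`, and adding `""` fixed it.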
[16:51:10] isaranto: I think that the option needs to be added under the "containers" section of our template
[16:51:17] where the env variables are added
[16:51:28] see `kubectl describe isvc enwiki-goodfaith -n revscoring-editquality-goodfaith`
[16:51:56] yeah I wrote that in the snippet above
[16:52:02] this is what I am trying to do
[16:52:18] ah okok, completely lost that bit
[16:53:12] going afk in a few, but if you need me we can chat about it tomorrow
[16:53:13] unfortunately I don't have access to isvcs... I saw it from the pod description though. Will probably need to edit the revscoring_services.yaml template in the chart
[16:53:39] yes correct, I'd say also services.yaml
[16:53:46] (so we mirror the same functionality)
[16:53:51] sure. If I can't make it I'll ping you, thanks & good evening /o
[16:54:56] have a good rest of the day folks!
[16:54:57] o/
[17:53:47] old.
[17:53:53] oops, wrong window :)
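[editor's note] For context on the "containers section" discussed above: in the rendered InferenceService, env variables and resources both live on the predictor's container spec, so the template change would land there. A sketch of the rendered object — the env variable name is purely illustrative, not the real option from the patch:

```yaml
# Sketch of the rendered InferenceService; the env var name is a
# placeholder for whatever option the chart template injects.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: enwiki-goodfaith
  namespace: revscoring-editquality-goodfaith
spec:
  predictor:
    containers:
      - name: kserve-container
        env:
          - name: SOME_MODEL_SERVER_OPTION   # illustrative placeholder
            value: "True"
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
```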