[15:24:55] Did something change recently regarding the locations of SSL certs? My prod clusters show a change on admin_ng diff, and my staging cluster is completely broken (even for kube_env admin) for kubectl.
[15:25:39] s/kubectl/helmfile diff/
[15:26:15] Mh, the SSL change is only for serving.kserve.io/s3-cabundle
[15:26:30] (/usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt -> /etc/ssl/certs/wmf-ca-certificates.crt)
[15:45:33] akosiaris: do you think that my debugging change maybe blotted out an upstream config and that's the cause? Trying to wrap my head around the templating/YAML layers atm
[16:05:08] Welp, rolling back that change temporarily did not fix anything :-/
[16:06:26] kubectl still works correctly, fwiw, but deployment is broken atm.
[16:22:47] Yeah, I'm stuck, no idea what is going on, or how it broke.
[16:47:10] klausman_: currently in a meeting, will be free in 15m
[17:14:05] * akosiaris looking
[17:14:32] klausman_: what are you seeing exactly?
[17:14:40] sec
[17:15:30] in helmfile.d/ml-services/experimental/, running helmfile -e ml-staging-codfw diff fails with a permission error: "Error: query: failed to query with labels: secrets is forbidden: User "experimental-deploy" cannot list resource "secrets" in API group "" in the namespace "experimental": RBAC: clusterrole.rbac.authorization.k8s.io "deploy-kserve" not found"
[17:16:15] the prod environments work fine, as do the non-kserve subdirs (e.g. ores-legacy, which is just a frontend to the actual revscoring services)
[17:20:24] so, the s3-cabundle change that you see is the result of this puppet private change: 9d0be2f38bb882c7921d521cbbcf3cd93b6d54b0
[17:20:33] (eluke.y) Move away from the Puppet CA bundle all the ml-serve isvcs
[17:20:40] merged on Dec 7 2023
[17:20:53] Yeah, I think it's unrelated
[17:20:56] so I assume it was just not deployed on the deployment
[17:21:02] and yes, it's unrelated, you are right
[17:21:08] I pushed it to all three of our envs, and it didn't make a difference.
[17:21:13] looking into the RBAC rights now
[17:21:18] ty!
[17:21:43] I used kubectl describe clusterrole ... to see if they differ between envs, but I couldn't spot anything
[17:31:00] ah, found it, I think
[17:31:38] in your patch, we overrode deployExtraClusterRoles for ml-staging-codfw, meaning it no longer inherited the values of the main ml-serve
[17:31:47] oooh!
[17:32:05] and the diff is "kserve", which creates the deploy-kserve clusterrole that the error message is complaining about
[17:32:13] So it has to be the same in both staging and serve, or I need to figure out how to augment a YAML list instead of overwriting it
[17:32:58] it's helmfile internal stuff, it does various templates and augmentations across the various included files, and sometimes it's black magic
[17:33:42] e.g. having a single new YAML doc stanza in a line (i.e. "---") will make it include things that otherwise it wouldn't
[17:34:16] to save you the trouble of having to dig into that hellhole, just delete lines 22-23 from values/ml-staging-codfw/values.yaml
[17:34:22] and you should be good to go
[17:35:52] making a patch
[17:35:56] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/992764
[17:37:06] Thanks a lot for your help!
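The root cause discussed above is a common YAML-layering pitfall: when an environment-specific values file redefines a key whose value is a list, the override *replaces* the inherited list rather than appending to it. A minimal Python sketch of that merge behavior follows; the file contents are hypothetical stand-ins (the base list entry "kserve" is taken from the chat, "debug-role" is an invented override value), not the actual deployment-charts values:

```python
# Sketch of shallow per-key merging of layered values files, as a plain
# dict merge: the override's list value replaces the base list wholesale.

# Base values (e.g. the main ml-serve values file), simplified:
base = {"deployExtraClusterRoles": ["kserve"]}

# Environment override (e.g. values/ml-staging-codfw/values.yaml) that
# redefined the same key for a debugging change ("debug-role" is made up):
override = {"deployExtraClusterRoles": ["debug-role"]}

# Later files win per key; lists are not concatenated.
merged = {**base, **override}

print(merged["deployExtraClusterRoles"])  # ['debug-role']
# "kserve" is gone from the merged list, so the deploy-kserve ClusterRole
# is never created, and helmfile diff fails with the RBAC "not found" error.
```

This is why deleting the override lines (rather than trying to extend the list) restores the inherited "kserve" entry and fixes the missing ClusterRole.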