[06:51:38] accraze: o/ I will check later on to see what the error is
[08:32:57] Machine-Learning-Team, Platform Team Initiatives (API Gateway): Proposal: add a per-service rate limit setting to API Gateway - https://phabricator.wikimedia.org/T295956 (elukey)
[11:01:11] Machine-Learning-Team, Platform Team Initiatives (API Gateway): Proposal: add a per-service rate limit setting to API Gateway - https://phabricator.wikimedia.org/T295956 (Urbanecm) No problem with this (if there's a use case here), but note that individual clients with an acceptable need for higher rate...
[12:16:56] Machine-Learning-Team, Platform Team Initiatives (API Gateway): Proposal: add a per-service rate limit setting to API Gateway - https://phabricator.wikimedia.org/T295956 (elukey) Hi @Urbanecm! Thanks for the link, very interesting, I didn't know it. My understanding of the API-Gateway is still very high...
[12:21:56] istio network policies deployed! They seem to be working fine
[12:22:01] tried to kill some pods too
[12:22:21] * elukey lunch
[12:50:29] the last step is to add default-deny to the global network policies
[13:27:56] deployed! I am deleting pods to make sure that they can be re-created correctly
[13:28:25] it will also be good for https://phabricator.wikimedia.org/T289578 (I restarted docker on the ml-serve nodes yesterday)
[13:40:32] all pods restarted
[13:43:19] all metrics look good
[13:43:23] \o/
[13:59:35] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Add network policies to the ML k8s clusters - https://phabricator.wikimedia.org/T289834 (elukey) Istio policies applied, plus global default-deny (same used in other clusters) applied. Deleted all the pods, they came back up correctly!
[13:59:56] Lift-Wing, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[14:00:05] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Add prometheus metrics collection for Istio and Knative - https://phabricator.wikimedia.org/T289841 (elukey) Open→Resolved a:elukey
[14:02:45] Lift-Wing: Bootstrap the ml-serve-codfw cluster - https://phabricator.wikimedia.org/T294412 (elukey) The base network policies have been deployed to eqiad, so we can proceed with codfw (to double check that everything will go fine when starting from scratch). I reviewed the puppet private and public config,...
[14:27:42] going to run an errand for a couple of hours, ttl!
[15:51:12] Machine-Learning-Team, artificial-intelligence, Wikidata, Wikidata-Query-Service, articlequality-modeling: Add ORES article quality predictions to the WDQS - https://phabricator.wikimedia.org/T257341 (Gehel)
[16:37:19] o/
[17:03:36] accraze: o/
[17:03:45] still haven't checked the ml sandbox, doing it now!
[17:04:52] mmm how do you check pods etc..?
[17:04:56] I guess there is minikube
[17:05:29] ahh it is installed okok
[17:06:21] elukey: yeah it's minikube, i think the issue is related to the istio gateway but unsure
[17:06:36] accraze: how do you check pods?
[17:06:45] k get po -A
[17:08:14] and those are aliases, yeah I see them in your home
[17:08:54] ahh whoops yeah.... i aliased `minikube kubectl` to be `k`
[17:13:54] accraze: so I think that it works for you since you have a .kube dir in your homedir
[17:13:58] with the config etc..
[17:14:53] Jade, Machine-Learning-Team: Investigate MCR support gap for Jade purposes - https://phabricator.wikimedia.org/T204303 (CBogen)
[17:19:11] how did you and Kevin share minikube on the previous sandbox?
[17:20:40] the minikf distro did it all for us :/
[17:21:10] ahhh
[17:24:32] oh i remembered i installed minikube to /usr/local/bin
[17:25:31] oof
[17:30:13] accraze: the main issue is that if I run minikube etc.. I get that no cluster is running
[17:30:30] I can try a hack and copy your .kube dir
[17:30:51] ah i see, yeah copying the .kube dir might work
[17:32:19] also the .minikube dir
[17:32:24] plus some tweaks
[17:36:44] accraze: ok I made it :D
[17:37:29] one thing that I noticed is that there is no cluster-local-gateway pod
[17:37:40] what istioctl config did you apply?
[17:39:44] when i do `k get gw -A` it shows cluster-local-gateway in the knative-serving namespace
[17:41:22] i used istioctl from the wmf APT repo and then applied istio-minimal-operator.yaml
[17:41:26] https://wikitech.wikimedia.org/wiki/User:Accraze/MachineLearning/Local_Kserve#Istio
[17:44:28] serving.knative.dev/release=v0.22.0
[17:44:30] :)
[17:44:42] you are in the future!! :D
[17:44:50] omggg it should be v0.18.x??
[17:45:15] exactly yes! From 0.19 onward we don't need the cluster-local-gateway pod in the istio namespace
[17:45:23] ahhhhhh
[17:46:03] okok that makes a lot of sense
[17:52:14] i'll think about how to share the cluster for all users, i think minikf ran everything inside of virtualbox
[17:57:43] oh actually when i check the knative version i get v0.18.1
[17:58:00] `kubectl get namespace knative-serving -o 'go-template={{index .metadata.labels "serving.knative.dev/release"}}'`
[18:03:52] how did you deploy knative?
[18:06:06] i downloaded the crd/core/release yaml from github and then changed the images to use the ones from the wmf registry
[18:08:09] and then just did k apply -f ... for each of those
[18:09:52] ah weird then
[18:10:15] ahhh ok wait a min, you picked the latest release yaml files right?
[18:10:20] not the 0.18.1 ones
[18:12:15] nah i actually did the 0.22.0 yaml files last week and then re-did it on monday using the 0.18.1 files
[18:12:56] kinda confused how you are seeing the 0.22.0 versions still
[18:18:38] very weird then
[18:18:49] I just did kubectl describe pod blabla on one
[18:20:44] the docker image is Image: docker-registry.wikimedia.org/knative-serving-activator:0.18.1-3-20211107
[18:21:20] so I can't explain the 0.22, maybe when you re-applied the 0.18.1 config it was not from a clean state?
[18:22:06] anyway, for istio we surely need to add the cluster-local-gateway
[18:22:13] haha yeah maybe that was it, i'll try tearing down and starting over today now that i have a better understanding of things
[18:22:34] the one that we have in deployment-charts explicitly creates it
[18:22:51] otherwise local comms between pods will not happen (since we don't use the mTLS configs)
[18:36:59] going afk!
[18:37:15] accraze: please post any questions/doubts/etc.. here if needed, I'll try to answer tomorrow morning!
[18:38:30] ok cool thanks, have a good evening elukey!
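A possible reading of the ".kube + .minikube copy" hack that elukey describes above (17:30-17:36), as a minimal sketch: the paths, the `accraze` source homedir, and the exact "tweaks" are assumptions, since the log does not spell them out.

```bash
# Hypothetical sketch: share accraze's existing minikube cluster with another
# sandbox user by copying his client config and cluster state dirs.
# Assumes ~/.kube and ~/.minikube do not already exist for the current user.
sudo cp -r /home/accraze/.kube "$HOME"/.kube
sudo cp -r /home/accraze/.minikube "$HOME"/.minikube
sudo chown -R "$USER": "$HOME"/.kube "$HOME"/.minikube

# The copied kubeconfig still references cert paths under the original
# homedir, so rewrite them (one plausible interpretation of "some tweaks"):
sed -i "s|/home/accraze|$HOME|g" "$HOME"/.kube/config

# Sanity check, equivalent to the `k get po -A` alias in the conversation:
minikube kubectl -- get po -A
```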
[18:44:23] accraze: I just realized one thing while walking away from the keyboard :D You were talking about a dev chart, but we could try this compromise
[18:44:38] when I test charts locally I usually do
[18:45:26] helm3 template 'charts/knative-serving'
[18:45:33] in the deployment-charts repo
[18:45:55] if you don't pass anything to it, it will use the values.yaml contained in the chart, with default values
[18:46:15] but we set other values via helmfile
[18:46:59] we can try
[18:47:00] helm3 template -f helmfile.d/admin_ng/knative-serving/values.yaml charts/knative-serving
[18:47:59] if you save it as a yaml file it will be basically what we apply in production
[18:48:03] the same goes for kserve
[18:48:12] (but changing file paths of course)
[18:48:24] for kserve-inference
[18:49:11] helm3 template -f helmfile.d/ml-services/revscoring-editquality/values.yaml charts/kserve-inference
[18:49:14] etc..
[18:49:55] there are some things to tweak since you'll see a lot of RELEASE-NAME here and there (in prod helm replaces them)
[18:50:02] but with a quick sed they can be changed
[18:50:25] it would be nice to find something like the above that works so it would be easy-ish to re-use our prod charts
[18:50:43] EOF :)
[18:50:46] * elukey afk again
[18:51:40] elukey: awesome thanks! i will give this a try this afternoon
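One way to string elukey's suggestion together end to end, as a sketch: the output paths, the release names substituted for RELEASE-NAME, and applying the result with the `k` alias are assumptions for illustration, not part of the original instructions.

```bash
# Run from a checkout of the deployment-charts repo.

# Render the knative-serving chart with the production helmfile values:
helm3 template -f helmfile.d/admin_ng/knative-serving/values.yaml \
  charts/knative-serving > /tmp/knative-serving.yaml

# Same idea for kserve-inference, using the revscoring-editquality values:
helm3 template -f helmfile.d/ml-services/revscoring-editquality/values.yaml \
  charts/kserve-inference > /tmp/kserve-inference.yaml

# helm3 leaves RELEASE-NAME placeholders (as noted in the log); swap in a
# name before applying. The names chosen here are hypothetical:
sed -i 's/RELEASE-NAME/knative-serving/g' /tmp/knative-serving.yaml
sed -i 's/RELEASE-NAME/revscoring-editquality/g' /tmp/kserve-inference.yaml

# Apply to the local minikube cluster, e.g. via the `k` alias from earlier:
k apply -f /tmp/knative-serving.yaml
k apply -f /tmp/kserve-inference.yaml
```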