[08:40:40] hello folks, I am working on the knative helm chart to add some configuration tunables for the ingress gateway configs
[09:50:56] ok https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/699380/ is ready for a review!
[09:51:02] it looks fine on minikube
[09:51:12] we'll of course need some custom values for our clusters
[09:51:29] like port 443, TLS certs, etc.. but we can do it later on
[10:07:57] Givin' it a read :)
[10:11:19] klausman: thanks!
[10:11:28] as follow up steps I think that we'll need to
[10:11:51] 1) create a TLS cert for inference.wikimedia.org and configure the helm chart for knative accordingly (port 443, secret, etc..)
[10:12:31] 2) figure out what certificate to put in kfserving's webhook (https://phabricator.wikimedia.org/T280661#7134949)
[10:12:38] since in theory cert-manager will not be needed
[10:12:50] these two are the "blockers" for our production set up in my opinion
[10:13:25] Yeah, that sounds right.
[10:13:31] 1) is probably a use case for cergen, for 2) no idea what's best
[10:13:39] As for 2) ... yeah :)
[10:18:00] (need to run errand, bbiab :)
[10:20:56] +1'd and agreed on the next steps (now off to lunch)
[11:09:04] * elukey lunch
[12:33:59] ah right I forgot one thing for istio
[12:34:36] I had to add a specific rbac policy to make it work
[12:34:59] apiVersion: rbac.authorization.k8s.io/v1
[12:34:59] kind: RoleBinding
[12:34:59] metadata:
[12:34:59]   name: allow-restricted-psp
[12:34:59]   namespace: istio-system
[12:35:01] roleRef:
[12:35:04]   apiGroup: rbac.authorization.k8s.io
[12:35:06]   kind: ClusterRole
[12:35:09]   name: allow-restricted-psp
[12:35:11] subjects:
[12:35:14] - kind: ServiceAccount
[12:35:16]   name: istiod-service-account
[12:35:19]   namespace: istio-system
[12:35:21] and I have applied it manually
[12:36:19] So that still needs to be added to change 699380?
[12:36:20] I guess the same may be needed for knative
[12:36:45] it is a different one, we don't have any chart etc.. for istio
[12:36:52] Right
[12:37:06] Well, send any changes needing review my way :)
[12:37:42] this is a little weird since there is only one helmfile_rbac.yaml in deployment charts
[12:54:03] You mean for all of WMF?
[12:54:53] well for all the services defined up to now
[12:55:08] Mh. Do you think it could be usefully split up?
[12:55:16] basically https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Add_RBAC/PSPs
[12:55:35] it is surely good, we don't need what serviceops defines (like tiller etc..)
[12:55:41] but no idea how to split it
[13:04:09] anyway for the moment we can proceed with helm + manual settings if needed, no idea if restricted-psp will work for knative or if we'll need more
[13:04:25] there is also the ? about GlobalNetworkPolicies, it is "allow all" for the moment
[13:04:45] lovely
[13:33:36] Lift-Wing, ORES, artificial-intelligence, draftquality-modeling, Machine-Learning-Team (Active Tasks): Create a KFServing model server for draftquality models - https://phabricator.wikimedia.org/T286686 (kevinbazira) @ACraze thank you for uploading the model and sharing its storage uri. dra...
[13:34:39] for inference.wikimedia.org I am reading https://wikitech.wikimedia.org/wiki/Enable_TLS_for_Kubernetes_deployments#Create_and_place_certificates
[13:35:53] that in theory may be usable for knative-serving
[13:38:09] in deployment-charts there is a way to auto-configure an envoy sidecar IIUC
[13:38:46] but there is also some puppet workflow to deploy certs to /etc/helmfile-defaults/private on the deployment nodes
[13:44:35] Would that also imply auto-renewal?
[13:45:30] I think that by default there is a long expiry time, but no auto-renewal of any sort
[13:45:37] Lift-Wing, ORES, artificial-intelligence, draftquality-modeling, and 2 others: Create a KFServing model server for draftquality models - https://phabricator.wikimedia.org/T286686 (kevinbazira) a:kevinbazira
[13:45:59] the only way to make it happen I believe is to have cert-manager hooked up to cfssl
[13:46:02] or similar
[13:46:11] we'll have the same problem with the webhook certs
[13:46:33] istiod auto-creates one when bootstrapping if not set, no idea if it auto-renews, but I don't think so
[13:46:36] (another concern)
[13:59:00] Yay certificate management is still as messy as it was in ~1998
[13:59:12] Except these days, we don't have to mess around with the openssl cli
[14:30:31] any preference in how to proceed? Should we split some work or go in a different way?
[14:30:53] I feel that we are often blocked, not sure what to change to speed up time-to-delivery
[14:43:27] I think if we can sort the external TLS bits now and with sufficiently long cert lifetimes, we should do it now.
[14:44:38] yep, what I meant was how to split the work to be more efficient (if it makes sense)
[14:44:58] I can try and take a stab at the external cert using cergen
[14:49:06] ack +1
[14:49:26] all the info is in https://wikitech.wikimedia.org/wiki/Enable_TLS_for_Kubernetes_deployments#Create_and_place_certificates but I am not sure if the procedure fits what we have to do
[14:49:34] I mean, is knative a "service"?
[14:49:46] should we come up with some other puppet code?
[14:49:47] Sorta
[14:50:03] I'd say knative is a metaservice, like a proxy
[14:50:49] And since a cert would not be a delegating one (no trust delegation), I don't think we need to have separate Puppet code.
[14:51:32] we can reuse profile::kubernetes::deployment_server_secrets::services but it may be a bit of a stretch
[14:51:52] it is handy since it automagically deploys certs + key to the deployment node
[14:52:06] then the next bit is to source those configs into knative's config in helm
[14:52:26] Yeah, that bit is the scarier one :)
[14:53:02] there is a .tpl in the helm repository that all the services use, but it is not what we need
[14:53:22] in theory it should be as simple as creating a k8s Secret referencing the cert + key
[14:53:36] and use it in knative's config (net_istio.yaml)
[15:41:48] Machine-Learning-Team, SRE, serviceops, Kubernetes, Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (elukey) Open→Resolved a:elukey istio bootstrapped, everything worked nicely, thanks a lot to all that...
[15:57:55] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Install KFServing standalone - https://phabricator.wikimedia.org/T272919 (elukey)
[15:58:12] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Install Knative on ml-serve cluster - https://phabricator.wikimedia.org/T278194 (elukey) Stalled→Open We were finally able to deploy istio in prod, so this task can proceed! Next steps: 1) Work with service ops on https://gerrit.wikimedia.o...
[16:07:41] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Install Istio on ml-serve cluster - https://phabricator.wikimedia.org/T278192 (elukey) Things to do before closing: 1) Do we need to add a custom TLS certificate for istiod? If not added then istiod creates one, but it is not clear if it auto-renews...
[16:08:23] updated all the tasks with the current status, hope to not have missed something
[16:14:14] elukey: this all looks really good, sounds like things are going well :)
[16:15:29] i'm still debugging our blubberfile for the editquality model inference services, some of the dependencies are being stored in a separate directory and not in the python path
[16:16:30] accraze: we are slowly getting closer to something working :)
[16:16:45] lemme know if you need any brainstorm
[16:18:58] will do, just realized the python path issue yesterday evening, going to dig in today - hoping it's just a simple fix
[16:24:40] I think that sometimes we should have informal meetings about what we are doing, and discuss technical problems etc.. we are working on parallel things and a sync every now and then is useful in my opinion
[16:24:47] (more than our usual standup I mean)
[16:28:40] yeah agreed
[16:30:50] :)
[16:30:53] going afk! ttl!
[16:48:31] see ya elukey
[17:05:41] q:q
[17:15:31] Lift-Wing, ORES, artificial-intelligence, draftquality-modeling, Machine-Learning-Team (Active Tasks): Create a KFServing model server for draftquality models - https://phabricator.wikimedia.org/T286686 (ACraze)
[17:15:35] Lift-Wing, Machine-Learning-Team (Active Tasks): Prepare 4 ORES English models for Lift Wing - https://phabricator.wikimedia.org/T272874 (ACraze)
[17:16:43] Lift-Wing, ORES, artificial-intelligence, draftquality-modeling, Machine-Learning-Team (Active Tasks): Create a KFServing model server for draftquality models - https://phabricator.wikimedia.org/T286686 (ACraze) Great work on this @kevinbazira! Confirming that I was able to hit the enwiki-dra...
[17:17:55] Lift-Wing, Machine-Learning-Team (Active Tasks): Prepare 4 ORES English models for Lift Wing - https://phabricator.wikimedia.org/T272874 (ACraze)
[17:18:20] Lift-Wing, ORES, artificial-intelligence, draftquality-modeling, Machine-Learning-Team (Active Tasks): Create a KFServing model server for draftquality models - https://phabricator.wikimedia.org/T286686 (ACraze) Open→Resolved Marking this as resolved
[21:45:10] ah yep, looks like the issue with the blubberfiles was just related to the path, adding `"PYTHONPATH=/opt/lib/python/site-packages"` to the builder command fixed the error with nltk
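
Note on the PYTHONPATH fix in the last message: the sketch below is only an illustration of why setting that variable could resolve a failed import. It is not the actual blubberfile change. The module name `liftwing_demo_dep` and the temporary directory standing in for /opt/lib/python/site-packages are hypothetical; the real case involved nltk installed in a separate directory.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def can_import(module: str, pythonpath: Optional[str] = None) -> bool:
    """Return True if a fresh interpreter can import `module`."""
    env = dict(os.environ)
    # Start from a clean slate so an inherited PYTHONPATH does not skew the result.
    env.pop("PYTHONPATH", None)
    if pythonpath is not None:
        env["PYTHONPATH"] = pythonpath
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        env=env,
        capture_output=True,
    )
    return result.returncode == 0


def demo() -> Tuple[bool, bool]:
    """Install a dummy module outside the default path and try importing it."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for /opt/lib/python/site-packages.
        site_dir = Path(tmp) / "opt" / "lib" / "python" / "site-packages"
        site_dir.mkdir(parents=True)
        (site_dir / "liftwing_demo_dep.py").write_text("VALUE = 1\n")

        without_path = can_import("liftwing_demo_dep")
        with_path = can_import("liftwing_demo_dep", str(site_dir))
        return without_path, with_path


if __name__ == "__main__":
    print(demo())
```

The first import attempt fails because the directory is not on the interpreter's search path; the second succeeds once PYTHONPATH points at it, which matches the behaviour described in the chat.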