[07:44:41] Successfully published image docker-registry.discovery.wmnet/kubeflow-kfserving-storage-initializer:0.6.0-1
[07:44:44] \o/
[07:52:49] Lift-Wing, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[08:21:31] all right so the chart has been updated/deployed to use our storage-initializer
[08:22:03] let's see if I can spin up enwiki-goodfaith
[08:37:18] aaand
[08:37:20] "failed to resolve image to digest: Get "https://docker-registry.wikimedia.org/v2/": x509: certificate signed by unknown authority"
[08:37:43] this is something that I hacked around on minikube as well, I was hoping it didn't re-happen, it is knative-related
[08:37:48] will try to see how to fix it
[08:47:44] ahh I think https://knative.dev/docs/developer/serving/tag-resolution/
[08:57:25] mmm it is weird though that this happens with https://docker-registry.wikimedia.org/v2/
[08:57:41] when I was testing locally it was expected
[08:57:48] but now mmmmm
[09:00:25] in our case, this https://knative.dev/docs/developer/serving/tag-resolution/#custom-certificates doesn't hold, we are not using a self-signed cert
[09:37:35] should be https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/711111 (credits to Janis)
[10:42:20] * elukey lunch
[12:52:14] new problem:
[12:52:15] FailedCreate replicaset/enwiki-goodfaith-predictor-default-knzht-deployment-d985c664b Error creating: Internal error occurred: failed calling webhook "inferenceservice.kfserving-webhook-server.pod-mutator": Post https://kfserving-webhook-server-service.kfserving-system.svc:443/mutate-pods?timeout=30s: EO
[12:52:20] sigh
[12:53:57] ah the kfserving controller is showing a ton of go stacktraces
[12:54:20] http: panic serving 10.64.16.202:35044: Unable to unmarshall logger json string due to invalid character 'd' looking for beginning of value
[12:58:11] ah lovely a missed quoting in kfserving's config
[12:58:15] * elukey sends code review
[13:15:10] seems still happening
[13:17:26] people complain about java stack traces, but go ones are cryptic as well
[14:01:51] ok go stacktraces are gone, but pods still don't come up for InferenceService
[14:02:08] still see
[14:02:08] Error creating: Internal error occurred: failed calling webhook "inferenceservice.kfserving-webhook-server.pod-mutator": Post https://kfserving-webhook-server-service.kfserving-system.svc:443/mutate-pods?timeout=30s: EOF
[14:04:28] ah!
[14:04:34] in the kfserving controller I see another
[14:04:35] http: panic serving 10.64.16.202:47680: Unable to unmarshall agent json string due to invalid character '"' after object key:value pair
[14:04:39] but this is new
[14:04:55] so the config of InferenceService is not ok
[14:49:41] ok pods are up!
[14:49:42] Could not connect to the endpoint URL: "https://https://thanos-swift.discovery.wmnet/wmf-ml-models?prefix=goodfaith%2Fenwiki%2F202105140814%2F&encoding-type=url"
[14:49:45] ahahha
[14:49:49] fixing
[14:52:36] all right the storage-initializer seems to be working now, probably downloading the model
[14:53:49] mmmm no I get the double https again
[14:56:55] aaand of course
[14:56:56] botocore.exceptions.SSLError: SSL validation failed for https://thanos-swift.discovery.wmnet/wmf-ml-models?prefix=goodfaith%2Fenwiki%2F202105140814%2F&encoding-type=url [SSL: CERTIFICATE_VERIFY_FAILED]
[14:58:20] we probably need ca-certificates on it as well
[16:47:11] o/
[16:47:36] elukey: is the InferenceService config messed up?
[16:48:13] accraze: hello! No no I have solved it, it was related to some errors in the charts
[16:48:21] ahhh ok cool :)
[16:48:28] the only thing that I had to fix in the InferenceService config was the thanos URL
[16:48:33] stripping https://
[16:48:42] ohhhh interesting
[16:48:55] but now I have an issue with the storage initializer, boto of course doesn't trust the certificate of thanos
[16:49:09] ughh lol
[16:49:23] I hoped for an environment variable to override (I included the wmf certs in the docker image)
[16:49:27] but I don't find it
[16:49:58] the only thing that I can find is https://github.com/boto/boto/blob/91ba037e54ef521c379263b0ac769c66182527d7/boto/connection.py#L495
[16:53:30] hmm weird yeah i just took a quick look at the repo and didn't see any other env vars
[16:53:44] i feel like there has to be a way to override though
[16:55:46] do we need to create some sort of ec2-style credentials instead of the certs?
[17:02:06] no idea :(
[17:02:59] we are also spamming logstash with malformed records
[17:03:13] so this may require a follow up with upstream
[17:05:43] https://github.com/kubeflow/kfserving/blob/release-0.6/python/kfserving/kfserving/storage.py#L97-L102
[17:07:32] so boto has a ca_certificates_file thing that can be added to a .boto config
[17:22:04] so I found AWS_CA_BUNDLE
[17:22:55] it is in the docs but can't find it in the core
[17:23:14] but it should be, in theory, passed to the storage initializer
[17:23:45] that in our kfserving yaml config is listed inside a config-map, so no environment variable or container specs
[17:25:18] I have to go in a bit, will restart tomorrow
[17:26:38] niiice yeah that sounds promising
[17:27:15] accraze: last blockers hopefully!!
[17:27:23] sooooo close!
[17:27:26] I used the InferenceService spec that you sent me
[17:27:41] via kubectl apply -f (plus some pod security policies etc..)
[17:27:52] I created a test namespace, and a test service account
[17:28:02] created a secret and allowed only the service account to read it
[17:28:13] like indicated from upstream
[17:37:26] Machine-Learning-Team, SRE Observability: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (colewhite)
[17:49:52] * elukey afk!
[22:49:16] :w
[23:01:33] Lift-Wing, artificial-intelligence, draftquality-modeling, Machine-Learning-Team (Active Tasks): Configure draftquality deployment pipeline - https://phabricator.wikimedia.org/T287787 (ACraze) a:ACraze
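
Editor's sketch of the two storage-initializer problems discussed above: the doubled `https://` in the Thanos endpoint and the untrusted certificate. It is only an illustration under stated assumptions, not how kfserving or the chart does it. The helper names `strip_scheme` and `endpoint_url` and the CA bundle path are hypothetical. `AWS_CA_BUNDLE` is the variable named in the log, and whether it actually reaches the storage-initializer container is still open at the end of this log.

```python
import os
import re

# Assumed Debian-style bundle location; the real path inside the
# storage-initializer image may differ.
CA_BUNDLE_PATH = "/etc/ssl/certs/ca-certificates.crt"


def strip_scheme(endpoint: str) -> str:
    """Drop any leading http:// or https:// prefixes, even repeated ones."""
    return re.sub(r"^(?:https?://)+", "", endpoint.strip(), flags=re.IGNORECASE)


def endpoint_url(endpoint: str) -> str:
    """Return the endpoint with exactly one https:// prefix."""
    return "https://" + strip_scheme(endpoint)


if __name__ == "__main__":
    # Point botocore-based clients at a CA bundle that includes the internal CA
    # (assumption: the variable is visible to the process doing the download).
    os.environ.setdefault("AWS_CA_BUNDLE", CA_BUNDLE_PATH)
    print(endpoint_url("https://https://thanos-swift.discovery.wmnet"))
```

Normalizing the value once means it no longer matters whether the configured endpoint already carries a scheme before something else prepends `https://`.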