[06:27:58] Hello folks, I added a +o flag (IRC operator) to all of your nicks
[06:51:35] ok deopped people, but now you should be able to op by yourself
[06:54:14] all right, back to the latencies and cpu usage for ml masters that I was talking about yesterday
[06:54:35] the trend of slowly increasing cpu usage keeps happening, even without istio
[06:54:46] so at this point it is a kubelet problem
[07:31:48] it seems like a slow thread leak or something similar
[07:32:00] even if from the prometheus metrics I don't see a clear indication of that
[07:37:06] the kube-controller-manager restart (manual from me) seems to have removed some cpu usage
[07:37:57] in journalctl there were a lot of
[07:37:58] client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
[07:38:33] let's see during the next hours
[08:00:32] * elukey brb
[09:26:30] all right, this is the idea for the kfserving charts:
[09:26:31] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/700470
[09:26:35] basically:
[09:26:50] - one chart for the kfserving base config (manager pod, CRDs, etc.)
[09:27:02] - one chart for the InferenceService resources
[09:27:20] it is missing the TLS bits for the webhook
[09:27:25] but it works on minikube
[09:27:57] (helm lint 3.x locally doesn't complain but the CI one does, that's why it returns -1, will try to fix it)
[09:36:59] for knative-serving + ingress TLS it seems sufficient to have one of the Istio Gateway configs using port 443 + HTTPS settings + referring to an istio secret
[09:37:07] the last bit will be more difficult to add
[09:37:21] so it will require a follow-up patch, we can test port 80 atm
[09:58:51] * elukey lunch
[12:42:49] going afk for a bit!
[15:05:20] ah lovely, we don't collect etcd metrics
[15:05:42] and in codfw I see
[15:05:44] Jul 22 14:56:30 ml-etcd2001 etcd[387]: read-only range request "key:\"/registry/namespaces/default\" " with result "range_response_count:1 size:174" took too long (528.874905ms) to execute
[15:22:14] ok, added it via puppet, we should get metrics soon-ish
[15:27:55] yep, metrics are starting to flow
[15:28:00] https://grafana-rw.wikimedia.org/d/Ku6V7QYGz/jayme-etcd3?orgId=1&var-site=eqiad&var-cluster=ml_etcd&var-instance_prefix=ml-etcd
[15:52:02] ^ that's awesome
[16:02:00] accraze: o/ for some reason we have some weird latency patterns and cpu usage on our k8s masters, it all started when we deployed the kubelets on them
[16:02:07] very weird
[16:02:23] all k8s magic
[16:02:23] :D
[16:02:52] accraze: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/700470/9/charts/kubeflow-kfserving-inference/templates/services.yaml is the idea that I was talking about
[16:03:32] I had to create a separate chart only for it since I got errors while testing, but it may not be needed
[16:03:49] but the idea is the same: having a template + configs in the helmfile yaml configs
[16:04:01] like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/700470/9/charts/kubeflow-kfserving-inference/README
[16:04:25] lemme know your thoughts and if it looks good/bad
[16:05:50] I like this approach a lot
[16:07:05] and we can change it or tweak it as we go
[17:06:55] * elukey afk!
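
The "template + configs in the helmfile yaml" idea from [09:26:50], [09:27:02], and [16:03:49] could look roughly like the sketch below: the inference chart ships one Helm template that stamps out a KFServing InferenceService per entry in the values file. This is only an illustration of the approach; the values keys, namespace, and image are hypothetical placeholders, not the contents of the actual change in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/700470.

```yaml
# charts/kubeflow-kfserving-inference/templates/services.yaml (hypothetical sketch)
# Renders one InferenceService per entry under .Values.inference_services,
# so adding a new model service only needs a values change, not a chart change.
{{- range $name, $svc := .Values.inference_services }}
---
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: {{ $name }}
  namespace: {{ $svc.namespace | default "kfserving-inference" }}
spec:
  predictor:
    containers:
      - name: kfserving-container
        image: {{ $svc.image }}
        env:
          {{- range $key, $value := $svc.env }}
          - name: {{ $key }}
            value: {{ $value | quote }}
          {{- end }}
{{- end }}
```

A matching (equally hypothetical) helmfile values snippet:

```yaml
inference_services:
  enwiki-goodfaith:
    namespace: kfserving-inference
    image: example-registry/example-model-server:latest
    env:
      MODEL_NAME: enwiki-goodfaith
```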
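
For the knative-serving + ingress TLS point at [09:36:59] (an Istio Gateway server on port 443 with HTTPS settings referencing an istio secret), a minimal sketch of such a Gateway is below. The gateway name, hosts, and secret name are placeholders, not the actual configuration; as noted at [09:37:21], provisioning the secret is the harder part, so testing on port 80 first is the pragmatic path.

```yaml
# Hypothetical Istio Gateway sketch: HTTPS on 443 backed by a TLS secret,
# plus plain HTTP on 80 for initial testing.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: knative-ingress-gateway
  namespace: knative-serving
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      hosts:
        - "*"
      tls:
        mode: SIMPLE
        # credentialName points at a kubernetes.io/tls secret that must live
        # in the namespace of the istio ingress gateway pods.
        credentialName: knative-ingress-tls
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"
```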