[06:27:58] Hello folks, I added a +o flag (IRC operator) to all of your nicks
[06:51:35] ok deopped people, but now you should be able to op by yourself
[06:54:14] all right, back to the latencies and cpu usage for ml masters that I was talking about yesterday
[06:54:35] the trend of slowly increasing cpu usage keeps happening, even without istio
[06:54:46] so at this point it is a kubelet problem
[07:31:48] it seems like a slow thread leak or something similar
[07:32:00] even if from the prometheus metrics I don't see a clear indication of that
[07:37:06] the kube-controller-manager restart (manual from me) seems to have removed some cpu usage
[07:37:57] in journalctl there were a lot of
[07:37:58] client-go/metadata/metadatainformer/informer.go:89: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
[07:38:33] let's see during the next hours
[08:00:32] * elukey brb
[09:26:30] all right, this is the idea for the kfserving charts:
[09:26:31] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/700470
[09:26:35] basically:
[09:26:50] - one chart for the kfserving base config (manager pod, CRDs, etc.)
[09:27:02] - one chart for the InferenceService resources
[09:27:20] it is missing the TLS bits for the webhook
[09:27:25] but it works on minikube
[09:27:57] (helm lint 3.x locally doesn't complain but the CI one does, that's why it returns -1, will try to fix it)
[09:36:59] for knative-serving + ingress TLS it seems sufficient to have one of the Istio Gateway configs using port 443 + HTTPS settings + referring to an istio secret
[09:37:07] the last bit will be more difficult to add
[09:37:21] so it will require a follow-up patch, we can test port 80 atm
[09:58:51] * elukey lunch
[12:42:49] going afk for a bit!
[15:05:20] ah lovely, we don't collect etcd metrics
[15:05:42] and in codfw I see
[15:05:44] Jul 22 14:56:30 ml-etcd2001 etcd[387]: read-only range request "key:\"/registry/namespaces/default\" " with result "range_response_count:1 size:174" took too long (528.874905ms) to execute
[15:22:14] ok, added it via puppet, we should get metrics soon-ish
[15:27:55] yep, metrics are starting to flow
[15:28:00] https://grafana-rw.wikimedia.org/d/Ku6V7QYGz/jayme-etcd3?orgId=1&var-site=eqiad&var-cluster=ml_etcd&var-instance_prefix=ml-etcd
[15:52:02] ^ that's awesome
[16:02:00] accraze: o/ for some reason we have some weird latency patterns and cpu usage on our k8s masters, it all started when we deployed the kubelets on them
[16:02:07] very weird
[16:02:23] all k8s magic
[16:02:23] :D
[16:02:52] accraze: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/700470/9/charts/kubeflow-kfserving-inference/templates/services.yaml is the idea that I was talking about
[16:03:32] I had to create a separate chart only for it since I got errors while testing, but it may not be needed
[16:03:49] but the idea is the same: having a template + configs in the helmfile yaml configs
[16:04:01] like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/700470/9/charts/kubeflow-kfserving-inference/README
[16:04:25] lemme know your thoughts and if it looks good/bad
[16:05:50] I like this approach a lot
[16:07:05] and we can change it or tweak it as we go
[17:06:55] * elukey afk!
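
The "template + configs in the helmfile yaml" idea from [09:26:50], [09:27:02], and [16:03:49] could look roughly like the sketch below: the inference chart ships one Helm template that stamps out a KFServing InferenceService per entry in the values file. This is only an illustration of the approach; the values keys, namespace, and image are hypothetical placeholders, not the contents of the actual change in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/700470.

```yaml
# charts/kubeflow-kfserving-inference/templates/services.yaml (hypothetical sketch)
# Renders one InferenceService per entry under .Values.inference_services,
# so adding a new model service only needs a values change, not a chart change.
{{- range $name, $svc := .Values.inference_services }}
---
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: {{ $name }}
  namespace: {{ $svc.namespace | default "kfserving-inference" }}
spec:
  predictor:
    containers:
      - name: kfserving-container
        image: {{ $svc.image }}
        env:
          {{- range $key, $value := $svc.env }}
          - name: {{ $key }}
            value: {{ $value | quote }}
          {{- end }}
{{- end }}
```

A matching (equally hypothetical) helmfile values snippet:

```yaml
inference_services:
  enwiki-goodfaith:
    namespace: kfserving-inference
    image: example-registry/example-model-server:latest
    env:
      MODEL_NAME: enwiki-goodfaith
```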
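
For the knative-serving + ingress TLS point at [09:36:59] (an Istio Gateway server on port 443 with HTTPS settings referencing an istio secret), a minimal sketch of such a Gateway is below. The gateway name, hosts, and secret name are placeholders, not the actual configuration; as noted at [09:37:21], provisioning the secret is the harder part, so testing on port 80 first is the pragmatic path.

```yaml
# Hypothetical Istio Gateway sketch: HTTPS on 443 backed by a TLS secret,
# plus plain HTTP on 80 for initial testing.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: knative-ingress-gateway
  namespace: knative-serving
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      hosts:
        - "*"
      tls:
        mode: SIMPLE
        # credentialName points at a kubernetes.io/tls secret that must live
        # in the namespace of the istio ingress gateway pods.
        credentialName: knative-ingress-tls
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"
```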