[07:32:33] good morning!
[07:32:35] https://grafana.wikimedia.org/d/000000435/kubernetes-api?viewPanel=28&orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&from=now-7d&to=now
[07:32:52] for some reason we have some strange latencies on the cluster
[07:32:56] that fired some alarms
[07:33:14] https://grafana.wikimedia.org/d/000000435/kubernetes-api?viewPanel=27&orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&from=now-7d&to=now
[10:12:37] * elukey lunch!
[12:06:18] back!
[12:06:41] I am trying to roll restart etcd to see what effect it has on latencies
[12:07:17] the vm specs are really tiny, so it may also be that our k8s cluster uses more etcd and we need to bump specs a little
[12:08:08] even if load/cpu/etc.. are not really showing up much
[12:09:05] ok latencies after the restart are much worse
[12:09:07] interesting
[12:11:17] and the overall latencies for our k8s api went up as well, so they are definitely related
[12:14:47] there are a ton of sockets opened between ml-ctrl and ml-etcd
[12:15:44] more or less the same on other clusters
[12:17:28] the alarm for latencies is based on the main k8s cluster, that is way faster in etcd latencies
[12:17:45] we have various ops around ~50ms, main cluster ~10ms
[12:18:02] and we have only istio deployed
[12:19:41] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=3&orgId=1&var-server=ml-serve-ctrl1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ml_serve&from=now-14d&to=now
[12:19:51] ah there you go, cpu usage follow the same latency pattern
[12:20:03] and I believe it started when I deployed kubelets on master nodes
[12:20:25] yeah load is increased etc..
[12:23:22] ok bumping the vcpus from 2 to 4
[12:39:29] ahahah latencies down to acceptable levels now
[14:12:27] as FYI SRE is evaluating Istio as ingress solution in https://phabricator.wikimedia.org/T287007
[16:15:55] ^ very cool
[16:24:04] morning!
[16:24:32] I think I found a way to render the InferenceService resources in the chart, but at the moment testing to minikube fails due to some yaml horror
[16:26:38] nice hope it works out! i had some yaml horror trying to debug the revscoring blubberfile
[16:26:45] ahahahha
[16:26:50] yaml engineers!
[16:31:06] currently looking into how to setup a project pipeline per directory to test + publish the editquality image, not sure if any other wmf projects do this
[16:35:57] ah I think I can configure each monorepo directory as a project via integrations/config/zuul/layouts.yaml
[17:05:49] for example
[17:05:51] Error: unable to build kubernetes objects from release manifest: error validating "": error validating data: [ValidationError(ConfigMap.data.explainers): invalid type for io.k8s.api.core.v1.ConfigMap.data: got "map", expected "string", ValidationError(ConfigMap.data.predictors): invalid type for io.k8s.api.core.v1.ConfigMap.data: got "map", expected "string"]
[17:05:59] how is one sane person supposed to debug this?
[17:06:26] hahaha
[17:53:03] * elukey afk!!
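Note on the validation error quoted at 17:05:51: every value under ConfigMap.data has to be a string, and the message says that the explainers and predictors keys were rendered as nested maps instead. One common way out is to serialize each nested value into a single string before it lands under data. The snippet below is only a minimal sketch of that idea in Python, not the fix that was applied to the chart; the function name stringify_configmap_data and the sample image names are hypothetical, and using JSON as the string encoding is an assumption.

```python
import json


def stringify_configmap_data(data):
    """Return a copy of `data` in which every non-string value is JSON-encoded.

    ConfigMap.data only accepts string values, so nested maps are
    serialized into one JSON string each. (Hypothetical helper.)
    """
    out = {}
    for key, value in data.items():
        if isinstance(value, str):
            out[key] = value  # already valid, keep as-is
        else:
            out[key] = json.dumps(value, sort_keys=True)
    return out


# Hypothetical sample values, shaped like the keys named in the error.
raw = {
    "predictors": {"sklearn": {"image": "example/sklearnserver"}},
    "explainers": {"alibi": {"image": "example/alibi-explainer"}},
    "note": "already a string",
}

fixed = stringify_configmap_data(raw)

# Every value is now a string, which is what the validator expects.
assert all(isinstance(v, str) for v in fixed.values())
```

The nested structure is still recoverable by the consumer with json.loads, so nothing is lost by flattening each value to a string.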
[18:09:51] Marshall in the product just had the smart idea that I should just make a proof of concept model card manually first and get some feedback from folks before going down the pywikibot route
[18:17:10] ^ that's a good idea
[21:39:50] Lift-Wing, Machine-Learning-Team (Active Tasks): Inference Clients - https://phabricator.wikimedia.org/T287051 (ACraze)
[22:21:57] Lift-Wing, Machine-Learning-Team (Active Tasks): Configure tox tests for inference service pipelines - https://phabricator.wikimedia.org/T287053 (ACraze)
[22:51:37] Lift-Wing, Machine-Learning-Team: Deploy Outlinks topic model to production - https://phabricator.wikimedia.org/T287056 (ACraze)
[22:53:36] Lift-Wing, Machine-Learning-Team: Deploy Outlinks topic model to production - https://phabricator.wikimedia.org/T287056 (ACraze)
[22:53:39] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Load outlinks topic model in to KFServing - https://phabricator.wikimedia.org/T276862 (ACraze)
[22:59:10] Lift-Wing, Machine-Learning-Team: Deploy Outlinks topic model to production - https://phabricator.wikimedia.org/T287056 (ACraze)
[22:59:24] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Load outlinks topic model in to KFServing - https://phabricator.wikimedia.org/T276862 (ACraze) Open→Resolved Hey all, quick update here -- confirming that the outlinks topic model seems to be stable and performant when run as...
[23:02:03] Lift-Wing, Machine-Learning-Team: Deploy Outlinks topic model to production - https://phabricator.wikimedia.org/T287056 (ACraze) Open→Stalled Marking this as 'Stalled' for now as we are blocked until all dependencies are fully installed on #lift-wing (for more info see: {T272919})