[07:10:26] good morning
[07:10:55] there was an alarm at around 2:30 UTC related to calico on ml-serve1001, its bgp session with the routers broke for some mins
[07:10:58] https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&from=now-12h&to=now&var-datasource=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=kube-system&var-pod=calico-node-v7p9t&var-container=All
[07:11:04] and then another pod was re-created
[07:11:20] https://logstash.wikimedia.org/app/dashboards#/view/f6a5b090-0020-11ec-81e9-e1226573bad4?_g=h@ac19d25&_a=h@4aad973 contains the last logs but I don't see anything weird
[07:13:13] Nov 12 02:17:15 ml-serve1001 kubelet[31935]: I1112 02:17:15.207931 31935 eviction_manager.go:346] eviction manager: must evict pod(s) to reclaim ephemeral-storage
[07:13:16] ooof
[07:59:48] it seems that the knative webhook is spamming like crazy
[08:04:43] because it is being called like crazy
[08:04:53] these are surely the network policies, but not sure why
[09:08:10] artificial-intelligence, All-and-every-Wikisource, Epic: Epic: Generalized OCR for Wikisource - https://phabricator.wikimedia.org/T161978 (TheDJ) @SamWilson this was the ticket listed on the community wishlist. What do we consider the status of this now?
[10:53:10] filed two changes to improve network policies :)
[10:53:18] but the spam should be finished
[10:53:57] the autoscaler pod was contacting the kube api (which in turn was contacting the webhook, which in turn logged its requests)
[10:54:39] the autoscaler needs to pull prometheus metrics from all inference service pods, to decide what to do etc..
[10:54:47] it was blocked so it kept trying over and over
[11:17:56] ok, now it remains to fix the network policy between istio and the knative activator
[11:35:44] * elukey lunch!
[16:01:01] o/
[16:03:13] o/
[16:57:30] today's work was https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/738360, but now we have a good set of knative policies
[16:57:39] some bits are still missing but it is looking good
[16:57:49] just deployed them, will keep watching some metrics
[16:59:51] of course now that I said it some pods are not working
[17:44:06] Machine-Learning-Team, artificial-intelligence, Wikilabels, articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (Halfak) Great! I think once we settle this, the next steps will be obvious and (hopefully) will require less investmen...
[18:09:41] ok fixed :)
[18:09:57] the webhook needs to contact the k8s api too (not only get traffic from it)
[18:10:04] nice!
[18:10:05] I can get scores, all good
[18:10:35] next week I'll work with Janis on the istio network policies
[18:10:50] since we don't have a chart for it, it will be a little more difficult
[18:13:35] going afk for the weekend!
[18:13:42] Have a good rest of the day and weekend folks :)
[18:14:21] see ya elukey!
[22:12:09] hmmm weird, still running into issues with the kserve-controller-manager crashlooping on the new ml-sandbox
[22:13:14] also just noticed that I did not use our istio images.... do we have an image w/ istioctl pre-built somewhere?
[22:15:09] our knative images seem to be running fine so far :D
[23:00:44] aha! it looks like the kserve config yaml was calling the `/manager` command, whereas our kserve image moved the command to `/usr/bin/manager`
[23:58:29] oh of course, I forgot about the knative + istio cluster-local-gateway (seems it's been renamed in the newer versions to knative-local-gateway?)
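
Note on the 02:17 eviction message above: the kubelet evicts pods to reclaim ephemeral storage when node filesystem usage crosses its eviction thresholds. A minimal KubeletConfiguration sketch of the relevant knobs follows; the values are the upstream defaults shown only for illustration, not what ml-serve1001 actually runs.

# Sketch of the kubelet eviction settings behind "must evict pod(s) to
# reclaim ephemeral-storage". Values are upstream defaults, assumed here
# for illustration; the real ml-serve kubelet config may differ.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"      # node root filesystem -> ephemeral-storage pressure
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"     # image / writable-layer filesystem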
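
The 10:53 fix (the autoscaler blocked from the kube API and from scraping inference pods) would roughly correspond to an egress NetworkPolicy like the sketch below. The label selectors, the API server CIDR/port and the queue-proxy metrics port 9090 are assumptions based on Knative defaults, not the content of the actual deployment-charts change 738360.

# Hypothetical egress policy for the knative autoscaler; selectors, CIDR and
# ports are assumptions, not the real Gerrit change.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: autoscaler-egress
  namespace: knative-serving
spec:
  podSelector:
    matchLabels:
      app: autoscaler
  policyTypes:
    - Egress
  egress:
    # to the kube-apiserver (which in turn calls the knative webhook)
    - to:
        - ipBlock:
            cidr: 10.64.0.0/12        # placeholder for the API server range
      ports:
        - protocol: TCP
          port: 443                   # or 6443, depending on the cluster
    # to the queue-proxy metrics endpoint of the inference service pods
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 9090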
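
The remaining istio <-> knative activator policy mentioned at 11:17 would look roughly like an ingress rule on the activator allowing traffic from the istio gateway namespace. Namespace/label selectors and the activator ports 8012/8013 are assumptions from Knative defaults.

# Hypothetical ingress policy letting the istio local gateway reach the
# knative activator; selectors and ports are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: activator-ingress-from-istio
  namespace: knative-serving
spec:
  podSelector:
    matchLabels:
      app: activator
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
      ports:
        - protocol: TCP
          port: 8012   # HTTP/1.1
        - protocol: TCP
          port: 8013   # h2c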
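
On the 23:00 crashloop: the upstream kserve manifest starts the controller with /manager, while the locally built image ships the binary at /usr/bin/manager, so the container command has to be overridden. A sketch of the relevant fragment of the pod spec (image name/tag is a placeholder, everything else elided):

# Only the command path matters here; the image reference is hypothetical.
containers:
  - name: manager
    image: docker-registry.wikimedia.org/kserve-controller:placeholder
    command:
      - /usr/bin/manager    # upstream config yaml calls /manager instead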
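
On the 23:58 note: newer net-istio releases rename the cluster-local gateway to knative-local-gateway, and it is wired up via the config-istio ConfigMap in knative-serving. A sketch of the two key styles, with gateway and service names assumed to be the upstream defaults rather than taken from the ml-sandbox setup:

# Sketch of the config-istio keys for the local gateway; names are assumed
# upstream defaults, and the two keys belong to different release lines.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-istio
  namespace: knative-serving
data:
  # older releases
  local-gateway.knative-serving.cluster-local-gateway: "cluster-local-gateway.istio-system.svc.cluster.local"
  # newer releases (renamed)
  local-gateway.knative-serving.knative-local-gateway: "knative-local-gateway.istio-system.svc.cluster.local"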