[07:10:26] good morning
[07:10:55] there was an alarm at around 2:30 UTC related to calico on ml-serve1001, its bgp session with the routers broke for some mins
[07:10:58] https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&from=now-12h&to=now&var-datasource=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=kube-system&var-pod=calico-node-v7p9t&var-container=All
[07:11:04] and then another pod was re-created
[07:11:20] https://logstash.wikimedia.org/app/dashboards#/view/f6a5b090-0020-11ec-81e9-e1226573bad4?_g=h@ac19d25&_a=h@4aad973 contains the last logs but I don't see anything weird
[07:13:13] Nov 12 02:17:15 ml-serve1001 kubelet[31935]: I1112 02:17:15.207931 31935 eviction_manager.go:346] eviction manager: must evict pod(s) to reclaim ephemeral-storage
[07:13:16] ooof
[07:59:48] it seems that the knative webhook is spamming like crazy
[08:04:43] because it is being called like crazy
[08:04:53] these are surely the network policies, but not sure why
[09:08:10] artificial-intelligence, All-and-every-Wikisource, Epic: Epic: Generalized OCR for Wikisource - https://phabricator.wikimedia.org/T161978 (TheDJ) @SamWilson this was the ticket listed on the community wishlist. What do we consider the status of this now?
[10:53:10] filed two changes to improve network policies :)
[10:53:18] but the spam should be finished
[10:53:57] the autoscaler pod was contacting the kube api (which in turn was contacting the webhook, which in turn logged its requests)
[10:54:39] the autoscaler needs to pull prometheus metrics from all inference service pods, to decide what to do etc..
[10:54:47] it was blocked so it kept trying over and over
[11:17:56] ok, now it remains to fix the network policy between istio and the knative activator
[11:35:44] * elukey lunch!
[16:01:01] o/
[16:03:13] o/
[16:57:30] today's work was https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/738360, but now we have a good set of knative policies
[16:57:39] some bits are still missing but it is looking good
[16:57:49] just deployed them, will keep watching some metrics
[16:59:51] of course now that I said it some pods are not working
[17:44:06] Machine-Learning-Team, artificial-intelligence, Wikilabels, articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (Halfak) Great! I think once we settle this, the next steps will be obvious and (hopefully) will require less investmen...
[18:09:41] ok fixed :)
[18:09:57] the webhook needs to contact the k8s api too (not only get traffic from it)
[18:10:04] nice!
[18:10:05] I can get scores, all good
[18:10:35] next week I'll work with Janis on the istio network policies
[18:10:50] since we don't have a chart for it, it will be a little more difficult
[18:13:35] going afk for the weekend!
[18:13:42] Have a good rest of the day and weekend folks :)
[18:14:21] see ya elukey!
[22:12:09] hmmm weird, still running into issues with the kserve-controller-manager crashlooping on the new ml-sandbox
[22:13:14] also just noticed that I did not use our istio images.... do we have an image w/ istioctl pre-built somewhere?
[22:15:09] our knative images seem to be running fine so far :D
[23:00:44] aha! it looks like the kserve config yaml was calling the `/manager` command, whereas our kserve image moved the command to `/usr/bin/manager`
[23:58:29] oh of course, I forgot about the knative + istio cluster-local-gateway (seems it's been renamed in the newer versions to knative-local-gateway?)
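
Note on the 02:17 eviction message above: the kubelet evicts pods to reclaim ephemeral storage when node filesystem usage crosses its eviction thresholds. A minimal KubeletConfiguration sketch of the relevant knobs follows; the values are the upstream defaults shown only for illustration, not what ml-serve1001 actually runs.

# Sketch of the kubelet eviction settings behind "must evict pod(s) to
# reclaim ephemeral-storage". Values are upstream defaults, assumed here
# for illustration; the real ml-serve kubelet config may differ.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"      # node root filesystem -> ephemeral-storage pressure
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"     # image / writable-layer filesystem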
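
The 10:53 fix (the autoscaler blocked from the kube API and from scraping inference pods) would roughly correspond to an egress NetworkPolicy like the sketch below. The label selectors, the API server CIDR/port and the queue-proxy metrics port 9090 are assumptions based on Knative defaults, not the content of the actual deployment-charts change 738360.

# Hypothetical egress policy for the knative autoscaler; selectors, CIDR and
# ports are assumptions, not the real Gerrit change.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: autoscaler-egress
  namespace: knative-serving
spec:
  podSelector:
    matchLabels:
      app: autoscaler
  policyTypes:
    - Egress
  egress:
    # to the kube-apiserver (which in turn calls the knative webhook)
    - to:
        - ipBlock:
            cidr: 10.64.0.0/12        # placeholder for the API server range
      ports:
        - protocol: TCP
          port: 443                   # or 6443, depending on the cluster
    # to the queue-proxy metrics endpoint of the inference service pods
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 9090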
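
The remaining istio <-> knative activator policy mentioned at 11:17 would look roughly like an ingress rule on the activator allowing traffic from the istio gateway namespace. Namespace/label selectors and the activator ports 8012/8013 are assumptions from Knative defaults.

# Hypothetical ingress policy letting the istio local gateway reach the
# knative activator; selectors and ports are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: activator-ingress-from-istio
  namespace: knative-serving
spec:
  podSelector:
    matchLabels:
      app: activator
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
      ports:
        - protocol: TCP
          port: 8012   # HTTP/1.1
        - protocol: TCP
          port: 8013   # h2c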
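
On the 23:00 crashloop: the upstream kserve manifest starts the controller with /manager, while the locally built image ships the binary at /usr/bin/manager, so the container command has to be overridden. A sketch of the relevant fragment of the pod spec (image name/tag is a placeholder, everything else elided):

# Only the command path matters here; the image reference is hypothetical.
containers:
  - name: manager
    image: docker-registry.wikimedia.org/kserve-controller:placeholder
    command:
      - /usr/bin/manager    # upstream config yaml calls /manager instead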
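
On the 23:58 note: newer net-istio releases rename the cluster-local gateway to knative-local-gateway, and it is wired up via the config-istio ConfigMap in knative-serving. A sketch of the two key styles, with gateway and service names assumed to be the upstream defaults rather than taken from the ml-sandbox setup:

# Sketch of the config-istio keys for the local gateway; names are assumed
# upstream defaults, and the two keys belong to different release lines.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-istio
  namespace: knative-serving
data:
  # older releases
  local-gateway.knative-serving.cluster-local-gateway: "cluster-local-gateway.istio-system.svc.cluster.local"
  # newer releases (renamed)
  local-gateway.knative-serving.knative-local-gateway: "knative-local-gateway.istio-system.svc.cluster.local"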