[07:01:58] good morning :)
[08:12:03] ok so https://github.com/istio/istio/pull/13982 is something that confused me a lot, didn't really think about it
[08:12:47] Morning!
[08:12:55] 82 files changed -.-
[08:13:48] that should be already in our version in theory, from 1.2 onward
[08:13:48] ah, many of them are just the same edit in injected files
[08:13:54] and we have 1.9.5
[08:14:25] but now it is a mesh config, that doesn't really work
[08:14:57] https://github.com/istio/istio/blob/1aca7a67afd7b3e1d24fafb2fbfbeaf1e41534c0/pkg/config/mesh/mesh.go#L105-L109
[08:15:53] So by that comment, it should be 60s, but we still see 5s?
[08:17:09] I don't get whether there is a way to raise the default 5s ttl value
[08:17:26] and what istio suggests as the way to go
[08:19:00] Do we have DNS proxying on?
[08:19:35] like https://github.com/istio/istio/issues/37066
[08:19:38] That only applies to "application" DNS, not service entries
[08:19:49] tried it but doesn't really work, it has to respect the 5s ttl as well
[08:20:07] what applies to application DNS?
[08:20:19] proxying
[08:20:26] ah okok
[08:22:01] I'm also still confused as to where the 5s TTL actually comes from. AIUI, it is set on the DNS server side (in Bind and the like, the zone file), but it's not clear to me where that setting resides, if anywhere
[08:22:26] These are dynamic records, so there is no zone file, obviously
[08:23:23] so this is svc discovery for envoy
[08:23:24] https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/service_discovery
[08:23:58] I am wondering if it is knative that sets them now
[08:24:09] I thought it was envoy but maybe not after reading the above
[08:24:14] It would be the logical place
[08:24:47] but no in theory knative uses istio to manage those endpoints
[08:29:02] just to backtrack a second
[08:29:08] the worst offender is this one
[08:29:09] cluster-local-gateway.istio-system.svc.cluster.local.
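To put the 5s versus 60s TTL question in numbers, here is a rough back-of-the-envelope sketch. It assumes every Envoy sidecar re-resolves a hostname once per TTL expiry; the proxy count is a hypothetical figure, not a measurement from these clusters, and `dns_queries_per_second` is an illustrative helper, not part of any tool mentioned here.

```python
def dns_queries_per_second(resolvers: int, hostnames: int, ttl_seconds: float) -> float:
    """Approximate query rate if each resolver re-resolves each hostname once per TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return resolvers * hostnames / ttl_seconds


# Hypothetical number of istio-proxy sidecars (assumption, not a measured value).
PROXIES = 200

for ttl in (5, 30, 60):
    rate = dns_queries_per_second(PROXIES, hostnames=1, ttl_seconds=ttl)
    print(f"ttl={ttl:>2}s -> {rate:.1f} queries/s for one hostname")
```

Under this simple model the query rate scales inversely with the TTL, so a record that rarely changes gains a lot from a longer TTL.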
[08:30:09] that is an Istio Gateway, set by knative
[08:30:52] it is used for all the internal traffic, and pods constantly fetch it (better, envoy does it)
[08:31:17] Yeah, and I _suspect_ the DNS record set by knative is published with a TTL of 5s
[08:31:35] Since this is the positive case, the respect_dns_ttl setting is irrelevant
[08:31:54] but it shouldn't be knative that sets the TTL
[08:32:04] I think that knative just uses the gateway
[08:32:08] that we create via istio config
[08:32:44] So where does that DNS record come from?
[08:36:49] it is a regular kubernetes svc
[08:41:07] So kube-dns?
[08:43:33] https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/ might be the place to increase the TTL, but I think that would be for _all_ k8s services
[08:43:59] well that is what we did yesterday no? With the coredns rewrite thing
[08:44:09] You're right.
[08:45:36] my brain is confused by all these layers
[08:45:45] I am asking Janis if I am crazy or nt
[08:45:46] *not
[08:46:11] Same, same re: confusion. I obviously managed to walk in a circle just now
[08:51:42] in the doc you posted
[08:51:43] "ttl allows you to set a custom TTL for responses. The default is 5 seconds. "
[08:52:32] Yeah, I think that rewrite would be the logical way of dealing with the cluster...local query rate
[08:52:49] so it is not a stopgap, but the definitive fix
[08:52:59] I wish people wrote it in github issues :D
[08:53:13] I doubt that record changes often enough to warrant such a low TTL.
Actual services are a different matter, so changing the default would be a lot more dicey
[08:59:49] Machine-Learning-Team, Observability-Metrics, serviceops, Kubernetes: Don't scrape every containerPort for metrics - https://phabricator.wikimedia.org/T318707 (JMeybohm)
[09:03:45] ok updated the patches to reflect what we discussed ;)
[09:10:35] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Reduce DNS queries from istio-proxies to coredns on ML clusters - https://phabricator.wikimedia.org/T318814 (elukey) After a bit of digging we should have the correct picture. Let's pick the `cluster-local-gateway.istio-system.svc.clu...
[09:13:33] +1'd 837073 as well now.
[09:13:55] Not sure why the gerrit dashboard didn't surface that as "my turn"
[09:20:21] super thanks, I'll wait until this afternoon for the coredns chart change (to get serviceops' review) and then I'll roll them out everywhere
[09:20:33] SGTM
[09:21:26] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[09:24:01] * elukey back in a bit
[09:54:14] morning!
[09:54:20] \o
[09:55:11] hi Tobias :)
[09:58:10] elukey: thanks for the link to the feature store summit! I saw many interesting topics
[10:06:15] :)
[10:11:59] I think that the organizer is still Hops, so they are interested in selling their solution :)
[10:12:05] but a lot of good talks are lined up
[10:13:39] aiko: have you ever changed the kserve default tornado access log format by any chance?
[10:13:58] I mean to get something different from "200 POST /v1/models/enwiki-articlequality:predict (127.0.0.1) 155.53ms"
[10:14:14] I am looking into it, but if you have some working snippet somewhere I'd take a look
[10:15:51] elukey: nope I haven't changed the log format
[10:16:12] ack thanks! Going to look for a way to do it
[10:17:39] what do you want to change it to?
[10:19:56] it is missing details like the User Agent for example
[10:20:28] also I am wondering if we could get some info about the callers, not 127.0.0.1
[10:20:35] etc..
[10:20:51] to finally build a logstash dashboard like the ORES one
[10:21:02] so we have a breakdown of callers by IP/UA/etc..
[10:27:47] ah wow this is great! https://github.com/kserve/kserve/commit/ff7014b0c1a79672978d5b0a23af6c5ae1158b3b
[10:28:14] that is fantastic indeed
[10:29:23] of course we don't have it in our version
[10:32:14] 🫠
[10:52:09] * elukey lunch!
[13:09:07] Lift-Wing, Documentation, Machine-Learning-Team (Active Tasks): Improve Lift Wing documentation - https://phabricator.wikimedia.org/T316098 (achou) Hi @Isaac, thanks for your suggestions! These suggestions are all valuable. :) > How to access a model from stat machines and eventually externally? ......
[13:23:51] coredns changes applied to all clusters
[13:52:26] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Reduce DNS queries from istio-proxies to coredns on ML clusters - https://phabricator.wikimedia.org/T318814 (elukey) Rolled out the coredns rewrites to all clusters, way better now!
[14:00:37] changing the log format in kserve (like adding the UA) doesn't seem so straightforward
[14:00:48] the access log seems to be created in tornado itself
[14:54:17] rather than working on tornado logs I chose to improve the istio gateway dashboard :D https://logstash.wikimedia.org/app/dashboards#/view/138271f0-40ce-11ed-bb3e-0bc9ce387d88
[15:32:05] more widgets added, it looks nice now :)
[15:34:17] going afk for the weekend folks!
[15:34:27] have a nice break, talk with you on Monday :)
[16:15:30] \o
[16:15:50] I just did a quick dashboard check on DNS query rate, and while not perfect, we definitely made quite some strides.
[16:15:56] Heading into the weekend now as well
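On the tornado access log question above: plain tornado accepts a `log_function` application setting that is called with the `RequestHandler` at the end of every request. The sketch below shows what a richer access line (User-Agent plus a forwarded address) might look like. It is only an illustration: how kserve constructs its tornado application in the deployed version is not covered here, the `X-Forwarded-For` header being present is an assumption, and `format_access_line`, `log_request` and the logger name are hypothetical.

```python
import logging
from types import SimpleNamespace

access_log = logging.getLogger("example.access")  # hypothetical logger name


def format_access_line(handler) -> str:
    """Build an access line from a tornado RequestHandler (or a look-alike)."""
    request = handler.request
    user_agent = request.headers.get("User-Agent", "-")
    # Assumption: a proxy in front of the service sets X-Forwarded-For.
    forwarded_for = request.headers.get("X-Forwarded-For", "-")
    return (
        f"{handler.get_status()} {request.method} {request.uri} "
        f"({request.remote_ip}) xff={forwarded_for} ua={user_agent!r} "
        f"{1000.0 * request.request_time():.2f}ms"
    )


def log_request(handler) -> None:
    """Candidate for tornado's `log_function` application setting."""
    access_log.info(format_access_line(handler))


# Wiring in plain tornado (not shown running here):
#   tornado.web.Application(routes, log_function=log_request)

if __name__ == "__main__":
    # Stand-in object that mimics the handler attributes used above.
    fake = SimpleNamespace(
        get_status=lambda: 200,
        request=SimpleNamespace(
            method="POST",
            uri="/v1/models/enwiki-articlequality:predict",
            remote_ip="127.0.0.1",
            headers={"User-Agent": "curl/8.0"},
            request_time=lambda: 0.15553,
        ),
    )
    print(format_access_line(fake))
```

Keeping the formatting in a separate function from the logging call makes it easy to check the output without starting a server.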