[15:45:06] ^^ Still having DNS resolution issues on dse-k8s: https://phabricator.wikimedia.org/T346048 Any suggestions? Right now I'm just tcpdumping and comparing configs with staging
[15:49:00] btullis any objections to me deploying spark on dse-k8s? I think rdf-streaming-updater is the only non-control-plane service that has a helmfile ATM
[15:52:41] inflatador: what kind of DNS resolution issues? Do you have specific queries that hang?
[15:53:43] the pods should use CoreDNS, which is correctly deployed on the nodes
[15:53:59] (plus if it didn't work we'd probably have seen explosions earlier on)
[15:56:45] elukey the containers appear to mount the resolv.conf from the host, which doesn't point at a CoreDNS IP, and they can't resolve using the host's resolver. I've one-offed a host to add a CoreDNS resolver and the container can resolve properly with that IP, but the containers I looked at in staging were just mounting resolv.conf from the host
[15:59:50] inflatador: do you have an example pod to check?
[16:01:09] elukey flink-app-wdqs-64c5576cc5-pwqgp on dse-k8s-worker1001.eqiad.wmnet
[16:02:04] are the containers expected to have a full routing table? That ctr only has an apipa route
[16:02:28] no server is expected to have a full routing table in our env
[16:02:32] or container
[16:02:37] nameserver 10.67.32.3
[16:02:45] a full routing table btw is >200k routes
[16:02:47] it seems the correct one
[16:03:09] akosiaris sorry, what I mean is a non-apipi route
[16:03:13] apipa that is
[16:03:47] inflatador: the pod doesn't mount any resolv.conf afaics
[16:04:30] dnsPolicy: ClusterFirst so yeah it should use CoreDNS
[16:09:57] inflatador: btw, regarding the route thing. It's a nice surprise the first time you encounter it. The logic is explained here: https://docs.tigera.io/calico/latest/reference/faq#why-does-my-container-have-a-route-to-16925411
[16:11:36] akosiaris Ah, makes sense. I've seen AWS doing magic with APIPA too
[16:19:53] a quick check btw: flink@flink-app-wdqs-64c5576cc5-pwqgp:/usr/local/lib/python3.7/dist-packages/pyflink$ host swift.discovery.wmnet
[16:19:53] swift.discovery.wmnet has address 10.2.2.27
[16:20:19] maybe it's not DNS but some network policies, add some more info in that task and maybe we can help more
[16:20:54] must be... thanks for taking a look!
[16:21:54] FWIW that def wasn't working earlier
[16:22:41] working now though... y'all are magic ;P
[16:25:46] I just did a kubectl exec, nothing more
[16:26:30] I've been using `nsenter -t 2660329 -n` to check, and I'm seeing the DNS issue again, should I use `kubectl exec` instead?
[16:26:49] nsenter -n only enters the network namespace
[16:26:59] not the mount namespace, so you use the host's mounts
[16:27:11] and that's why you saw the resolv.conf of the host
[16:27:17] pass -m too to nsenter
[16:27:42] but then you are faced with the problem of not having the tools in the container that you have on the host ofc.
[16:27:53] thankfully the host still exists
[16:28:05] yeah, but our containers are still way better for that than most ;)
[16:29:12] anyway, DNS doesn't seem to be the issue, sorry for the trouble
[16:31:05] * inflatador should've just used good ol' docker exec
[16:48:09] I think this is related to the weird helmfile.d/dse-k8s-services/rdf-streaming-updater/values-dse-k8s-eqiad.yaml override stuff we're doing. e-lukey already advised against doing this, but I got some pushback on that from my SWE. I'll talk it over w/him again
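(For context, a minimal sketch of how a helmfile release layers a per-cluster values file over shared defaults; the chart reference and exact file contents below are assumptions for illustration, not the actual deployment-charts layout.)

```yaml
# helmfile.yaml -- illustrative sketch only, not the real deployment-charts file
releases:
  - name: rdf-streaming-updater
    chart: wmf-stable/flink-app          # assumed chart reference for illustration
    values:
      - values.yaml                      # shared defaults
      - values-dse-k8s-eqiad.yaml        # cluster-specific overrides; later files win
```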
[16:53:32] at times i think templating in values files would be useful, and helmfile supports it with templates suffixed in .gotmpl, but i notice we don't use it anywhere so i've been avoiding it. Is there a particular reason it's not used anywhere?
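(A minimal sketch of the .gotmpl templating mentioned above: helmfile renders any values file with a .gotmpl suffix through Go templates before passing the result to helm. The keys below are hypothetical; `.Environment.Name` is a standard helmfile template variable.)

```yaml
# values.yaml.gotmpl -- hypothetical example, not a file from deployment-charts
app:
  cluster_name: {{ .Environment.Name }}
  # per-environment tweak without maintaining a separate values-<env>.yaml
  taskmanager_replicas: {{ if eq .Environment.Name "dse-k8s-eqiad" }}2{{ else }}1{{ end }}
```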