[14:32:00] Hi SIG! We're currently investigating regular crashes of the airflow scheduler in all our airflow namespaces. The root cause is always `psycopg2.OperationalError: could not translate host name "postgresql-airflow-XXX-pooler-rw.airflow-search" to address: Name or service not known`. That service name normally resolves, and from time to time, it just does not resolve at all.
[14:32:18] I've checked; the pods behind that service are Ready when that happens
[14:32:43] would you have an idea as to where I should be digging? How can I find the kube-service-controller logs?
[14:32:46] thanks!
[14:49:45] that feels like it might be an issue with DNS search domains?
[14:54:37] sorry, I should have edited the message to be `psycopg2.OperationalError: could not translate host name "postgresql-airflow-XXX-pooler-rw.airflow-XXX" to address: Name or service not known`.
[14:54:56] it's happening across all namespaces, albeit not at the same time
[14:55:01] not just in the airflow-search ns
[15:00:56] brouberol: kube-service-controller logs can be found on the apiservers (journalctl). You might also want to take a look at the coredns dashboard(s) if you spot anything strange there
[15:03:11] if possible, try to use a terminated FQDN, postgresql-airflow-XXX-pooler-rw.airflow-XXX.svc.cluster.local., to avoid additional lookups - but from the error it seems coredns is responding properly and simply does not know the name? Are the service names volatile (race condition maybe)?
[15:05:05] that's the thing: they are not. The pods are running and ready
[15:05:25] I'll take you up on the idea of using a svc.cluster.local name though
[15:06:32] I have vague memories of lore around k8s and ndots
[15:32:20] yeah, I think you can't go further than ndots: 5
[15:32:51] but that might just be me remembering what we set at $PREV_JOB and taking it as absolute
[15:33:59] brouberol: oh, I meant lore around it being sometimes-flaky, but maybe that was just an Alpine wart
[15:57:25] interesting side note: dse seems to be the only cluster exporting some of the coredns_kubernetes metrics described here: https://coredns.io/plugins/kubernetes/#metrics
[15:57:37] wonder why that is...
[15:58:01] jayme: our chart might not enable metrics
[15:58:02] but luckily I gtg :D
[15:58:08] we do
[15:58:14] and it's the same chart on all clusters
[15:58:47] the other (dns-related) metrics are around as well. AIUI they are all produced by the prometheus plugin
[15:59:08] maybe the metrics are created on first use
[15:59:49] yeah... but coredns_kubernetes_dns_programming_duration_seconds pretty much sounds like it's impossible for it to not be in use
[16:00:34] well, I'm going to check back tomorrow hoping someone will have figured it out by then :p
[16:17:33] in any case, thanks to both for the pointers
[16:22:40] somehow, the mlserve and dse clusters have the rewrite plugin enabled: https://grafana.wikimedia.org/goto/c-uVbLDHR?orgId=1 - this might be an old chart version I left behind in those clusters by accident
[16:23:18] https://grafana.wikimedia.org/goto/S4eHxLDHR?orgId=1
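
A note on the 15:03 / 15:06 exchange about search domains and ndots: the sketch below is not from the conversation (the resolv.conf path and hostnames are just the illustrative ones used above); it approximates how the glibc resolver expands an un-terminated name through the search list, and why the terminated FQDN with the trailing dot costs a single lookup.

```
# Sketch only: approximate the glibc search-list behaviour that makes
# un-terminated names cost extra lookups. Assumes the pod's /etc/resolv.conf
# is readable; hostnames are the illustrative ones from the conversation.
def candidate_queries(name, resolv_conf="/etc/resolv.conf"):
    search, ndots = [], 1
    with open(resolv_conf) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == "search":
                search = parts[1:]
            elif parts[0] == "options":
                for opt in parts[1:]:
                    if opt.startswith("ndots:"):
                        ndots = int(opt.split(":", 1)[1])
    if name.endswith("."):
        # A terminated FQDN is tried as-is: one lookup, no search expansion.
        return [name]
    candidates = []
    if name.count(".") >= ndots:
        candidates.append(name)
    # Fewer dots than ndots: every search domain is tried first,
    # each one an extra round trip to CoreDNS before the absolute name.
    candidates += [f"{name}.{domain}" for domain in search]
    if name.count(".") < ndots:
        candidates.append(name)
    return candidates

print(candidate_queries("postgresql-airflow-XXX-pooler-rw.airflow-XXX"))
print(candidate_queries("postgresql-airflow-XXX-pooler-rw.airflow-XXX.svc.cluster.local."))
```

With the usual kubelet-managed resolv.conf (ndots:5 and a search list like `<namespace>.svc.cluster.local svc.cluster.local cluster.local`), the short pooler name has far fewer than five dots, so it is expanded through every search domain before being tried as-is, while the terminated form is resolved directly.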
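
On the scheduler crashes themselves: the error is psycopg2 surfacing the transient resolution failure as an `OperationalError`, so the process dies on the first failed lookup. A minimal mitigation sketch follows, assuming the connection can be wrapped with a retry; the DSN, retry count, and delay are illustrative, and this only papers over the symptom rather than explaining the intermittent DNS failure.

```
import time
import psycopg2

# Sketch only: retry the connection when DNS resolution fails transiently.
# DSN and retry budget are illustrative placeholders.
DSN = (
    "host=postgresql-airflow-XXX-pooler-rw.airflow-XXX.svc.cluster.local. "
    "dbname=airflow user=airflow"
)

def connect_with_retry(dsn, attempts=5, delay=2.0):
    for attempt in range(1, attempts + 1):
        try:
            return psycopg2.connect(dsn)
        except psycopg2.OperationalError as exc:
            # Only retry the name-resolution failure seen in the logs;
            # re-raise anything else, and give up after the last attempt.
            if "could not translate host name" not in str(exc) or attempt == attempts:
                raise
            time.sleep(delay)

conn = connect_with_retry(DSN)
```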
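
On the coredns_kubernetes metrics side note (15:57 onwards): one way to test the "created on first use" theory is to scrape a CoreDNS instance's metrics endpoint directly on each cluster and compare which series are present. A sketch, assuming the prometheus plugin's default :9153 port and a reachable pod IP (the IP below is a placeholder):

```
import urllib.request

# Sketch: list which coredns_kubernetes_* series one CoreDNS instance exposes.
# Assumes the prometheus plugin's default :9153 port; the pod IP is a placeholder.
COREDNS_METRICS_URL = "http://10.0.0.10:9153/metrics"

with urllib.request.urlopen(COREDNS_METRICS_URL, timeout=5) as resp:
    body = resp.read().decode()

seen = sorted({
    line.split("{")[0].split(" ")[0]
    for line in body.splitlines()
    if line.startswith("coredns_kubernetes_")
})
print("\n".join(seen) or "no coredns_kubernetes_* series exposed")
```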