[14:32:00] Hi SIG! We're currently investigating regular crashes of the airflow scheduler in all our airflow namespaces. The root cause is always `psycopg2.OperationalError: could not translate host name "postgresql-airflow-XXX-pooler-rw.airflow-search" to address: Name or service not known`. That service name normally resolves, and from time to time, it just does not resolve at all.
[14:32:18] I've checked; the pods behind that service are Ready when that happens
[14:32:43] would you have an idea as to where I should be digging? How can I find the kube-service-controller logs?
[14:32:46] thanks!
[14:49:45] that feels like it might be an issue with DNS search domains?
[14:54:37] sorry, I should have edited the message to be `psycopg2.OperationalError: could not translate host name "postgresql-airflow-XXX-pooler-rw.airflow-XXX" to address: Name or service not known`.
[14:54:56] it's happening across all namespaces, albeit not at the same time
[14:55:01] not just in the airflow-search ns
[15:00:56] brouberol: kube-service-controller logs can be found on the apiservers (journalctl). You might also want to take a look at the coredns dashboard(s) if you spot anything strange there
[15:03:11] if possible, try to use a terminated FQDN, postgresql-airflow-XXX-pooler-rw.airflow-XXX.svc.cluster.local., to avoid additional lookups - but from the error it seems coredns is responding properly and simply does not know the name? Are the service names volatile (race condition maybe)?
[15:05:05] that's the thing: they are not. The pods are running and ready
[15:05:25] I'll take you up on the idea of using a svc.cluster.local name though
[15:06:32] I have vague memories of lore around k8s and ndots
[15:32:20] yeah, I think you can't go further than ndots: 5
[15:32:51] but that might just be me remembering what we set at $PREV_JOB and taking it as absolute
[15:33:59] brouberol: oh, I meant lore around it being sometimes-flaky, but maybe that was just an Alpine wart
[15:57:25] interesting side note: dse seems to be the only cluster exporting some of the coredns_kubernetes metrics described here: https://coredns.io/plugins/kubernetes/#metrics
[15:57:37] wonder why that is...
[15:58:01] jayme: our chart might not enable metrics
[15:58:02] but luckily I gtg :D
[15:58:08] we do
[15:58:14] and it's the same chart on all clusters
[15:58:47] the other (dns-related) metrics are around as well. AIUI they are all produced by the prometheus plugin
[15:59:08] maybe the metrics are created on first use
[15:59:49] yeah... but coredns_kubernetes_dns_programming_duration_seconds pretty much sounds like it's impossible for it to not be in use
[16:00:34] well, I'm going to check back tomorrow hoping someone will have figured it out by then :p
[16:17:33] in any case, thanks to both for the pointers
[16:22:40] somehow, the mlserve and dse clusters have the rewrite plugin enabled: https://grafana.wikimedia.org/goto/c-uVbLDHR?orgId=1 - this might be an old chart version I left behind in those clusters by accident
[16:23:18] https://grafana.wikimedia.org/goto/S4eHxLDHR?orgId=1
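
A note on the 15:03 / 15:06 exchange about search domains and ndots: the sketch below is not from the conversation (the resolv.conf path and hostnames are just the illustrative ones used above); it approximates how the glibc resolver expands an un-terminated name through the search list, and why the terminated FQDN with the trailing dot costs a single lookup.

```
# Sketch only: approximate the glibc search-list behaviour that makes
# un-terminated names cost extra lookups. Assumes the pod's /etc/resolv.conf
# is readable; hostnames are the illustrative ones from the conversation.
def candidate_queries(name, resolv_conf="/etc/resolv.conf"):
    search, ndots = [], 1
    with open(resolv_conf) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == "search":
                search = parts[1:]
            elif parts[0] == "options":
                for opt in parts[1:]:
                    if opt.startswith("ndots:"):
                        ndots = int(opt.split(":", 1)[1])
    if name.endswith("."):
        # A terminated FQDN is tried as-is: one lookup, no search expansion.
        return [name]
    candidates = []
    if name.count(".") >= ndots:
        candidates.append(name)
    # Fewer dots than ndots: every search domain is tried first,
    # each one an extra round trip to CoreDNS before the absolute name.
    candidates += [f"{name}.{domain}" for domain in search]
    if name.count(".") < ndots:
        candidates.append(name)
    return candidates

print(candidate_queries("postgresql-airflow-XXX-pooler-rw.airflow-XXX"))
print(candidate_queries("postgresql-airflow-XXX-pooler-rw.airflow-XXX.svc.cluster.local."))
```

With the usual kubelet-managed resolv.conf (ndots:5 and a search list like `<namespace>.svc.cluster.local svc.cluster.local cluster.local`), the short pooler name has far fewer than five dots, so it is expanded through every search domain before being tried as-is, while the terminated form is resolved directly.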
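
On the scheduler crashes themselves: the error is psycopg2 surfacing the transient resolution failure as an `OperationalError`, so the process dies on the first failed lookup. A minimal mitigation sketch follows, assuming the connection can be wrapped with a retry; the DSN, retry count, and delay are illustrative, and this only papers over the symptom rather than explaining the intermittent DNS failure.

```
import time
import psycopg2

# Sketch only: retry the connection when DNS resolution fails transiently.
# DSN and retry budget are illustrative placeholders.
DSN = (
    "host=postgresql-airflow-XXX-pooler-rw.airflow-XXX.svc.cluster.local. "
    "dbname=airflow user=airflow"
)

def connect_with_retry(dsn, attempts=5, delay=2.0):
    for attempt in range(1, attempts + 1):
        try:
            return psycopg2.connect(dsn)
        except psycopg2.OperationalError as exc:
            # Only retry the name-resolution failure seen in the logs;
            # re-raise anything else, and give up after the last attempt.
            if "could not translate host name" not in str(exc) or attempt == attempts:
                raise
            time.sleep(delay)

conn = connect_with_retry(DSN)
```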
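
On the coredns_kubernetes metrics side note (15:57 onwards): one way to test the "created on first use" theory is to scrape a CoreDNS instance's metrics endpoint directly on each cluster and compare which series are present. A sketch, assuming the prometheus plugin's default :9153 port and a reachable pod IP (the IP below is a placeholder):

```
import urllib.request

# Sketch: list which coredns_kubernetes_* series one CoreDNS instance exposes.
# Assumes the prometheus plugin's default :9153 port; the pod IP is a placeholder.
COREDNS_METRICS_URL = "http://10.0.0.10:9153/metrics"

with urllib.request.urlopen(COREDNS_METRICS_URL, timeout=5) as resp:
    body = resp.read().decode()

seen = sorted({
    line.split("{")[0].split(" ")[0]
    for line in body.splitlines()
    if line.startswith("coredns_kubernetes_")
})
print("\n".join(seen) or "no coredns_kubernetes_* series exposed")
```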