[10:17:26] heads-up here as well: I've just dropped the deprecated cergen certificates from private puppet (https://phabricator.wikimedia.org/T300033). I did run a helmfile diff for everything I could think of. But if you see anything dropping certificate data during k8s deployments, please ping me
[10:22:41] wow!
[14:16:43] Hi all! I'm facing some issues when sending requests to a service behind ingress on the dse-k8s cluster. So far, I have created a VIP for the k8s cluster ingress (10.2.2.91), added that to our DNS, created the LVS service for the cluster ingress gateway, added a CNAME record (spark-history-test.svc.eqiad.wmnet) pointing to
[14:16:43] k8s-ingress-dse.svc.eqiad.wmnet. I've also deployed my service on dse-k8s, namespace spark-history-test. However, when running `curl -v https://spark-history-test.svc.eqiad.wmnet:30443`, I get a `connection reset by peer` error.
[14:16:51] I have compiled some notes here, if that can help https://phabricator.wikimedia.org/P54329
[14:18:00] when running a tcpdump on the worker node (`tcpdump tcp and port 30443`), it seemed to me that the traffic isn't being received at all, but I might be wrong. As we're receiving probes from the LVS servers and prometheus traffic, I could have gotten that wrong
[14:18:34] if anyone with enough knowledge about our networking setup in k8s could spare a bit of time, I'd be really grateful. Thank you!
[14:24:09] I should add, I had a quick look with brouberol and was fairly stumped.
[14:26:28] I'm not even seeing the request reach the ingressgateway pod really
[14:28:21] we can check the listeners via istioctl
[14:28:26] to see what is configured
[14:30:17] the two that I see are
[14:30:17] 0.0.0.0 8443 SNI: spark-history-test.discovery.wmnet,spark-history-test.svc.codfw.wmnet,spark-history-test.svc.eqiad.wmnet Route: https.443.https.spark-history-test.spark-history-test
[14:30:22] 0.0.0.0 8443 SNI: echoserver-dse-k8s-eqiad.discovery.wmnet,echoserver-dse-k8s-eqiad.svc.codfw.wmnet,echoserver-dse-k8s-eqiad.svc.eqiad.wmnet Route: https.443.https.echoserver-dse.echoserve
[14:32:01] port 8443 is exposed via nodeport 30443 afaics so up to now all good
[14:32:19] I'm interested in the first one. The echoserver is a dummy service I've deployed to avoid getting pybal failure alerts if I somehow delete the spark-history-test service
[14:32:44] aka to force port 30443 to stay open on all worker nodes
[14:33:12] I'm intrigued by the `Route: https.443` - Should this be routing to port 18081 brouberol?
[14:34:58] the way I understand it, this routes to port 443 of the service gateway, which will then route me to port 18081 (envoy sidecar) via an istio virtualservice, which itself is in charge of tls termination and proxying traffic to my app pod, on port 18080 (plain http)
[14:35:43] "service gateway" being this thing
[14:35:45] brouberol@deploy2002:~/spark-history$ k get gateway
[14:35:45] NAME AGE
[14:35:45] spark-history-test 85m
[14:44:56] probably unrelated but there is a pod broken - spark-history-test-service-checker
[14:46:43] yes, this is a test pod injected at runtime. It might be related, or not. The issue is that it's failing to reach http://spark-history-test:18080
[14:47:01] yes I know, this is why I mentioned "unrelated"
[14:47:18] and I think this is because of the mesh: when mesh.enabled is true, then I don't get the plain service in front of my pod
[14:47:54] ??
[14:48:01] where do you see mesh enabled?
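(For reference, a rough sketch of the istioctl listener check mentioned above, run against the dse-k8s ingress gateway. The namespace, pod label, and `<gateway-pod>` placeholder are assumptions, not taken from this log.)

```
# Sketch: inspect what the ingress gateway has configured.
# Assumes the gateway pods run in istio-system with the upstream label.
kubectl -n istio-system get pods -l app=istio-ingressgateway
istioctl proxy-config listeners <gateway-pod> -n istio-system --port 8443
istioctl proxy-config routes <gateway-pod> -n istio-system
```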
[14:48:16] ah you mean the module's mesh
[14:48:36] yes indeed
[14:48:45] in my deployment's values.yaml
[14:49:08] can you share your values.yaml?
[14:50:03] sure. Right now I have what's defined in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/978629 and a couple of overrides, most of them related to the spark configuration, along with mesh.public_port: 18081
[14:51:20] oh and ingress.gatewayHosts.default: "spark-history-test"
[15:00:51] mmm with `kubectl get secret -n istio-system` I see only a cert for echoserver
[15:01:50] yeah I don't see a "Certificate" resource for spark-history
[15:02:33] brouberol: I think that something is not working with your spark-history-test's ingress config
[15:03:23] wait, doesn't this one count?
[15:03:23] brouberol@deploy2002:~/spark-history$ k get certificate | grep -v NAME
[15:03:23] spark-history-test-tls-proxy-certs True spark-history-test-tls-proxy-certs 112m
[15:03:35] oops sorry, I read too fast
[15:03:58] spark-history-test:
[15:04:16] deployTLSCertificate: false
[15:04:19] brouberol: --^
[15:05:02] the facepalm is strong with this one
[15:05:31] don't worry there are a lot of moving gears :)
[15:06:16] nice, thank you! So how can I fix this? Should I just remove the `deployTLSCertificate: false` stanza from both namespaces and redeploy admin_ng ?
[15:06:32] or should I delete both namespaces and re-start from scratch?
[15:08:45] I think that you can flip it to "true" and then deploy the admin_ng namespaces' config for dse
[15:09:01] it should create the certificate resource, and cfssl should do the rest
[15:09:16] once the TLS cert is up, in theory Istio should start routing correctly
[15:09:28] I didn't find anything weird from a quick pass in the rest of the config
[15:09:42] let's have a look. And first off, th
[15:09:50] *thanks so much for your time
[15:10:12] but suggestion - after this it may be good to wrap up and review the code change, otherwise the more we add the more difficult it is to jump in and help
[15:10:16] np! anytime
[15:10:58] re suggestion: agreed
[15:13:27] CR sent https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/982106
[15:16:17] brouberol: it looks good, one nit - the commit msg is very generic, it mentions neither dse nor spark-history. I'd start with something like "admin_ng: fix gateway TLS settings for DSE" or something similar, and then mention spark-history
[15:16:48] 👍
[15:19:46] (updated)
[15:21:14] +1ed
[15:24:58] indeed, I see the certificates being listed in helmfile diff
[15:27:16] * subjectAltName: host "spark-history-test.svc.eqiad.wmnet" matched cert's "spark-history-test.svc.eqiad.wmnet"
[15:27:16] * SSL certificate verify ok.
[15:27:22] you are the proverbial savior
[15:32:34] nice!
[16:07:44] actually, I'm a bit confused. I'm able to establish a connection to the envoy proxy, which looks like an HTTP/2 streaming session, that closes after a while with "upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure". It looks as though the gateway fails to redirect me to the envoy
[16:07:44] sidecar in charge of TLS termination
[16:08:27] the destination of the virtualservice looks correct though, and the associated service has a non-empty endpoint
[16:09:50] I'm still trying to figure out whether everything works as expected and I'm just missing some curl headers or flags, or whether I'm missing something in my deployment itself
[16:10:13] brouberol: I've not read scrollback: Are you setting SNI for your request? ingressgateway needs it to present you with the correct cert
[16:14:16] it seems that the curl I'm using on sretest should be recent enough to have SNI available, at least
[16:15:02] but I might indeed be missing something, as I'm not seeing `client_name` in the handshake
[16:17:21] (although it's been a while since I had to actively care about SNI so I might just be looking at the wrong thing, to tell you the truth)
[16:17:29] brouberol: serviceops is in a meeting currently .. more or less until sre meeting :/
[16:17:41] with curl -v I see
[16:17:42] "subjectAltName: host "spark-history-test.svc.eqiad.wmnet" matched cert's "spark-history-test.svc.eqiad.wmnet""
[16:17:54] so I think we are good
[16:18:05] this time I got a 503 for example
[16:19:21] brouberol: https://logstash.wikimedia.org/goto/91f68d24ee44c8274d91a848be7f77e2 this may be useful
[16:20:51] beautiful, thank you
[16:20:58] the flag is UF, upstream connect failure
[16:22:01] so that'd be the gateway to the envoy sidecar?
[16:22:18] in theory it is the gateway complaining when connecting to the sidecar
[16:22:48] (btw, thank you jayme for jumping in)
[16:24:18] I tried `curl https://localhost:18081 -k` via nsenter on the spark pod, and it works
[16:25:07] I think I might have an idea about what's g
[16:25:10] going on
[16:25:20] could you try without the --insecure/-k ?
[16:27:06] I think that the mesh.certs.cert / mesh.certs.key values were not overridden as I thought they would be, and the envoy sidecar does not have a proper certificate/private key
[16:27:26] uh...maybe I missed a thing?
[16:27:31] https://phabricator.wikimedia.org/T300033
[16:29:42] 👀
[16:31:34] brouberol: I tried with `echo y | sudo nsenter -t 3388296 -n openssl s_client -connect localhost:18081 | openssl x509 -text` and the cert seems ok
[16:31:44] Ah, I see that your PR releases a new version of the istio module! I might still be lagging behind due to the chart being a WIP
[16:32:08] I'm using istio_1.0.3
[16:32:19] this is a little weird
[16:32:19] Subject: CN = spark-history-test-tls-proxy-certs
[16:32:32] but I see
[16:32:32] X509v3 Subject Alternative Name:
[16:32:33] DNS:spark-history-test-tls-service.spark-history-test.svc.cluster.local
[16:32:41] na.. CN is to be ignored
[16:32:44] has been for years
[16:32:45] okok
[16:33:17] I just used the cert's name there because CN is limited to N characters
[16:33:34] which we consistently exceed with the cluster.local stuff
[16:34:13] jayme would it be worth upgrading to istio_1.1, as per your recent changes?
[16:34:47] not really. it should work with 1.0 as well. I'm sure 1.1 does not fix anything about this :-p
[16:35:18] ack
[16:39:21] brouberol: I see nodeport configured for the spark history svc, it seems strange
[16:39:47] have you added a special config to it?
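(A rough sketch of how the service type, selectors, and endpoints under discussion could be checked; the `spark-history-test-tls-service` name is taken from the cert SAN above and may not match the real object names.)

```
# Sketch: confirm service type (ClusterIP vs NodePort), selector, and backing endpoints.
kubectl -n spark-history-test get svc -o wide
kubectl -n spark-history-test get endpoints spark-history-test-tls-service
kubectl -n spark-history-test get pods -o wide --show-labels
```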
[16:40:03] in theory it should be cluster ip
[16:40:42] the selectors are also weird
[16:41:18] it may be an issue with the svc, it would explain why the gateway cannot contact the sidecar
[16:41:27] I haven't changed the service config from what was generated by the scaffolding script, AFAICT
[16:42:36] do you have nodeport listed anywhere in your values.yaml?
[16:43:43] ah no wait, sorry my bad
[16:43:43] (I indeed see that the tls-service service for echoserver is of ClusterIP type). Checking values.yaml
[16:44:05] some have nodeport configured, afaics
[16:44:20] I checked ores-legacy and it was clusterIp, but rec-api on wikikube seems to be using the nodeport
[16:44:40] I have ingress.keepNodePort set to true, but that was generated automatically
[16:45:16] both is fine, nodeport services do have a clusterip as well
[16:45:36] but in this case, keepNodePort should be false no?
[16:45:58] it can absolutely be...but should not hurt if it's true
[16:46:16] are we talking echoserver namespace or spark-history-something?
[16:46:27] spark-history-test
[16:46:40] I checked, the echoserver chart was scaffolded with keepNodePort: false and spark-history was scaffolded with keepNodePort: true
[16:47:07] I have a sense that the `true` probably is a manual mistake on my part, somehow
[16:47:41] checked also the svc's selectors, they seem good
[16:48:07] (the chart was initially scaffolded w/o mesh and ingress, and that might be the result of keeping wrong values around after re-running ./create_new_service.sh)
[16:48:49] I can try to flip that value to see whether it helps
[16:49:21] I'd say something is off with the pod itself
[16:49:35] ah, no - me stupid
[16:50:18] in the network policy the ingress port listed is 18080
[16:50:29] it doesn't seem right
[16:50:32] that does not seem right
[16:50:40] curl -v -X GET -I 10.67.27.227:18080 - works
[16:50:45] curl -v -X GET -I 10.67.27.227:18081 - does not
[16:50:50] yeah ok
[16:50:55] always the firewall :D
[16:51:08] 18081 is the pod public port and 18080 is the application port
[16:51:27] (just as FYI)
[16:51:32] yes but 18081 is the one to be exposed, since it is envoy that takes traffic
[16:51:56] it is probably in the mesh config, you have 18080 and not 18081
[16:51:56] I'm sensing a facepalm coming
[16:54:00] where is the config for all that btw? I don't see anything in helmfile.d/dse-k8s-services
[16:54:29] oh..me stupid again
[16:54:56] no...confused. there only is echoserver
[16:55:07] jayme: that service was complex to bootstrap as it needed to be able to talk to kerberos and hadoop, so I have a WIP CR and I manually render and apply changes in the namespace, sadly
[16:55:43] ah, I see...thybye o/ :-p
[16:55:48] which allowed me to get this far without sending tens of CRs, but it's making this conversation complicated for y'all
[16:55:58] sorry about that
[16:56:16] I was joking. You do have the stuff somewhere on deploy2002?
[16:56:30] yes, /home/brouberol/spark-history
[16:56:54] * /home/brouberol/spark-history/output/spark-history/templates/
[16:57:01] for the rendered templates
[17:00:21] I think you've given me more time than I could ask for. I have a way forward, and I need to log off to be a dad in 3 minutes. Thanks again for all the help, I'll see how to fix the ingress
[17:01:28] brouberol: I can def. take a proper look tomorrow
[17:08:29] no need, that was the last issue <3 The root cause was that I initially scaffolded the app as non-meshed, and later realized I did need the mesh, but didn't realize that I was including `app.generic.networkpolicy.ingress` and not `mesh.networkpolicy.ingress`. Using the right template and redeploying caused everything to work
[17:08:35] I owe you one, once again
[17:10:29] it was a nice way for everybody reading to see how to debug ingress -> mesh traffic
[17:10:32] :)
[17:10:43] we should probably document it somewhere
[17:10:49] like: check 1) then 2) etc..
[17:11:02] I always check network policies last
[17:15:37] yeah, me too 🤦
[17:45:38] elukey: there is an unexpected admin_ng diff on ml-serve-codfw. Could you take a look (no rush) and deploy cert-manager on the way?
[17:47:08] jayme: ack yes! Will do it tomorrow
[17:47:18] cool, thanks
[17:47:47] (all other clusters are done)
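(Following up on the "check 1) then 2)" idea above, a rough sketch of the check order that emerged from this session. Ports and hostnames are the ones from this log; the gateway namespace and the `<gateway-pod>` placeholder are assumptions.)

```
# 1) TLS certificate exists and is Ready
kubectl -n spark-history-test get certificate
# 2) gateway listener/SNI and route are configured for the host
istioctl proxy-config listeners <gateway-pod> -n istio-system --port 8443
# 3) service type, selectors, and endpoints look sane
kubectl -n spark-history-test get svc,endpoints -o wide
# 4) network policy allows the mesh/TLS port (18081 here), not just the app port (18080)
kubectl -n spark-history-test get networkpolicy -o yaml | grep -A4 ports
# 5) end to end through LVS + nodeport
curl -v https://spark-history-test.svc.eqiad.wmnet:30443
```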