[07:04:04] 10serviceops, 10Parsoid, 10SRE, 10Scap: scap groups on bastions still needed? - https://phabricator.wikimedia.org/T327066 (10Joe) yeah, +1 to killing with fire :) [07:39:44] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10conftool, and 2 others: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (10Joe) 05Open→03Resolved This is now fully resolved. [08:33:57] 10serviceops, 10Parsoid, 10SRE, 10Scap: scap groups on bastions still needed? - https://phabricator.wikimedia.org/T327066 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I've removed the Puppet class from the bastions, the existing files will vanish with ongoing reimages. [08:52:56] hello folks [08:53:14] since Janis is back next week, lemme know if anybody wants to review https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/869771 [08:53:29] the idea is to add explicit support for active/passive services in the cookbook [08:54:09] the implementation is very simple, basically don't do anything (and just print confctl commands) in case the cookbook finds an active/passive svc. [09:40:35] <_joe_> I had different ideas around that [09:41:54] <_joe_> that is, exclude any service that is a/p and not easy to switch *explicitly* [09:42:11] <_joe_> and for the rest, do it in this cookbook or create a specific one [09:45:15] <_joe_> but i can take a look and we can evolve from here actually [09:46:16] could we define the "not easy to switch" ones in puppet's hieradata for services? [09:46:18] I am happy to work on any idea that serviceops has :) [09:46:24] that way it doesn't have to be hardcoded [09:47:10] in multiple places but just in the source of truth of the service definition [09:48:25] in the current cookbook's proposal any a/p svc gets the "not easy to switch" label, then it is the operator that needs to follow up with them.. 
In theory we should have few a/p services and the majority should be a/a [09:55:21] In wmnet, 13 metafo records, 53 geoip [10:43:31] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc1050.eqiad.wmnet with OS bullseye [11:12:23] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc1050.eqiad.wmnet with OS bullseye completed: - mc1050 (**PASS**) - Downtimed on Icinga/Alertmanager - Disa... [13:43:22] o/ _joe_, if you have a minute, i wonder if you could help me with somethign in jay me's absence [13:43:23] https://phabricator.wikimedia.org/T324576#8532328 [13:43:42] https://phabricator.wikimedia.org/T324576#8517742 [13:45:11] oh my and i see luca is begging for attention too ;) [13:49:09] <_joe_> ottomata: have you checked using nsenter if you can telnet to the kube api server:port from the network namespace of your pod? [14:04:15] oh [14:05:22] til nsenter... [14:10:37] _joe_: can i do that from the deployment server? or do I need to be on a container in the pod? [14:11:00] <_joe_> ottomata: you need to get what k8s server runs the pod [14:11:10] oh from the bare metal, got it. [14:11:11] makes sense. [14:11:13] <_joe_> also mark the container id [14:11:29] <_joe_> then ssh there, find the system pid of that container [14:11:36] <_joe_> (using e.g. docker top) [14:11:43] <_joe_> or [14:11:49] <_joe_> k8sh :P [14:12:21] k8sh... [14:13:14] nice [14:19:45] <_joe_> (mine, not comcast's :P) [14:30:43] oh [14:30:49] link? 
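Editor's sketch of the procedure _joe_ outlines above (find the node, note the container ID, get its host PID, then run a network tool inside that PID's network namespace). It only builds the command line and does not run it. The helper name and the example PID, address and port are hypothetical; `-t` (target PID) and `-n` (network namespace only) are standard `nsenter` options. Because only the network namespace is entered, files such as `/etc/resolv.conf` still come from the host.

```python
import shlex


def netns_probe_command(pid: int, host: str, port: int) -> list[str]:
    """Build the argv that runs telnet inside the network namespace of `pid`.

    Hypothetical helper for illustration: it only assembles the command,
    it never executes anything.
    """
    return ["sudo", "nsenter", "-t", str(pid), "-n", "telnet", host, str(port)]


if __name__ == "__main__":
    # Placeholder PID/address/port, not values taken from a real host.
    print(shlex.join(netns_probe_command(12345, "192.0.2.10", 6443)))
```

The PID would come from something like `docker top <container id>` on the node running the pod, as described in the conversation.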
[14:57:56] <_joe_> https://github.com/lavagetto/k8sh [14:58:07] <_joe_> !issync [14:58:08] Syncing #wikimedia-serviceops (requested by joe_oblivian) [14:58:09] Set /cs flags #wikimedia-serviceops claime +Afiortv [14:58:11] Set /cs flags #wikimedia-serviceops kavitha +Afiortv [14:58:39] ottomata LMK if you have time for a quick chat about flink/k8s stuff . Just curious if I can help out with the helm chart (or be useful in any other way ;) ) [15:07:33] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host mc1052.eqiad.wmnet with OS bullseye [15:20:04] _joe_: am i doing this right? [15:20:14] [@dse-k8s-worker1006:/home/otto] $ sudo docker ps [15:20:27] 36505d7d8120 docker-registry.discovery.wmnet/pause:k8s_116 "/pause" 20 hours ago Up 20 hours k8s_POD_flink-app-main-6485449c98-l4xz5_stream-enrichment-poc_d12e1923-7cfc-4b91-b571-cecaa98481d4_0 [15:20:30] <_joe_> you need the id of the container [15:20:35] sudo docker top 36505d7d8120 [15:20:42] PID 1824562 [15:20:46] sudo nsenter -t 1824562 [15:20:56] telnet kubernetes.default.svc.cluster.local 443 [15:20:59] telnet: could not resolve kubernetes.default.svc.cluster.local/443: Name or service not known [15:21:06] <_joe_> sudo nsenter -t -n telnet kubernetes.default.svc.cluster.local 443 [15:21:24] <_joe_> nsenter doesn't work as a context manager [15:21:31] telnet: could not resolve kubernetes.default.svc.cluster.local/443: Name or service not known [15:21:41] i think without a command it just runs bash? [15:21:45] but either way, result is same. [15:21:50] could not resolve? 
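Editor's sketch: the telnet attempts above fail with "could not resolve", which is a name-resolution failure and says nothing yet about whether the TCP port is reachable. A minimal probe, assuming plain Python sockets stand in for telnet, that reports the two failure modes separately. The function name and return strings are hypothetical.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Return 'dns-failure', 'connect-failure' or 'connected'."""
    try:
        # Step 1: name resolution only. For an IP literal this always succeeds.
        candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"
    # Step 2: try a TCP connect to each resolved address.
    for family, socktype, proto, _canonname, sockaddr in candidates:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "connected"
        except OSError:
            # Refused, timed out or unreachable: try the next address.
            continue
    return "connect-failure"
```

Running the same probe once with the service name and once with the IP shows whether DNS or the network path is the part that fails.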
[15:22:20] <_joe_> uhm [15:22:30] serviceops, Abstract Wikipedia team (Phase θ – Throttling), Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (akosiaris) [15:23:16] serviceops, Abstract Wikipedia team (Phase θ – Throttling), Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (akosiaris) >>! In T326785#8523789, @JMeybohm wrote: > Thanks for writing this, very good read! Sounds pretty sensible to fo... [15:24:13] inflatador: hi! we are so close i think! just some networking stuff getting flink pods to talk to k8s? [15:28:31] _joe_: just realized i have networkpolicy.egress.enabled: false. looking at what that would change i'm not sure it would help..but maybe it would? [15:28:48] ottomata Exciting! I can see you've done a ton of work on this, if I can do anything to help LMK [15:29:34] can you figure out why flink-app-example in dse-k8s-eqiad in stream-enrichment-poc namespace cannot talk to kubernetes.default.svc.cluster.local ? :) [15:31:38] hm actually maybe the egress here does matter, it has some broad cidrs.. trying. [15:32:09] probably not ;) . But it does remind me, we'll probably need to figure out a namespace for prod wikikube. I assume the spark operator stuff will stay on DSE cluster even in prod? [15:32:22] no, it will go to wikikube too [15:32:34] we are just using dse as a safer place to test and develop it [15:33:07] actually, inflatador there are 2 things you could help with! [15:34:48] 1. id' [15:34:56] 1. 1.3.1 of flink-kubernetes-operator has been released [15:35:17] https://flink.apache.org/news/2023/01/10/release-kubernetes-operator-1.3.1.html [15:35:35] it'd be nice to update the image and helm chart [15:36:58] ottomata cool, will get that started [15:37:08] 2. 
figure if we want/need flink operator leader election/ ha [15:37:08] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/#leader-election-and-high-availability [15:37:25] 10serviceops: Upgrade mc* and mc-gp* hosts to Debian Bullseye - https://phabricator.wikimedia.org/T293216 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host mc1052.eqiad.wmnet with OS bullseye completed: - mc1052 (**PASS**) - Downtimed on Icinga/Alertmanager - Disa... [15:37:27] probably not that important,but it might be nice to have. basically just running more than one flink operator replica [15:40:23] I'd defer to dcausse on whether we want flink operator HA, my gut feeling is to leave out this complexity until later [15:42:36] aye. [15:43:05] it might not be that necessary? if the operator is down, thet flink apps can keep running, they will just have trouble being fully restarted, scaling, etc. [15:43:13] (actually, i think scaling might just work?) [15:54:32] ottomata: network.egress.enabled toggles between allowing whitelisted egress traffic (if set to true) and allowing all egress traffic (if set to false) [15:55:01] that being said, I don't know if there are some global network policies on the dse cluster [15:55:14] let me have a quick look before my next meeting [15:57:17] ah yes, it does have GlobalNetworkPolicy per https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/values/dse-k8s.yaml#6 [15:57:31] so you need to set it to enable to allow your rules to take precedence [16:09:11] 10serviceops, 10Data-Engineering-Planning, 10Discovery-Search (Current work), 10Event-Platform Value Stream (Sprint 07), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) Nope, [[ https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/88... 
[16:14:42] 10serviceops, 10Data-Engineering-Planning, 10Discovery-Search (Current work), 10Event-Platform Value Stream (Sprint 07), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) @JMeybohm hm, is the extra NetworkPolicy we made the flink-operator chart i... [16:15:22] oh egress false allows all traffic? [16:15:31] OH! [16:15:43] i see. [16:16:33] akosiaris: i seems even with egress.enabled: true, i have the same problem howevwer. [16:18:37] 10serviceops, 10Graphite, 10Technical-Debt: Future of liuggio/statsd-php-client? - https://phabricator.wikimedia.org/T326607 (10Aklapper) [16:20:13] _joe_: okay. i can sudo nsenter -t 1824562 telnet 10.64.0.228 6443 [16:20:30] 443 no good. [16:20:38] and using the kubernetes.default.svc.cluster.local name no good. [16:20:47] what is 6443 vs 443 in this case? [16:20:54] <_joe_> ok I think for the dns -> 301 to elukey [16:21:12] <_joe_> and that is just me being a dumbass and not remembering it's 6443 :P [16:21:25] <_joe_> sorry [16:21:39] <_joe_> 301 to elukey because I'm not sure how coredns is set up in dse [16:21:55] <_joe_> or btullis :) [16:22:22] <_joe_> also... is that the value you get from the downward api? [16:22:47] <_joe_> I expected it to expose the actual kubemaster fqdn [16:23:09] downward? [16:23:21] <_joe_> how do you get the kubemaster address? [16:23:47] <_joe_> sorry, I'm dealing with two other things at the same time [16:23:53] i'm not exactly sure, i'm guess a bit here...actuall now that i've set egress: enabled, maybe I should remove that. i'm setting is explicitly atm. we had to do that for the flink-operator, but mabye we don't for the app [16:23:55] np np [16:24:19] <_joe_> but I'd bet that kubernetes.default.svc.cluster.local might not be a valid address in dse? [16:24:39] <_joe_> sorry, isn't this the flink operator you're trying to make talk to the kube api? [16:24:39] it is for the flink-operator. 
[16:24:42] no [16:24:54] the flink jobmanager, it will request resources from k8s to create taskmanagers [16:24:56] <_joe_> ok then I am sorry, I misunderstood [16:25:17] the flink operator is working nicely [16:25:25] just trying to get a flink-app to run now [16:25:51] <_joe_> oh uhm, I just realized [16:25:59] indeed, sudo nsenter -t 1824562 telnet dse-k8s-ctrl1001.eqiad.wmnet 6443 works fine. [16:26:16] you think if i dont' override KUBERNETES_SERVICE_HOST to kubernetes.default.svc.cluster.local it should use that? [16:26:17] <_joe_> you were trying to resolve the k8s name from the server [16:26:31] <_joe_> as you just entered the network namespace [16:26:48] <_joe_> telnet still calls gethostbyname(1) which checks resolv.conf on the system [16:26:56] <_joe_> err (2) :P [16:27:25] <_joe_> so yeah, you can connect to the kubemaster I would say [16:27:36] <_joe_> I wouldn't worry about firewalling if telnet works to the IP [16:27:44] <_joe_> nor about the dns if flink-operator works [16:28:03] <_joe_> ok next issue is: what are the symptoms of this not working? [16:28:29] https://phabricator.wikimedia.org/T324576#8517498 [16:28:44] Could not start the ResourceManager akka.tcp://flink@flink-app-main.stream-enrichment-poc:6123/user/rpc/resourcemanager_1 [16:28:53] at org.apache.flink.kubernetes.shaded.okhttp3.internal.platform.Platform.connectSocket(Platform.java:130) [16:29:04] at org.apache.flink.kubernetes.KubernetesResourceManagerDriver.watchTaskManagerPods(KubernetesResourceManagerDriver.java:373) [16:29:44] basically, jobmanager fails because its resourcemanager couldn't ttalk to k8s api...i think. [16:30:02] i assume connection issue because of SocketTimeoutException and connectSocket fail [16:30:21] i don't have info as to what address it is actually trying to connect to. 
[16:30:24] <_joe_> that's for the address you pasted I think [16:30:40] <_joe_> I would assume some other issue rather than connection to the kube master tbh [16:30:52] no, i think that's a symptom. [16:30:57] the ResourceManager can't start [16:31:13] Could not connect to rpc endpoint under address akka.tcp://flink@flink-app-main.stream-enrichment-poc:6123/user/rpc/resourcemanager_*.", [16:31:17] is the symptom. [16:31:33] then ResouceManager can't start because at org.apache.flink.kubernetes.KubernetesResourceManagerDriver.watchTaskManagerPods(KubernetesResourceManagerDriver.java:373) [16:31:33] ... [16:31:33] Caused by: java.net.SocketTimeoutException: connect timed out [16:31:33] ... [16:31:33] at org.apache.flink.kubernetes.shaded.okhttp3.internal.platform.Platform.connectSocket(Platform.java:130) [16:32:27] <_joe_> I would be curious to look at the sources at this point [16:32:27] prettty sure that callstack there is about failing something with k8s. [16:33:10] <_joe_> so it's trying to watch the api for task manager pods, I would assume [16:33:28] https://github.com/apache/flink/blob/release-1.16.0/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/Fabric8FlinkKubeClient.java#L224-L246 [16:33:29] yeah [16:35:07] <_joe_> is the pod where you saw this error still running? [16:35:26] <_joe_> what namespace [16:35:38] <_joe_> I want to try to take a look myself [16:41:37] kube_env stream-enrichment-poc dse-k8s-eqiad [16:41:42] kubectl -n stream-enrichment-poc logs -f flink-app-main-6485449c98-l4xz5 [16:42:42] <_joe_> so this flink-app main is what should connect to the kube master? [16:43:22] <_joe_> ottomata: uhm wait [16:43:29] <_joe_> KUBERNETES_SERVICE_PORT: 443 [16:44:54] <_joe_> that doesn't look right [16:46:04] yeah i was just trying things [16:46:07] we can just remove that and see [16:46:14] i got that from what we did in the operator. [16:46:16] <_joe_> no please wait a sec [16:46:18] k [16:46:27] <_joe_> where is operator running? 
[16:46:29] i don't think flink uses the _PORT env var anyway [16:46:31] from what i can tell [16:46:36] same cluster, flink-operator namespace [16:49:10] <_joe_> ok so [16:49:21] <_joe_> you're actually trying to connect to 10.67.32.1 port 443 [16:52:23] <_joe_> and indeed [16:52:25] <_joe_> $ sudo nsenter -t 1824562 -n telnet 10.67.32.1 443 [16:52:27] <_joe_> Trying 10.67.32.1... [16:52:40] <_joe_> this is the clusterIP service you're pointing to [16:54:09] <_joe_> from the flink operator, otoh [16:54:18] <_joe_> dse-k8s-worker1002:~$ sudo nsenter -t 3600622 -n telnet 10.67.32.1 443 [16:54:20] <_joe_> Trying 10.67.32.1... [16:54:22] <_joe_> Connected to 10.67.32.1. [16:54:24] <_joe_> Escape character is '^]'. [16:54:26] <_joe_> ^] [16:54:37] <_joe_> so there's your problem, something is wrong with your networkpolicies I guess [16:55:07] 10.67.32.1 ? [16:55:31] <_joe_> as admin on that cluster [16:55:32] <_joe_> kubectl -n default get service [16:55:47] <_joe_> that's the internal ip [16:55:54] ah [16:55:58] of the k8s api? [16:56:01] <_joe_> and from teh flink operator pod you can connect to it [16:56:03] <_joe_> yes [16:56:07] interesting... [16:56:10] <_joe_> not from the other one [16:56:12] okay thank you, something to follow [16:56:17] <_joe_> I guess some permission missing? [16:56:31] <_joe_> yeah I'm going offline now :) [16:56:38] okay, thank you! this is helfpul [16:57:45] is it possible that this just was never added to default network policy because dse-k8s is new? btullis ? [16:59:20] or, perhaps that global network policy in ds-k8s is not quite right? [17:01:05] dse-k8s.yaml values GlobalNetworkPolicy has [17:01:06] allow-pod-to-pod: [17:01:12] destination: [17:01:12] nets: [17:01:12] # eqiad [17:01:12] - "10.67.24.0/21" [17:01:48] iiuc correctly, this network is just k8s wikikube eqiad? [17:21:02] no, 10.67.24.0/21 is allocated to dse-k8s. 
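Editor's sketch: a quick check of the addresses quoted above, assuming the `allow-pod-to-pod` destination list contains only the `10.67.24.0/21` net visible in the paste (the real values file may list more). It tests whether the `kubernetes` Service ClusterIP `10.67.32.1` falls inside that range. The helper name is hypothetical.

```python
import ipaddress


def covered_by(ip: str, cidrs: list[str]) -> bool:
    """True if `ip` falls inside at least one of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


# Values quoted in the conversation; the list may be incomplete.
POD_TO_POD_NETS = ["10.67.24.0/21"]
SERVICE_CLUSTER_IP = "10.67.32.1"

if __name__ == "__main__":
    # 10.67.24.0/21 spans 10.67.24.0 to 10.67.31.255, so this prints False.
    print(covered_by(SERVICE_CLUSTER_IP, POD_TO_POD_NETS))
```

Under that assumption the pod-to-pod rule alone does not cover the Service IP, which is consistent with the telnet from the flink-app pod hanging at "Trying 10.67.32.1...".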
[17:21:17] if you got access to netbox, it's prefix https://netbox.wikimedia.org/ipam/prefixes/538/ [17:21:51] so, you are trying to connect to the kubernetes API? with what configuration ? [17:22:21] whatever the kubernetes downward api passes in KUBERNETES_SERVICE_* variables or something hardcoded ? [17:22:55] and I shouldn't be calling that actually downward api as it is happening implicitly while normally the downward api is explicit [17:27:06] i was trying to override KUBERNETES_SERVICE_HOST yesterday to kubernetes.default.svc.cluster.local, but i will remove that. [17:27:23] akosiaris: _joe_ says the problem is talking to 10.67.32.1 from the stream-enrichment-poc namespace [17:28:15] root@deploy1002:~# kube_env admin dse-k8s-eqiad [17:28:15] root@deploy1002:~# kubectl -n default get service [17:28:15] NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE [17:28:15] kubernetes ClusterIP 10.67.32.1 443/TCP 149d [17:28:30] ah that's the automatically provisioned kubernetes service, that also does NAT from 443 to our kubemasters 6443 [17:28:38] i guess i need a rule in this namespace that allows access to that [17:28:58] where should I put that rule? des-k8s.values GlobalNetworkPolicy? or perhaps default-network-policy-conf.yaml? [17:29:49] you shouldn't mess with the GlobalNetworkPolicy to allow things in your namespace to reach out to the kube-api [17:30:19] I got a meeting right now, but I got a slot in 30m from now, let's chat then ? [17:31:07] yup perfect ty [17:32:58] the networkpolicy janis and I added was supposed to allow this, but it looks like we are using the wrong addies from within the flink-app's namespace? [17:32:59] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/879618/ [17:33:44] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/values/dse-k8s-eqiad/values.yaml [17:42:28] wondering where the 10.67.32 stuff is defined/assigned [17:43:01] oh i see it in puppet. 
[17:43:04] profile::kubernetes::service_cluster_cidr: [17:46:12] maybe i need that defined in some values somewhere that i can reference? [17:57:35] FYI I have removed the KUBERNETES_SERVICE_HOST override [17:59:23] serviceops, Data-Engineering-Planning, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) FYI, I'm reverting the KUBERNETES_SERVICE_HOST change in https://gerrit.wik... [18:11:46] akosiaris: o/ :) [18:16:27] i wonder if it would be possible to make a policy that allows access to the kubernetes Service in default with some selectors? [18:16:34] rather than hardcoding IPs somewhere? [18:18:06] gimme a sec, looking at what the operator does [18:19:00] what's with all the flink-* charts? [18:19:05] for the operator, we did set KUBERNETES_SERVICE_HOST and allow it [18:19:05] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/flink-operator/values.yaml#28 [18:19:23] flink-session predates this, that is what WDQS uses. [18:19:49] janis advised to make crds a separate chart, so flink-kubernetes-operator and flink-kubernetes-operator-crds go together. [18:20:19] and flink-app is the one we're working on now, which will be a re-usable chart with helmfiles like https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/dse-k8s-services/flink-app-example/values.yaml [18:21:19] so i think the operator is allowed to talk to k8s via kubernetesMaster.cidrs egress rule [18:21:20] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/flink-kubernetes-operator/templates/networkpolicy.yaml#5 [18:21:29] so, we aren't debugging the operator, but the app? [18:21:36] right. operator is fine. [18:22:04] the flink containers need to talk to k8s too. 
the JobManager speaks k8s and creates TaskManager pods [18:22:41] (perhaps there is a cleaner way to allow flink operator to talk to k8s api, but for now that is working) [18:23:01] ok, I can see in the logs that it is indeed proceeding ok [18:23:45] full reconcilliation messages, etc and JobManager having been deployed and all [18:23:46] yup, failure in question is from flink-apps JobManager pod. e.g. in dse-k8s-eqiad cluster and stream-enrichment-poc namespace [18:23:48] kubectl -n stream-enrichment-poc logs -f flink-app-main-794b6767c7-lnkw9 [18:24:01] the app also have a JobManager ? [18:24:33] The operator starts the jobmanager (part of flink-app) the jobmanager starts taskmanagers [18:24:35] so yes. [18:24:50] the operator does not have JobManager, it is just managing the FlinkDeployment CRD. [18:25:08] bumped the flink version in the charts here https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/881458 [18:25:26] (thanks inflatador ) [18:25:32] ok, flink was always pretty confusing to me, now we added k8s complexity in there, it's going to take me a bit [18:25:54] ok inflatador you'll need to pull down the changes to the helm chart from upstream too. [18:26:08] so, the issue is that flink-app-main-794b6767c7-lnkw9 can't talk to the k8s API [18:26:12] akosiaris: correct. [18:26:15] ottomata cool, will work on that [18:26:55] joe says that it should be using the kubernetes Service clusterIP to do that. [18:28:56] it can use either the kubernetes service cluster ip (which is 10.67.32.1, with an internal cluster DNS name of kubernetes.default.svc.cluster.local) or talk straight to the LVS endpoint [18:29:31] I see a netpol named flink-pod-k8s-api in the namespace stream-enrichment-poc that should allow talking to the 2 dse-k8s-ctrl nodes [18:29:52] let me try something [18:30:06] yes, we janis and I added that here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/879618/ [18:30:26] try now? 
[18:30:28] the operator helm chart actually installed that netpol for this namespace [18:30:56] hm, how to try...i guess i will delete the pod.... [18:31:02] i'm also testing things with nsenter [18:31:07] which addy do you want me to try? [18:33:41] akosiaris: sudo nsenter -t 1824562 telnet 10.67.32.1 443 (container network namespace) works now, but it did not before! [18:33:56] ;-) [18:33:58] but, flink-app failed in same way [18:35:04] what is the expected value of KUBERNETES_SERVICE_HOST in this container? [18:35:10] kubernetes.default.svc.cluster.local ? [18:35:24] because [18:35:27] in a meeting again, back in 30m [18:35:31] [@dse-k8s-worker1006:/home/otto] $ sudo nsenter -t 1824562 telnet kubernetes.default.svc.cluster.local 443 [18:35:32] telnet: could not resolve kubernetes.default.svc.cluster.local/443: Name or service not known [18:35:42] k [18:36:06] hm, wait, no joe said that is trying to use the host node's resolv.conf [18:36:07] hm [18:36:28] btw, what did you change? [18:37:47] am wondering if i set KUBERNETES_SERVICE_HOST in the container env to 10.67.32.1 if it would work [18:54:46] serviceops, MW-on-K8s, Patch-For-Review: mw-debug uses production images instead of debug - https://phabricator.wikimedia.org/T326542 (Clement_Goubert) In progress→Resolved ` cgoubert@deploy1002:~$ kubectl describe pods mediawiki-pinkunicorn-756747d6d9-8zqdz | grep multiversion-debug Imag... [19:04:11] akosiaris: o/ again :) [19:04:45] i think it's late for you so no worries, just lemme know and i'll context switch to something else until tomorrow [19:05:37] ottomata: I kubectl edited flink-pod-k8s-api network policy and added the IP [19:05:44] sorry context switching a lot right now [19:06:37] ah, k, so that makes sense, but now i'm wondering if flink is actually trying to talk to that IP at all? it still fails. [19:11:02] akosiaris: what is the expected value of KUBERNETES_SERVICE_HOST in this container? 
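Editor's sketch of the kind of egress rule implied by the 19:05:37 message ("added the IP" to `flink-pod-k8s-api`). The real policy is not shown in this log, so everything except the policy name, the namespace and the `10.67.32.1:443` endpoint is an assumption: in particular the empty `podSelector` (which selects every pod in the namespace) and the single-port rule are hypothetical. The code builds a standard `networking.k8s.io/v1` NetworkPolicy as a dict and prints it as JSON, which `kubectl` accepts as well as YAML.

```python
import json


def api_egress_policy(name: str, namespace: str, api_cidrs: list[str], port: int) -> dict:
    """Build a NetworkPolicy manifest allowing TCP egress to the given CIDRs.

    Hypothetical shape; the chart's actual template may differ.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumption: an empty selector applies to all pods in the namespace.
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [
                {
                    "to": [{"ipBlock": {"cidr": cidr}} for cidr in api_cidrs],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    policy = api_egress_policy(
        "flink-pod-k8s-api", "stream-enrichment-poc", ["10.67.32.1/32"], 443
    )
    print(json.dumps(policy, indent=2))
```

Hardcoding the ClusterIP like this is exactly what the 18:16 question about selectors hopes to avoid, so treat it as a debugging aid rather than a proposed final rule.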
[19:11:30] is it possible something is up with dns? and kubernetes.default.svc.cluster.local is not resolvable to 10.67.32.1 ? [19:12:56] serviceops, SRE, cloud-services-team: hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (fnegri) [19:20:19] ottomata: IP, by default KUBERNETES_SERVICE_HOST equals the IP [19:20:35] in this container, if you are overriding it, whatever you override it to [19:21:24] i stopped overriding it [19:21:25] but, I have to go. If you can document in a phab paste what the issue you are seeing is (how to reproduce), I'll take a look tomorrow morning [19:21:36] so it should equal the k8s Service IP. hm [19:21:39] okay [19:21:41] thanks akosiaris [19:31:22] serviceops, Cloud-VPS, cloud-services-team: Get Service Operations team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273740 (fnegri) [19:58:05] serviceops, Data-Engineering-Planning, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) @akosiaris manually edited the flink-pod-k8s-api NetworkPolicy we added to... [19:58:16] serviceops, Data-Engineering-Planning, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 07), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) @akosiaris to reproduce, just delete the flink-app-main pod in stream-enric... [20:43:14] OK, updated the helm chart from upstream, LMK if I need to make further adjustments [20:43:16] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/881458
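Editor's sketch, tied to the 19:20 answer that `KUBERNETES_SERVICE_HOST` defaults to the Service IP. Kubernetes injects `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` into containers, and in-cluster clients conventionally build the API URL from them. The helper below is a hypothetical illustration of that convention, not a description of what Flink's client does; the fallback to port 443 when the port variable is missing is an assumption of this sketch.

```python
import os
from typing import Mapping, Optional


def in_cluster_api_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Derive the API server URL from the injected service variables."""
    env = os.environ if env is None else env
    host = env["KUBERNETES_SERVICE_HOST"]
    # Assumption: fall back to 443 if the port variable is absent.
    port = env.get("KUBERNETES_SERVICE_PORT", "443")
    if ":" in host:
        # Bracket IPv6 literals so the URL stays valid.
        host = f"[{host}]"
    return f"https://{host}:{port}"


if __name__ == "__main__":
    # Values quoted in the conversation for the dse-k8s cluster.
    print(in_cluster_api_url({
        "KUBERNETES_SERVICE_HOST": "10.67.32.1",
        "KUBERNETES_SERVICE_PORT": "443",
    }))
```

With the override removed, a client following this convention would target the ClusterIP directly and would not depend on resolving `kubernetes.default.svc.cluster.local`.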