[12:46:42] does anyone have an idea as to why a deployment running in kube-system would suddenly need a calico network policy to be able to reach out to coredns? https://phabricator.wikimedia.org/T381264#10371120 Thanks
[12:48:42] both our ceph csi provisioners (rbd and cephfs) have lost access to Ceph and coredns, causing a wide array of issues over in dse-k8s
[13:00:10] brouberol: "usually" the default-deny rule we have in place does not apply to kube-system
[13:00:33] so stuff in kube-system should not be subject to netpols in general
[13:00:59] how can I see whether this is indeed the case here?
[13:01:27] check the GlobalNetworkPolicy "default-deny"
[13:01:42] it should filter for projectcalico.org/name != kube-system
[13:02:40] indeed
[13:03:41] however, I found myself in need of adding a specific rule to re-enable the ceph-csi-cephfs applications to talk to coredns, and they run in kube-system
[13:03:55] what seems off is that dse does not have an allow rule for DNS over TCP
[13:04:17] helmfile.d/admin_ng/values/dse-k8s.yaml line 35
[13:04:26] that should not affect kube-system either ofc
[13:07:37] what also seems strange is that it was working "before" - so something has changed, right?
[13:08:58] that's my thinking as well
[13:09:19] from an external perspective, something has changed that caused the provisioner pods to both lose access to coredns _and_ ceph
[13:09:53] so we can't provision/delete any PV anymore
[13:10:38] and the provisioners have a config listing [cephosd1001:6789,cephosd1002:6789,...], and the resolution of cephosd100x to an IP broke as well
[13:11:08] we worked around that for a while by using hardcoded IPs instead of hostnames, but the provisioner can't reach out to Ceph now
[13:12:56] and we never had to define network policies to enable it to connect to Ceph in the past
[13:25:04] can we test which namespaces a global network policy applies to, with calicoctl or something else?
[13:28:12] not to my knowledge, no
[13:28:57] brouberol: did you restart the pods after creating the policy?
[13:29:19] (meta question: could it be that a particular node is the problem?)
[13:30:03] jayme: we have done both, but we've found that creating the policy was enough for the controller to get back to a working state
[13:30:14] as it retries things in its reconciliation loop
[13:34:06] yeah...I'm just asking because what you're seeing seems flaky, and network policies are usually not flaky ;)
[13:34:48] well, I wouldn't say things are flaky TBH
[13:35:24] something changed (whether self-induced without our knowledge, or external) that seemingly caused kube-system to be subject to netpols like any other ns
[13:35:29] in a very non-flaky way
[13:35:58] I understood that DNS worked until it did not, and then ceph connections worked until they did not
[13:36:16] I'd agree with brouberol - it has taken us a good while to pinpoint what's happening, but it's become clear that traffic in kube-system is now heavily restricted, whereas previously it wasn't.
[13:36:52] sidetrack question: does it make sense to deploy those things into their own namespace(s)?
[13:37:28] would probably not help here...but less chance for disaster when messing with policies
[13:37:58] Quite possibly, but a decision was made that, because the nodeplugin (daemonset) has to have root access, this should live in kube-system. I wasn't present at the SIG for that discussion, but went along with it.
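(For context, here is a minimal sketch of the kind of cluster-wide default-deny being described, with the kube-system exclusion mentioned at 13:01:42. The policy order, the use of a top-level namespaceSelector, and the exact label expression are assumptions, not the actual admin_ng configuration.)

```yaml
# Sketch only: a cluster-wide default deny that exempts kube-system via the
# namespace label Calico maintains automatically (projectcalico.org/name).
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny
spec:
  # High order value (assumed) so that explicit allow policies, which carry
  # a lower order, are evaluated first.
  order: 1000
  # Applies to workloads in every namespace except kube-system.
  namespaceSelector: projectcalico.org/name != "kube-system"
  types:
    - Ingress
    - Egress
  # No ingress/egress rules: traffic to or from selected endpoints is denied
  # unless some other policy explicitly allows it.
```

If that exclusion is in place, nothing in kube-system should even be evaluated against the deny, which is what makes the provisioners' sudden loss of DNS and Ceph access puzzling.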
[13:38:16] > DNS worked until it did not, and then ceph connections worked until they did not
[13:38:16] let me rephrase
[13:38:47] until what looks like today, we never had to define any kind of network policy to have our controllers running in kube-system be able to reach out to coredns or our ceph servers
[13:39:15] and today, we found the controllers in a state where they could do neither, and we had to hack in manual network policies to restore traffic
[13:40:39] so from our perspective, that wasn't flakiness. We started seeing hard and consistent failures
[13:40:43] okay. Did you try to run a debug pod in kube-system to check if that is able to connect to dns and ceph?
[13:43:31] nope! My debugging was within a python3 shell in one of the provisioner pods (part of the ceph-csi operators)
[13:44:02] do you have a specific image you usually use to do this?
[13:53:49] seems like a random debug pod in kube-system _does_ have access to coredns and ceph https://phabricator.wikimedia.org/T381264#10371356
[14:34:17] we've tried to completely remove and reinstall both operators, and we're still seeing the same issue. It's both reassuring and uncanny
[14:45:42] brouberol: https://docker-registry.wikimedia.org/wmfdebug/tags/
[14:45:48] the answer to your previous question
[14:45:54] for some definition of "usually"
[14:45:57] thanks!
[14:46:00] I haven't needed this image in a while.
[14:58:55] brouberol: do the operators install any network policies themselves by chance?
[14:59:12] I checked, but not that I can see
[14:59:20] or run privileged/in the host network ns?
[15:00:19] one component of the controller does run in the host ns, but not the affected one
[15:00:37] that's what has me scratching my head TBH
[15:03:51] I guess what we could do is redeploy the operators in their own NS. We would have to define an explicit netpol to be able to egress to ceph (TBH, the more explicitness the better)
[15:03:58] I just don't understand what happened
[15:04:48] yeah...that really is strange
[15:06:50] tell you what: things are stable for now, as we have these custom netpols in place
[15:07:10] might I bug you for a bit of pairing time during the week so we can reproduce and investigate?
[15:07:45] my daughter has given me _all_ the germs, so investigating firewall and YAML issues with a fever and a headache is... a challenge :D
[15:08:43] yeah, sure - as long as we do google meet only and not in person :p
[15:10:12] I'll fax you some germs
[15:10:17] thanks again!
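(For reference, a rough sketch of the kind of stop-gap allow policy mentioned at 13:39:15. The pod selector, the coredns label, the msgr2 port 3300 and the monitor CIDR are all assumptions to be replaced with the real labels and cephosd addresses.)

```yaml
# Sketch only: namespaced allow policy restoring DNS and Ceph egress for the
# ceph-csi provisioner pods in kube-system.
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-ceph-csi-egress
  namespace: kube-system
spec:
  # Hypothetical label; match whatever the provisioner pods actually carry.
  selector: app == "ceph-csi-cephfs"
  types:
    - Egress
  egress:
    # DNS to coredns, both UDP and TCP (large responses fall back to TCP).
    - action: Allow
      protocol: UDP
      destination:
        selector: k8s-app == "kube-dns"
        ports: [53]
    - action: Allow
      protocol: TCP
      destination:
        selector: k8s-app == "kube-dns"
        ports: [53]
    # Ceph monitors: msgr v1 on 6789, v2 on 3300. Placeholder CIDR; use the
    # real cephosd100x addresses.
    - action: Allow
      protocol: TCP
      destination:
        nets:
          - 10.0.0.0/24
        ports: [6789, 3300]
```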
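(And a throwaway debug pod along the lines of the test at 13:53:49, using the wmfdebug image linked at 14:45:42; the pod name and image tag are assumptions.)

```yaml
# Sketch of a one-off debug pod in kube-system; pick a real tag from the
# wmfdebug tags page linked above.
apiVersion: v1
kind: Pod
metadata:
  name: netdebug
  namespace: kube-system
spec:
  restartPolicy: Never
  containers:
    - name: debug
      image: docker-registry.wikimedia.org/wmfdebug:latest
      command: ["sleep", "3600"]
```

From there, `kubectl -n kube-system exec -it netdebug -- ...` can run the same DNS lookup and a TCP check against cephosd100x:6789 (assuming the image ships the usual resolver and netcat tooling), which is how the "random debug pod does have access" result above was reached.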