[00:03:33] 10serviceops, 10SRE, 10ops-codfw: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10Dzahn) @akosiaris @Papaul thanks! ACK, we are done here then :))
[08:46:45] hi folks, I have a generic question about network policies - the ml-serve cluster is basically ready on this front, except for one nit that I'd like to discuss with you. The Knative activator pod (privileged, sometimes acts as a queue of requests to allow pods to scale up, or as a load balancer for low-traffic pods)
[08:47:25] is contacted only, in theory, by istio pods. I tried to add a policy to reflect that, but then I realized that the configured kubelet probes were blocked
[08:48:06] so the activator pod stopped working, until I 'relaxed' the network policy to allow traffic only to a certain port but without any source pod restriction
[08:48:45] (the kubelet probes seem to come from the loopback IP address on worker nodes configured for the lvs endpoint)
[08:49:14] I am wondering if it is ok to leave network policies as they are now, or if there is a security risk and hence more work is needed
[08:56:34] (need to double check about the privileged bit, I think that the psp policies allow restricted)
[08:57:19] (yep)
[09:00:39] elukey: you know why that thing needs to run privileged? Does it run in hostNetwork as well?
[09:02:30] nono sorry it runs in restricted, I keep confusing the names (see above)
[09:03:25] (privileged is only for kube-system in ml-serve.yaml)
[09:06:10] hm. I'm still puzzled about why this happens for this particular pod/healthcheck only
[09:11:22] well, I'm assuming you have other endpoints that are restricted in a similar way. Do you actually have any?
[09:11:45] what do you mean by endpoints?
[09:12:14] pods, sorry
[09:13:32] ah okok! At the moment I think that the activator is the only restricted pod that has this constraint, I can review the others but knative/kserve is mostly about controllers and webhooks (so contacted by the kube api, can contact the kube api)
[09:13:52] if these have a readiness probe, should they be affected?
[09:14:31] in theory yes
[09:14:34] mmmmm
[09:14:34] I still assume network policies have some automation to allow kubelet access to configured health check ports. But maybe I'm wrong there and we have to explicitly allow that in network policies
[09:15:20] I will try to tcpdump another knative pod and see what IP it receives the kubelet probes from
[09:16:01] in the latter case I'm surprised we did not run into this earlier
[09:16:54] thanks for the brainbounce, will keep comparing pods to see differences, maybe I'll find something useful
[09:18:16] at some point I'll probably do a presentation of ml-serve to serviceops so people can swear in different languages and tell us how much they love our stack
[09:22:23] <_joe_> jayme: yeah no, if you firewall off the port for the healthcheck, the readinessprobe will fail
[09:22:36] indeed...it seems like it
[09:22:40] <_joe_> and yes, we did run into this earlier
[09:22:50] <_joe_> people just fixed it without walls of text in this channel :P
[09:22:58] ah, great. Then I just missed that
[09:23:54] but how would you fix that if it's not totally clear what the source of the health check requests is?
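For illustration, here is a minimal sketch of the kind of per-pod policy being discussed above and in the replies that follow: the activator's main port restricted to istio traffic, plus a second rule that leaves the health-check port open with no source restriction so kubelet probes are not dropped. The port numbers, namespace label and pod label are assumptions for the example, not the actual ml-serve configuration.

```yaml
# Hypothetical sketch only, not the real ml-serve policy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: activator-allow-istio-and-probes
  namespace: knative-serving
spec:
  podSelector:
    matchLabels:
      app: activator             # assumed pod label
  policyTypes:
    - Ingress
  ingress:
    # Main serving port: only reachable from istio pods.
    - from:
        - namespaceSelector:
            matchLabels:
              name: istio-system # assumed namespace label
      ports:
        - protocol: TCP
          port: 8012             # assumed serving port
    # Probe port: no "from" clause, so kubelet readiness/liveness probes
    # are accepted no matter which node/source IP they arrive from.
    - ports:
        - protocol: TCP
          port: 8080             # assumed health-check port
```

This mirrors the workaround described at 08:48:06: one port open to any source, everything else still restricted to the intended callers.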
[09:24:24] in the ml case, IIRC, it's the LVS IP of the node
[09:26:21] <_joe_> jayme: you just need to open that port to all intra-k8s traffic
[09:29:29] _joe_: I don't see how that could be done when the source IP is "one of the IPs the node has"
[09:31:25] <_joe_> uhh wait a sec
[09:31:39] <_joe_> we had this problem for readinessProbes going to non-open ports
[09:31:51] <_joe_> in this case what's calling the service?
[09:32:26] in the activator use case, only istio, but from tcpdumps readiness probes come from the LVS endpoint IP
[09:32:50] I am checking other knative pods, some have a similar restriction and a readiness probe, but they don't fail
[09:32:56] going to run some tcpdumps
[09:33:11] either I didn't apply the policies there or something else is happening
[09:38:36] Host: 10.64.78.128:8080
[09:38:36] User-Agent: kube-probe/1.16
[09:38:36] k-kubelet-probe: autoscaler
[09:38:51] 10-64-78-128.autoscaler.knative-serving.svc.cluster.local
[09:39:00] this is different for example
[09:40:13] now I am going to tcpdump the activator pod again, I deleted all pods last week to apply the docker log settings and to test policies
[09:58:43] so, now the problem is not there anymore
[09:58:56] I see kubelet probes coming from a .svc.cluster.local address
[09:59:28] what I changed recently was: 1) Global network policies applied 2) complete rolling restart of all pods in the cluster
[09:59:47] \o it's been a while since I deployed the linkrecommendation service. I'm starting with staging, but getting "Failed to connect to staging.svc.eqiad.wmnet port 4005: Connection timed out" when I run the usual commands to verify the staging deploy.
[10:00:07] (see https://wikitech.wikimedia.org/wiki/Add_Link#staging for the "usual commands")
[10:01:53] elukey: hmm...interesting. So what gets enabled via global network policy is (among other things) the allow-pod-to-pod rule.
[10:02:58] jayme: I added that part before, the last bit I was referring to above was the default-deny bits
[10:03:02] is it some random .svc.cluster.local IP or always the one of the pod that is being checked?
[10:03:17] kostajh: I'll have a look
[10:03:23] it is 10-64-79-243.activator-service.knative-serving.svc.cluster.local.
[10:03:28] consistent
[10:03:46] going to try to re-add the stricter network policies
[10:09:42] jayme: thanks
[10:09:52] it's not urgent
[10:12:22] ok
[10:19:34] (added policies, now it works)
[10:21:50] elukey: source address selection for a packet on a multi-IP-address/interface host (such as the kubernetes boxes) in linux can be a pain
[10:22:03] http://linux-ip.net/html/routing-saddr-selection.html
[10:22:23] The application can request a particular IP [20], the kernel will use the src hint from the chosen route path [21], or, lacking this hint, the kernel will choose the first address configured on the interface which falls in the same network as the destination address or the nexthop router.
[10:22:59] will read the docs thanks
[10:23:06] still very weird that now everything works
[10:23:08] IIRC the kubelet does not request a particular IP (but will need to revisit that), I don't think we have an src hint on those paths
[10:23:21] lemme verify that
[10:23:57] but you are experiencing an old problem (the fact that you see LVS IPs being the originators of the kubelet checks), it's not something specific to kubernetes though
[10:25:26] ack thanks!
[10:30:21] akosiaris: can I bug you with some calico? :D
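As a rough sketch of the two cluster-wide pieces mentioned earlier in this conversation (the allow-pod-to-pod rule plus the default-deny bits), the shape is roughly the following, in calicoctl (projectcalico.org/v3) format. The names, order values and pod CIDR are illustrative assumptions, not the actual Wikimedia Calico policies.

```yaml
# Hypothetical sketch: cluster-wide pod-to-pod allow followed by a default deny.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: allow-pod-to-pod
spec:
  order: 10                  # evaluated before the deny below
  selector: all()
  types:
    - Ingress
  ingress:
    - action: Allow
      source:
        nets:
          - 10.64.75.0/24    # assumed pod CIDR (staging-sized example)
---
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny
spec:
  order: 1000                # last resort: drop anything not allowed above
  selector: all()
  types:
    - Ingress
  ingress:
    - action: Deny
```

A probe arriving with a node or LVS source address would not match the pod-CIDR allow rule, which fits the behaviour observed before the rolling restart.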
[10:31:23] yup
[10:31:37] it will be a refreshing break from writing docs
[10:32:56] I'm looking at the issue kosta.jh has with linkrecommendation. It turns out that the pod IP (10.64.75.0) has been blackholed only on kubestage1001 ("blackhole 10.64.75.0/26 proto bird") - you know why it would do so?
[10:34:30] the other nodes do blackhole some pos ips as well
[10:34:36] *pod ips
[10:40:02] it's not the pod IP that is blackholed
[10:40:06] it's the entire /26 prefix
[10:40:48] https://github.com/projectcalico/calico/issues/3246 ... maybe
[10:40:51] the reason for the blackholing rule is so that the node will say to the upstream BGP router "Here, I am responsible and authoritative for this prefix"
[10:40:52] yeah, indeed
[10:41:16] and then it relies on more specific routes (/32) for each individual pod
[10:41:52] in that way, if there is a pod with say 10.64.70.199 as an IP address you can send traffic to it (the more specific route wins)
[10:41:53] but in that case, every node should blackhole its prefix, no?
[10:42:19] but if you try to say send to 10.64.70.200 and no such pod exists while a node has a rule like blackhole 10.64.70.192/26 proto bird
[10:42:29] in that case the traffic will go to the node and from there to /dev/null
[10:42:37] aka blackholed
[10:42:50] yes, every node blackholes the prefixes it is responsible for
[10:44:03] the new ones did not (kubestage100[34])
[10:44:19] they haven't had a prefix assigned yet, that's why
[10:44:24] start a pod and they will
[10:44:52] 1003 has the linkrecommendation pod (10.64.75.0) running
[10:45:15] which is reachable from every node apart from 1001 (because of the blackhole)
[10:45:33] btw, those are all valid questions. Should we document them in a Kubernetes/calico wikitech page?
[10:45:37] oh wait
[10:45:43] yes, will do
[10:45:51] what's the size of the kube pod ip space in staging ?
[10:46:02] (just because I def will forget again :D)
[10:46:09] ah *bulb*
[10:46:30] 10.64.75.0/24
[10:46:49] ah. That does not explain it then. It should be able to serve up to 4 nodes fine
[10:49:27] but still... 1001 and 1002 do have two prefixes blackholed each
[10:50:52] and that's the issue
[10:51:06] kubectl describe blockaffinities.crd.projectcalico.org
[10:51:20] after kube_env admin staging
[10:51:35] kubestage1003 and kubestage1004 are without a block
[10:51:39] yeah
[10:51:47] while kubestage1001 and 1002 have gobbled up all the prefixes
[10:51:48] sigh
[10:51:50] and they steal from the others then
[10:52:09] it's a first-come-first-served situation
[10:52:14] the bad thing is that we did not catch it
[10:52:19] and it's going to happen again too
[10:53:38] the fix is probably going to be easy
[10:53:43] the assignment of blocks is first come first served you mean?
[10:53:51] yes
[10:54:01] a node gets the first /26
[10:54:06] and then can ask for more if needed
[10:54:20] I am guessing kubestage1001 at some point was running more than 64 pods ?
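For reference, the block arithmetic behind the /26 blackholes above comes from the IPPool's blockSize: a /24 pool split into /26 blocks yields only four blocks, handed out to nodes first come, first served. A sketch, using the numbers mentioned in this conversation rather than the real staging config:

```yaml
# Illustrative only: a staging-style pool, /24 carved into /26 blocks (4 blocks total).
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: staging-pod-pool       # assumed name
spec:
  cidr: 10.64.75.0/24          # pod IP space mentioned above
  blockSize: 26                # each claimed block covers 64 addresses
  ipipMode: Never              # assumed; the conversation implies plain BGP routing, no overlay
  natOutgoing: false
```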
[10:54:28] and similarly kubestage1002 at some point
[10:54:45] I hope so... if not it would be super weird
[10:55:15] an easy fix is probably to drain an old node and then delete it with kubectl delete node
[10:55:31] that's if we don't want to assign one more IP prefix to the staging cluster
[10:56:00] that's the other fix, a little bit more involved IIRC as we will also need to update the routers to accept the new prefix
[10:56:07] yeah...but let's take a step back first
[10:56:13] sure
[10:57:12] there are two problems here: 1) a node claimed more than one prefix (not bad in general, but not desired in our case, I guess)
[10:57:47] 2) the node without a prefix was able to launch a pod and assign an IP it "borrowed" from the prefix of another node
[10:58:13] IMHO we should not allow 2) to happen
[10:58:37] 1) isn't really a problem IMHO. It's how the system was designed. Give it pools and nodes end up using them.
[10:58:44] we can kinda force it to not happen
[10:58:59] we only allow something like 64 pods to run on a node and call it a day
[10:59:06] the current limit is 110 anyway, IIRC?
[10:59:33] but I'd like to add a 3) We never got notified by our alerting that we ran out of pools
[11:00:14] As for 2) you are absolutely right. That one is bad and can bite.
[11:00:31] maybe the CNI should not even be assigning an IP to start with?
[11:00:48] yes
[11:01:24] there is a StrictAffinity config option for calico ipam
[11:02:36] ah that's new, I wasn't aware
[11:02:42] yeah, let's experiment with that ?
[11:04:07] yes. I'll put all this into a task first
[11:06:45] regarding 3): did you expect calico/alerting to tell us? Or are you saying we should implement something?
[11:07:21] the latter. I did not expect it, but if it has something, that would be beautiful.
[11:07:28] ack, agreed
[11:11:49] 110 pods per node is correct btw
[11:14:12] https://docs.projectcalico.org/archive/v3.16/reference/resources/ippool#block-sizes "If there are no more blocks available then the host can take addresses from blocks allocated to other hosts. Specific routes are added for the borrowed addresses which has an impact on route table size."
[11:15:52] AIUI that means that a 10.64.75.0/32 route should have been added to kubestage1001
[11:17:09] but that maybe only works with node-to-node mesh enabled
[11:18:01] could be. The destination might very well not be in the same subnet as the borrowed-from host anyway
[11:18:24] in which case it just would work if there was a specific route
[11:18:31] wouldn't*
[11:19:44] if it were from a completely different IPPool or what do you mean?
[11:19:59] let me paint an example
[11:20:14] kubestage1001 has 10.64.75.0/26
[11:20:30] kubestage1003 gets a new pod and "borrows" 10.64.75.1
[11:20:54] in this case, if the 2 nodes are on the same subnet kubestage1001 can get a new specific route that looks like
[11:21:23] 10.64.75.1/32 dev eth0 via kubestage1003_IP
[11:21:52] that would work okish. It's suboptimal, but it would work, cause the 2 nodes are on the same subnet
[11:22:23] but if they are not on the same subnet, the next hop (the argument of "via") can't be kubestage1003_IP
[11:22:37] cause being on the same subnet is a requirement
[11:23:04] so it would take extra work for calico to set it
[11:24:44] okay, now I got you. Thanks for painting :)
[11:27:50] still wonder why it's the default in calico and they expect it to work...
[11:44:45] I really love how all of the IPAMConfig is undocumented...
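A sketch of the IPAM setting being considered above: strictAffinity stops a node from borrowing addresses out of another node's block. In practice this is usually toggled with `calicoctl ipam configure --strictaffinity=true`; the resource form below is shown only to illustrate the shape of the config, and the exact schema may differ between calico versions.

```yaml
# Hypothetical sketch of the IPAM config object (crd.projectcalico.org/v1 in KDD mode);
# normally written via "calicoctl ipam configure" rather than applied by hand.
apiVersion: crd.projectcalico.org/v1
kind: IPAMConfig
metadata:
  name: default                # calico expects a single object named "default"
spec:
  strictAffinity: true         # never borrow IPs from another node's block
  autoAllocateBlocks: true     # still auto-claim a free block when one is available
```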
[11:45:21] code says there is also a maxBlocksPerHost attribute that we could set to 1
[11:46:04] but that's not really ideal, in the same way it's not ideal to restrict the pods to 64 per node
[12:33:48] 10serviceops, 10SRE, 10SRE-Access-Requests, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Jelto) All team members should have access now and should be able to execute the needed commands. I'm closing this task. Feel free to...
[13:44:13] kostajh: sorry for that taking so long. Your tests should work now as expected
[13:56:57] jayme: thanks, will try now
[14:00:03] jayme: on to the next error (if you have other things to do, we can leave this for another time)
[14:00:18] shoot :)
[14:00:20] "Error: open /etc/helm/cache/archive/linkrecommendation-0.1.13.tgz: permission denied". I get this with "helmfile -e eqiad -i apply"
[14:00:34] also "Error: plugin "diff" exited with error"
[14:00:46] a puppet run should fix this one in theory
[14:01:00] (kicking off a puppet run)
[14:01:18] <3
[14:01:48] also, is it supposed to be using helm3? I might be confused by some emails I've seen lately about helm3 migration
[14:02:10] kostajh: can you retry?
[14:02:26] looks good now
[14:02:27] thanks!
[14:04:25] kostajh: for the helm question, IIUC tomorrow the codfw cluster will be depooled and all services will be deployed via helm 3, then I suppose all the eqiad ones. From your point of view it shouldn't change anything, you'll notice it in the helmfile output
[14:04:33] kostajh: the staging deploy already used helm3 and it will do so for prod, without you doing anything, once the migration is completed
[14:04:57] what he said :p
[14:10:01] thanks both
[14:10:09] eqiad/codfw deploys seem fine
[14:13:48] <_joe_> kostajh: you can tell if it's using helm2 or helm3 based on the helmfile apply output
[14:13:55] <_joe_> helm3 doesn't list resources
[14:41:47] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (10JMeybohm) p:05Triage→03High
[14:59:19] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (10JMeybohm) a:03JMeybohm
[15:33:05] <_joe_> jayme: given you're the team's official kibana expert, can I ask for your assistance in creating a new service dashboard?
[15:33:26] how the f* did I become that?! :-o
[15:33:42] <_joe_> well, let's put it this way
[15:33:48] <_joe_> you're definitely better than me
[15:33:59] <_joe_> alex has lost his mojo
[15:34:30] <_joe_> so, TZ-wise, your only remaining adversaries are jelto and arnoldokoth for the title :)
[15:34:36] I bet ef.fi is great with kibana :P
[15:34:46] <_joe_> I decided the pain should be inflicted on the most senior person first
[15:34:55] <_joe_> yeah but also 502 atm
[15:34:58] * jayme still pointing to effie
[15:35:05] fair point :)
[15:35:20] damn...now I highlighted her as well *runs*
[15:35:39] so - how can I be of help? :)
[15:36:44] <_joe_> ahah jokes aside, have you ever created a dashboard for an application on k8s?
[15:37:03] <_joe_> I understand how to query kibana (-ish)
[15:37:11] * elukey adds Janis as Kibana expert into his own work facts
[15:37:18] <_joe_> but dashboard creation seems like a dark art
[15:37:30] I did more or less build the ones we have for calico and k8s events :/
[15:37:37] it *is
[15:37:41] dark art
[15:37:50] <_joe_> elukey: well done
[15:38:01] and it takes a huge amount of time (mostly waiting, tbh)
[15:38:16] <_joe_> with logstash involved, it always involves waiting
[15:38:19] and swearing - a lot
[15:38:24] <_joe_> that too
[15:38:41] candles and the blood of someone's first born might help as well...
[15:39:05] I might have some left...so what exactly are you trying to do?
[15:42:26] that's a clear 408 o/
[15:42:28] <_joe_> I want a dashboard with data derived from the access logs of apple-search
[15:42:52] <_joe_> sorry I was replying to another channel :D
[15:43:53] let me quickly commit something to review for you and I'll take a look :p
[15:45:27] <_joe_> uh I just realized logs are still not structured, meh
[15:47:57] well, that's unfortunate
[15:49:38] effie is on leave, but thank you for considering me :p
[15:50:03] hi, sorry, bye o/ :)
[15:50:23] hahahaha
[15:51:25] _joe_: there is this "App Logs (Kubernetes)" dashboard which might be kind of a starting point. But tbh I always felt like manual discovery is more productive
[17:12:12] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (10JMeybohm)
[23:50:55] 10serviceops, 10MW-on-K8s, 10SRE, 10Shellbox, and 3 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling)