[08:20:21] hnowlan: re-image all done. And homer ran on the network devices so it should be fully operational. I merged the cookbook CRs, so next step is to try the rename, move and re-image in 1 go to save time. jayme prepared kubernetes2032 for that, let me know if it's ok to proceed [09:29:08] XioNoX: yep, looks okay to me. Rename stuff in puppet is all done [09:31:22] sounds good, should I walk you through (or someone from your team) after lunch to make sure it's really self service ? [09:33:24] that'd be great [10:41:10] XioNoX: wikikube-worker2001 is safe to repool etc right? [10:59:04] dear SREs, hugh and I will be adding new control planes to kubernetes in eqiad [11:26:22] stevemunene: mind if I merge your puppet change? [11:26:43] go right ahead, thanks :) [11:26:54] no problem [11:50:06] hnowlan: yep, all yours [11:51:48] nice, done [11:53:00] hnowlan: good to try the cookbooks now? [11:53:56] XioNoX: sure! where should I start? [11:56:18] hnowlan: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging but in step 5 also add --move-vlan [11:57:29] hnowlan: let me know where you're running it from, so I can `tail -f` the logs along [12:02:24] XioNoX: I'm on cumin1002, just merging the puppet change now [12:03:13] cool! [12:06:38] alright, proceeding with the rename cookbook [12:09:32] 👀 [12:12:42] so far this is *amazing*. which vlan should I be moving to with the reimage? [12:13:21] hnowlan: it's automatic, it should move from private1-b-codfw, to private1-b5-codfw [12:13:27] rename cookbook done, success! [12:13:32] awesome ! [12:19:23] got an error on the sre.hosts.move-vlan step unfortunately [12:19:40] XioNoX: want me to paste the full error? [12:19:41] ok, looking [12:19:44] sure [12:21:20] https://phabricator.wikimedia.org/P63430 [12:26:18] hnowlan: easy fix :) https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1036642 [12:27:05] lgtm [12:27:21] thx, waiting for CI [12:44:35] hnowlan: my internet is acting up but the patch is deployed [12:48:26] you can re-run it anytime [12:50:30] dealing with the k8s master incident first [12:59:51] api graphs are looking a lot better [12:59:55] https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&from=now-1h&to=now [13:00:24] still unclear as to what actually happened [13:01:55] I wonder if a surge in updates from adding the new host and a deploy happening on top of it crushed the old masters [13:03:28] I am tempted to say we're resolved [13:04:15] I am tempted to agree [13:04:31] yeah, nothing looks bad on my side (even if late to the party) [13:05:08] looking at logstash k8s events [13:07:14] OK to let deployers deploy? [13:07:23] I think so [13:07:23] yeah, I think so [13:08:05] ok, thanks [13:13:23] failed to sync configmap cache: timed out waiting for the condition [13:13:31] ok, that might be the proper hint [13:14:23] as in, many pods emitted things through the time window of 12:01 to 12:39 UTC. In 2 batches in fact, the former longer [13:14:47] total appears to be ~16k events? [13:15:22] no more, ~16k events in mw-api-int alone.
In all namespaces we were close to ~80k [13:16:12] we also have a still continuing sudden increase in something I don't understand yet: https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&from=1716891187269&to=1716901987269&viewPanel=33 [13:16:41] normally these are warnings and don't impact anything [13:17:35] maybe though (so my current theory) they ended up being unable to get their configmaps for a long time and ended up aborting the deployment [13:17:44] kamila_: I think that's normal, you can see an equivalent increase when we added k8s masters for codfw [13:17:59] what caused the surge in events? [13:18:07] oh, ok, thanks hnowlan [13:19:24] hnowlan: nothing in SAL lines up time wise... [13:22:17] akosiaris: ? [13:26:51] I see a lot of 409s in https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?forceLogin&from=now-30m&orgId=1&to=now [13:26:59] which is HTTP status code for conflict [13:27:33] they aren't unheard of btw [13:27:47] nor do I see a volume that's unlike previous events [13:27:52] https://docs.google.com/document/d/12QY-N1oXRwY4tPHO0fwrvf2osvZnr-2Vjfl_3pAOjE4/edit?usp=sharing [13:28:35] but API latencies are quite higher than usual [13:29:05] are we maxing out etcd again? [13:29:19] we are in the midst of adding new kubemasters [13:29:35] now we are in a state where we have 5 etcd servers [13:29:46] is the etcd cluster status healthy? [13:29:46] the new nodes are stacked, right? [13:30:01] sorry, 4 etcd hosts [13:30:30] apiserver is at 100% [13:30:40] kubemaster1002 that is, and CPU wise looking at others [13:30:55] https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&from=now-6h&to=now&viewPanel=28 [13:32:10] * kamila_ needs a few min to finish lunch, but is here, just eating instead of updating incident doc [13:32:49] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kubemaster1001&var-datasource=thanos&var-cluster=kubernetes [13:33:13] vary it to kubemaster1002 and it's a similar pic [13:33:18] albeit shifted in time [13:34:27] those are the old hosts right akosiaris ? [13:34:40] uh I did a fuckup, I failed to make a copy '^^ [13:35:02] of incident doc [13:35:19] still current and serving traffic, but yes, if you want to distinguish them from the ones added today, yes they are the old ones [13:36:08] actual incident doc https://docs.google.com/document/d/1vTSjrC7Pm4vRhv0tE6xjJGPo5SG-KPEytB2CpASzoVc/edit?usp=sharing , please switch to that one [13:36:11] akosiaris: ^ [13:36:43] akosiaris: can we force the k8s master over to one of the new hosts [13:36:56] they 've just been added, they aren't proven to work ok [13:37:01] in fact, quite the contrary [13:37:27] akosiaris: wikikube-ctrl1001 should be fully functional [13:37:40] we completed this one [13:37:49] well, the incidents indicate that maybe not ? [13:38:02] at least it's not unreasonable to call it suspect [13:38:06] i mean it looks like we're maxing out the CPU of one of the vms during a deploy [13:38:09] is that incorrect?
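The etcd-side check referenced in the next messages can be run read-only from any member; a minimal sketch, assuming the v2-style etcdctl CLI used elsewhere in this log and the etcd client port 2379 seen on these hosts:
    # overall cluster health, as reported by one of the long-standing etcd members
    sudo etcdctl --endpoints https://kubetcd1004.eqiad.wmnet:2379 cluster-health
    # membership list, to confirm the newly added wikikube-ctrl hosts joined with the expected URLs
    sudo etcdctl --endpoints https://kubetcd1004.eqiad.wmnet:2379 member list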
[13:38:30] yeah, but that's something that just happened after adding more kubemasters to the cluster [13:38:41] it wasn't happening a couple of days ago [13:38:54] akosiaris: we are halfway to finishing the second one [13:39:12] I would argue that, after rollback is done, we finish up number 2 [13:39:56] cluster-health on kubetcd1004 says cluster is unhealthy [13:40:05] member 3374c2901e8cf8a9 is unreachable: [https://wikikube-ctrl1001.eqiad.wmnet:2379] are all unreachable [13:40:23] I am gonna say this isn't a sign of the new masters working ok [13:40:30] not at all [13:41:07] kamila_: I am randomly pasting findings in the doc btw, just marking them with the time of the finding [13:41:14] I 'll iron them out later [13:41:23] but please, do ask about anything that doesn't make sense [13:41:29] thank you akosiaris , appreciated <3 [13:41:37] failed to check the health of member 8d6d40beec490cfc on https://wikikube-ctrl1002.eqiad.wmnet:2379: Get "https://wikikube-ctrl1002.eqiad.wmnet:2379/health": dial tcp [2620:0:861:102:10:64:16:93]:2379: i/o timeout [13:41:44] akosiaris: etcd is only listening on v4 [13:41:47] second node isn't well either, per kubetcd [13:41:49] LISTEN 0 1024 10.64.48.45:2379 0.0.0.0:* users:(("etcd",pid=1082374,fd=9)) [13:41:51] LISTEN 0 1024 10.64.48.45:2380 0.0.0.0:* users:(("etcd",pid=1082374,fd=8)) [13:42:17] cdanis: perhaps we need a puppet run + restart ? [13:42:28] I can do it [13:42:34] I am already on the host [13:42:37] no wait [13:42:43] it isn't even on the old host [13:43:03] ? [13:43:21] akosiaris: on kubetcd1004 etc [13:43:50] effie: ok, please put a global scap lock with a proper message while we figure this one out, no point in having deployers go through this [13:44:03] so at least let's save them the trouble [13:44:08] right away yes [13:45:25] akosiaris: the dial tcp error you posted, where is it from? [13:45:48] cdanis: etcdctl cluster-health command from kubetcd1004 [13:46:29] kamila_: scap lock is now in place per effie [13:47:03] cdanis: all 3 of them actually [13:47:05] akosiaris: we were discussing that, we added nodes to etcd using their hostnames [13:47:07] all old 3 of them [13:47:22] and by old I mean old etcd nodes, the dedicated ones [13:48:03] going back to what we already discussed, could it be possible that they are trying to reach the ipv6 addr of the wikikube-ctrl ones ? [13:48:12] while they are listening to v4 ? [13:48:30] akosiaris: I am going to change your times to UTC [13:48:31] In the doc [13:48:35] marostegui: thanks! [13:48:40] forgot about that completely [13:48:44] effie: that's how they are configured [13:49:29] thanks for the scap lock [13:49:49] ok ok [13:50:13] ok first things first, apparently firewall for etcd isn't open on wikikube-ctrl1001 [13:50:56] restart ferm?
(funny I just ran iptables -nL) [13:51:04] 2024-05-28T11:54:15.815634+00:00 wikikube-ctrl1001 puppet-agent[1077856]: (/Stage[main]/Ferm/Service[ferm]) Triggered 'refresh' from 7 events [13:51:10] !log run puppet and restart ferm [13:51:11] so it did get restarted 2 hours ago by puppet [13:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:17] !log run puppet and restart ferm on wikikube-ctrl1001 [13:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:39] I added an action item to review codfw as well and see why this has bitten us here and where codfw differed [13:52:15] so, there is no ferm rule for port 2379 on the new hosts [13:52:39] I suppose that's because apiserver talks directly to localhost, that's my recollection when discussing with Janis [13:52:47] May 28 11:51:31 wikikube-ctrl1001 puppet-agent[1077856]: (/Stage[main]/Profile::Kubernetes::Master/Ferm::Service[apiserver-https]/File[/etc/ferm/conf.d/10_apiserver-https]) Scheduling [13:52:50] refresh of Service[ferm] [13:53:02] but it is biting our ability to debug things so we may just have to add it [13:53:13] &SERVICE(tcp, 6443); [13:53:44] ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:6443 [13:54:07] is [13:54:12] are we not accepting v6 traffic to the api? [13:54:18] !log add manually ferm client rule on wikikube-ctrl1002 and disable puppet [13:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:48] member 8d6d40beec490cfc is healthy: got healthy result from https://wikikube-ctrl1002.eqiad.wmnet:2379 [13:54:56] doing same thing on wikikube-ctrl1001 [13:56:31] ok, now cluster-health reports that everything is fine at least. Those ferm rules might not have even been the issue, but they were interfering with cluster-health command [13:58:07] akosiaris: I think you're right and that was a red herring that was getting in the way of debugging [13:58:59] 💔cdanis@wikikube-ctrl2001.codfw.wmnet ~ 🕙☕ sudo etcdctl --endpoints https://wikikube-ctrl2001.codfw.wmnet:2379 cluster-health [13:59:01] member 12dfea4341e87158 is healthy: got healthy result from https://kubetcd2004.codfw.wmnet:2379 [13:59:03] member 2cab88b3bb2a9741 is healthy: got healthy result from https://wikikube-ctrl2001.codfw.wmnet:2379 [13:59:05] member 5d590352bc3189ab is healthy: got healthy result from https://kubetcd2006.codfw.wmnet:2379 [13:59:07] failed to check the health of member a12e748f437066e6 on https://wikikube-ctrl2002.codfw.wmnet:2379: Get "https://wikikube-ctrl2002.codfw.wmnet:2379/health": dial tcp [2620:0:860:103:10:192:32:105]:2379: i/o timeout [13:59:09] member a12e748f437066e6 is unreachable: [https://wikikube-ctrl2002.codfw.wmnet:2379] are all unreachable [13:59:11] member b6bd69df4873a5a8 is healthy: got healthy result from https://kubetcd2005.codfw.wmnet:2379 [13:59:13] failed to check the health of member d9d70e528a11de5e on https://wikikube-ctrl2003.codfw.wmnet:2379: Get "https://wikikube-ctrl2003.codfw.wmnet:2379/health": dial tcp [2620:0:860:106:10:192:5:10]:2379: i/o timeout [13:59:15] member d9d70e528a11de5e is unreachable: [https://wikikube-ctrl2003.codfw.wmnet:2379] are all unreachable [13:59:17] cluster is degraded [13:59:19] etcdctl cluster-health on codfw reports similarly but it's working [13:59:21] so [14:00:18] I kinda think that the trigger here (aside from the deploy) was just the added load in needing to keep the new hosts synced as well [14:00:52] it's the only thing I've thought of that fits all the data [14:01:41]
yeah, adding nodes is expected to increase load, but I don't know by how much [14:01:46] !log remove wikikube-ctrl1002 from the rotation to test a theory [14:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:08] !log roll restart mw-api-int pods [14:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:32] ok, I am seeing kubemaster1002 at 100% CPU already [14:05:01] kubemaster1001 isn't really breaking a sweat though [14:05:30] akosiaris: I will sort out the codfw side of things [14:05:50] ping me when you are done so I can finish up the stacked masters work because we are right in the middle of it [14:06:03] done... I am just testing a theory [14:06:21] no problem, you reproduced the problem [14:06:44] kubemaster1002 still maxed on CPU lol [14:09:57] it's the apiserver maxing out but kubemaster1002 isn't even the lease holder for `kube-scheduler` or `kube-controller-manager` [14:10:54] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kubemaster1002&var-datasource=thanos&var-cluster=kubernetes&from=now-1h&to=now&viewPanel=31 [14:10:56] tcp retransmits??? [14:10:59] cdanis: isn't that weird? [14:11:09] !log restart kube-apiserver on kubemaster1002 [14:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:13] like 3k retransmits a second on kubemaster1002 [14:11:46] akosiaris: have you added back 1002 ? [14:11:48] that suggests some *other* firewall issue imo [14:12:17] the roll restart finished btw, but in the process I did lower weight of kubemaster1002 to 1 from 10 [14:12:30] can't say I witnessed load shifting though [14:13:01] hmmm [14:13:08] the retransmits are new [14:13:33] so kubemaster1002 is pointed to the 3 old etcd servers. Which ... should be fine ? [14:13:37] akosiaris: do we have reverse DNS somewhere for k8s pod IP addresses? [14:13:48] --etcd-servers=https://kubetcd1004.eqiad.wmnet:2379,https://kubetcd1005.eqiad.wmnet:2379,https://kubetcd1006.eqiad.wmnet:2379 [14:14:17] cdanis: no, we did have a placeholder at some point that said the equivalent of "thisisak8spod" [14:14:28] but nothing more than that [14:14:51] we have ofc delegated a zone to coredns but it has never been a priority [14:15:38] kubemaster1001 had TCP retransmits though the previous week btw, but amount in the 10s, not the 1000s [14:16:05] https://grafana.wikimedia.org/goto/l5RVJiySg?orgId=1 [14:16:07] it does look to be a new problem [14:16:20] https://grafana-rw.wikimedia.org/goto/l5RVJiySg?orgId=1 if that doesn't work [14:16:34] thousands of segments a second is definitely a problem [14:16:42] akosiaris: can you reproduce the issue again? [14:17:06] on kubemaster1002 I have running `sudo tshark -Y tcp.analysis.retransmission` [14:17:58] uhm, I just rebooted it to add another 4 vcpus [14:18:09] but I can try once it's up and running [14:18:22] akosiaris: the TCP retransmits seem to correlate afaict, would be useful to confirm [14:18:43] agreed [14:18:48] https://gerrit.wikimedia.org/r/1036680 akosiaris cdanis [14:19:29] !log add another 4 vcpus to kubemaster1002 [14:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:58] cdanis: host is up [14:21:12] wanna restart tshark and I try to reproduce? [14:21:31] akosiaris: shall I deploy the puppet change first ?
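The retransmission watch mentioned above can also be scoped to the ports involved here; a sketch, assuming eno1 as the capture interface (eno1 and ens13 both show up in later pastes) and the etcd/apiserver ports 2379 and 6443 from this log:
    # live view of TCP retransmissions involving etcd (2379) or the apiserver (6443)
    sudo tshark -i eno1 -f 'tcp port 2379 or tcp port 6443' -Y tcp.analysis.retransmission
    # alternatively, save a pcap for offline analysis instead of printing packets
    sudo tshark -i eno1 -f 'tcp port 2379 or tcp port 6443' -w /tmp/retransmit-debug.pcap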
[14:22:16] effie: go ahead, I think it's unrelated to trying to reproduce this [14:22:33] akosiaris: no argument, just to at least sort one thing at a time [14:22:58] go ahead [14:24:25] cdanis: btw, this is weird: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kubemaster1002&var-datasource=thanos&var-cluster=kubernetes&viewPanel=8 [14:24:37] the VM maxed out the veth [14:24:45] in all 3 occasions [14:24:52] akosiaris: I am enabling puppet on wikikube hosts [14:24:58] effie: thanks! [14:25:04] log it please [14:25:29] looking back 30 days, kubemaster1002 has never maxed out the nic [14:25:39] akosiaris: well I guess that explains the retransmits [14:25:41] cdanis: that might explain the retransmits btw [14:25:57] I can't wait until we have sflow data [14:26:10] same thing for kubemaster1001 earlier today [14:26:25] went up to 160MB/s per https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kubemaster1001&var-datasource=thanos&var-cluster=kubernetes&viewPanel=8&from=now-3h&to=now [14:26:40] which doesn't exactly compute on a 1Gbps link but whatever [14:26:54] akosiaris: kubemaster1001 was also sending a fair bit of traffic while kubemaster1002 was maxing out [14:27:02] I am gonna attribute the retransmits to maxing out the nic [14:27:08] might be cpu bound if it's a VM? would be weird but not unbelievable [14:27:21] yeah kubemaster1002 looked quite cpu bound [14:27:26] so that's not too surprising I guess [14:27:31] anyway, I 'll try and roll_restart mw-api-int again and see what happens with 8vcpus [14:27:33] the question is why ofc :D [14:27:57] ok, kubemaster1001 just got maxed out [14:28:11] cdanis: akosiaris what about I continue and fully add wikikube-ctrl1002 [14:28:12] I am gonna switch it to depooled in pybal [14:29:01] akosiaris: wdyt? [14:29:13] !log depool kubemaster1001, its CPU is saturated after a test roll restart [14:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:23] kubemaster1001 is un-saturating now lol [14:29:37] and kubemaster1002 is saturating now [14:29:38] kubemaster1002 just saturated akosiaris [14:29:46] even at 8 vcpus [14:29:53] what on earth is going on suddenly?
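For reference, the vCPU bumps applied to these VMs are ordinary Ganeti instance modifications; a sketch, assuming the commands are run on the Ganeti cluster's master node and that a full restart of the instance is acceptable:
    # raise the instance to 8 vCPUs; -B sets backend parameters such as vcpus and memory
    sudo gnt-instance modify -B vcpus=8 kubemaster1002.eqiad.wmnet
    # a full (stop/start) reboot so the new size actually takes effect
    sudo gnt-instance reboot --type=full kubemaster1002.eqiad.wmnet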
[14:30:10] 735840 129.174369713 10.64.32.215 → 10.64.32.116 TCP 725 [TCP Retransmission] 2379 → 41422 [PSH, ACK] Seq=32428618 Ack=6189478 Win=1642 Len=659 TSval=3628286519 TSecr=2711688260 [14:30:11] !log repool kubemaster1001, testing something [14:30:13] 735990 129.183597149 10.64.132.27 → 10.2.2.8 TCP 245 [TCP Retransmission] 52464 → 6443 [PSH, ACK] Seq=41565 Ack=722893 Win=1223680 Len=179 TSval=3696948241 TSecr=2712943532 [14:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:15] 736027 129.187059896 10.64.48.195 → 10.2.2.8 TCP 246 [TCP Retransmission] 42014 → 6443 [PSH, ACK] Seq=60442 Ack=744378 Win=1255424 Len=180 TSval=3841928 TSecr=1128001387 [14:30:17] 736210 129.202095669 10.64.16.232 → 10.2.2.8 TCP 101 [TCP Retransmission] 46140 → 6443 [PSH, ACK] Seq=51280 Ack=438328 Win=192000 Len=35 TSval=3489838887 TSecr=2879749280 [14:30:19] 736416 129.234338349 10.64.32.215 → 10.64.32.116 TCP 105 [TCP Retransmission] 2379 → 41534 [PSH, ACK] Seq=81478 Ack=171514 Win=786 Len=39 TSval=3628286579 TSecr=2711688339 [14:30:21] 736530 129.283270854 10.64.16.178 → 10.64.32.116 TCP 287 [TCP Retransmission] 2379 → 57118 [PSH, ACK] Seq=2354558 Ack=214121 Win=881 Len=221 TSval=558139462 TSecr=1485612713 [14:30:23] 736868 129.336803710 10.64.48.225 → 10.2.2.8 TCP 136 [TCP Retransmission] 54154 → 6443 [PSH, ACK] Seq=49145 Ack=757522 Win=1244160 Len=70 TSval=2714440602 TSecr=307014078 [14:30:36] so that looks like traffic to *other* k8s api servers and also from etcd that is being retransmitted [14:30:39] so kinda everything [14:31:23] adding kubemaster1001 to the pool once more and things are a bit more balanced [14:31:39] the roll restart errored out though [14:31:41] I don't have much context here but I wanted to share https://phabricator.wikimedia.org/P63443. can these be related in any way? these changes were done on netbox but not committed [14:31:52] Error: UPGRADE FAILED: release canary failed, and has been rolled back due to atomic being set: unexpected error when reading response body. Please retry. Original error: http2: client connection lost [14:31:58] sukhe: doubt it [14:32:00] sukhe: unrelated [14:32:02] ok [14:32:28] nic of kubemaster1001 maxed out again [14:32:34] sukhe: this is part of the host renaming work on codfw [14:32:44] thank you very much though [14:32:45] noted [14:33:15] so, even without the CPU thing, this is worrying [14:33:44] if we are maxing out nics, that's not good [14:35:25] kubemaster1002 is running on ganeti1010, one of the old nodes which only has a 1G interface, we could migrate it to one of the more recent nodes with 10G? [14:35:36] I 'll add another 4 vcpus to kubemaster1001 just to be consistent [14:36:09] moritzm: I suppose so, although we probably still want to explain why suddenly we went from 40MB/s to 120+MB/s [14:36:19] let me do both and see how we fare [14:37:25] !log repool kubemaster1001 with 8 vcpus for consistency with kubemaster1002. [14:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:32] !log reboot kubemaster1001 with 8 vcpus for consistency with kubemaster1002. [14:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:50] ack.
ganeti1037 is also in row C, has a 10G NIC and sufficient free RAM [14:37:58] I don't know if we are having retransmits because we're saturating the interface or we're saturating the interface with retransmits [14:38:20] moritzm: cool, I was about to ask, thanks for the host names :-) [14:39:05] or ganeti1028 would also work [14:39:31] cdanis: that ^ btw explains why kubemaster1001 was able to reach 160MB/s [14:39:41] it's on ganeti1030 and ganeti1030 has 10Gbps nic [14:40:01] hah [14:40:09] that's right, we did that for the ganeti machines at some point [14:40:14] effie: akosiaris: https://grafana-rw.wikimedia.org/goto/CJ8JxmsIg?orgId=1 [14:40:45] I think forthcoming refresh of Ganeti in eqiad/codfw next FY will replace the remaining legacy servers with 1G [14:40:54] interestingly there have been some similar network spikes in codfw as well? [14:41:10] https://grafana-rw.wikimedia.org/goto/O1_-bisIR?orgId=1 [14:42:10] so I guess the codfw spikes were deploys https://grafana-rw.wikimedia.org/goto/Nx1LxmySg?orgId=1 [14:44:31] !log gnt-instance replace-disks for kubemaster1002, set ganeti1037 as a secondary [14:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:46] ~10m for this one [14:45:33] still, all we are doing is bumping limits up to now [14:45:54] not explaining what has happened and why suddenly we needed 2.5 times more bandwidth [14:45:56] I still don't know whether the retransmits and saturation are causes or effects [14:45:58] or CPU [14:46:08] I think we're just saturating on everything and we were close to saturating before [14:46:18] and adding more hosts to the control plane means we're saturating earlier [14:46:20] akosiaris: still, should we just finish up adding 1002 ? [14:46:30] https://grafana-rw.wikimedia.org/goto/91LjxmsSR?orgId=1 [14:46:39] the high network utilization isn't new, it just usually was lower and/or shorter [14:46:40] they are proper machines and we won't have to play resource whack-a-mole [14:46:55] I think we should validate that the new hosts work ASAP and take the old hosts out of rotation [14:46:56] effie: I 'd rather we kept variables stable while we debug this [14:47:18] at least until we have raised the cap on the net saturation too and see if those 2 were the issues [14:47:32] I think that hypothesis makes sense [14:47:35] akosiaris: but what if we want to validate the hypothesis that having both the old hosts and the new hosts be part of the control plane is the triggering issue [14:47:39] effie: what's left regarding 1002 btw? [14:47:45] and 1003 for that matter? [14:47:54] I see 1003 isn't even an etcd node yet, is that correct? [14:48:11] akosiaris: bgp stuff and calling it a master node [14:48:20] 1003 has puppet disabled [14:48:36] effie: calling it a master node? [14:48:48] akosiaris: labeling it [14:48:59] kubemaster1001 is the one with the 1G NIC in hardware right? [14:49:02] can we just ... turn it off [14:49:06] +1 [14:49:10] cdanis: we can, but I am trying to understand what that validation would prove [14:49:18] cdanis: 1002 and I am already migrating it away [14:49:35] I need like 6m and it will be on a 10G host [14:50:02] Tue May 28 14:49:18 2024 - INFO: - device disk/0: 46.10% done, 6m 6s remaining (estimated) [14:50:39] my main worry is that we rush pushing forward to the new master, find out something's borked and will not have a way back fwiw [14:50:40] akosiaris: I am far from sure that the NIC is the limiting factor [14:51:18] kamila_: it wasn't earlier for sure.
Something has changed and suddenly it is [14:51:28] at least it's the trigger of those retransmits [14:51:34] cause kubemaster1001 doesn't have them [14:51:37] and it's on a 10G host [14:51:45] so at least we rule out 1 thing [14:51:54] "we added more load and hit the limit" seems like a plausible hypothesis to me [14:51:56] now what on earth is the trigger for all of this, I don't know [14:52:14] ok, let's see, thanks akosiaris [14:52:18] we are talking here x2.5 more traffic [14:52:48] and funnily enough, etcd shouldn't be playing a factor here (at least per configuration of apiserver) [14:53:15] cdanis: does your tshark have the IP addresses of wikikube-ctrl100{1,2} in them btw? [14:53:19] akosiaris: yes [14:53:22] especially for port 2379 [14:53:23] 2.5x more traffic and we added new hosts [14:53:23] ? [14:53:26] yes [14:53:59] btw akosiaris CPU saturation during rollouts was an issue before too https://grafana.wikimedia.org/goto/oPVFamsSR?orgId=1 [14:55:16] is someone free to take over keeping the incident doc up to date? my RSI is complaining :-/ [14:55:27] kamila_: yeah I'll take it [14:55:32] thank you <3 [14:55:36] cdanis: yes on kubemaster1002 talking to port 2379/tcp wikikube-ctrl100{1,2} ? [14:56:09] hm [14:56:10] akosiaris: no [14:56:16] 💙cdanis@kubemaster1002.eqiad.wmnet ~ 🕚☕ sudo ss -t | grep -i 2379 | awk '{print $4}' | cut -d: -f1 | sort | uniq -c [14:56:18] 247 10.64.32.116 [14:56:23] Tue May 28 14:55:18 2024 - INFO: Instance kubemaster1002.eqiad.wmnet's disks are in sync [14:56:26] it's all talking to itself (kubemaster1002) [14:56:30] ok moving over and testing again [14:56:58] !log migrate kubemaster1002 to ganeti1037 [14:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:19] Tue May 28 15:00:06 2024 * memory transfer progress: 101.92 % [15:00:23] ok, never seen that before [15:00:28] hah [15:00:36] Tue May 28 15:00:28 2024 * memory transfer progress: 113.26 % [15:00:38] oh that's nice [15:00:44] insert red october meme [15:01:20] I know what this is btw [15:01:30] or at least I think I do [15:02:20] kubemaster1001 just saturated [15:02:23] briefly [15:03:04] all I did was to restart kube-apiserver on kubemaster1002 fwiw [15:03:20] ok so just startup then probably [15:03:21] which adds credence to your theory that we were maxing out on both [15:03:54] VM migration done, I 'll start up the test shortly [15:05:04] akosiaris: uhhh page? [15:05:14] akosiaris: while it is not the problem obviously, we do have etcd nodes constantly talking first to v6 to each other [15:05:37] kubemaster1001 maxed out on CPU [15:05:39] hi, still reading scrollback [15:05:46] and is spewing about 2Gbps of network tx [15:05:58] cdanis: kubemaster1001 CPU saturating. at 8 vcpus [15:06:06] kubemaster1002 is faring ok-ish [15:06:47] akosiaris: wikikube-ctrl1001,2 are each receiving about 500Mbit/s or more [15:07:01] from who? [15:07:07] I'm guessing from kubemaster1001 [15:07:25] how? it's not even configured to use them [15:07:38] it would make sense if it was the other 3 kubetcds [15:07:45] and it was a misconfiguration issue [15:07:48] Acquire Time: 2024-05-28T12:38:18.055302Z [15:07:51] Holder Identity: wikikube-ctrl1001_68459a11-c7dd-407b-9041-02f1c9d6bacc [15:07:53] are you sure [15:08:00] the kube-scheduler lease is owned by wikikube-ctrl1001 [15:08:08] so is kube-controller-manager [15:08:22] ok, that one makes sense [15:08:34] but 1002 shouldn't receive traffic then [15:08:54] it's still part of the etcd cluster isn't it?
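The Acquire Time / Holder Identity output quoted above comes from the leader-election leases; a sketch of checking which control-plane host currently leads, assuming lease-based election with the upstream default object names in kube-system (older setups record this on Endpoints or ConfigMaps instead):
    # which hosts hold the scheduler and controller-manager leader leases
    kubectl -n kube-system get lease kube-scheduler kube-controller-manager
    # full detail, including holderIdentity and acquireTime as pasted above
    kubectl -n kube-system describe lease kube-scheduler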
[15:09:15] I mean receive traffic from kubemasters [15:09:33] but > it would make sense if it was the other 3 kubetcds [15:09:35] definitely applies [15:10:05] akosiaris: can you roll-restart again perhaps [15:10:36] yes [15:10:46] but at least I don't see retransmits this time around [15:10:50] so that's 1 issue solved [15:10:57] heh [15:11:01] so that was network saturation then [15:11:03] another N to go [15:11:05] NIC saturation, rather [15:11:15] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kubemaster1001&var-datasource=thanos&var-cluster=kubernetes&viewPanel=8&from=now-3h&to=now [15:11:29] yeah, kubemaster1001 reached 180MB/s this time around [15:11:32] yeah [15:11:47] close to 1.5Gbps, so at least we solve that weird one [15:11:55] okay, I think I'm caught up, hello [15:11:57] need any more hands? [15:11:59] triggering another rolling restart [15:12:08] good morning rzl [15:12:12] rzl: did you have coffee? [15:12:20] if no, go have one, we are on it [15:12:22] 692 kube /usr/bin/kube-apiserver ens13 65259.266 8493.538 KB/sec [15:12:43] drinking it as we speak <3 appreciate it though [15:12:43] PID USER PROGRAM DEV SENT RECEIVED [15:12:48] 1188264 kube /usr/bin/kube-apiserver eno1 36571.223 584.710 KB/sec [15:12:54] 1082374 etcd /usr/bin/etcd eno1 1226.272 3045.038 KB/sec [15:13:00] that's from wikikube-ctrl1001 [15:13:17] morning rzl o/ [15:14:11] \o [15:16:06] it's literally all of them https://grafana.wikimedia.org/goto/giw-BiyIR?orgId=1 [15:16:10] all of them have a huge TX spike when this happens [15:16:16] it's just hugest on kubemaster1001 [15:18:36] and it was mostly kube-apiserver [15:20:15] so, kubemaster1001 kinda saturated again, but not so badly as previously [15:20:20] cpu wise [15:20:25] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kubemaster1001&var-datasource=thanos&var-cluster=kubernetes&from=now-1h&to=now [15:20:36] I wanna try a full scap, anyone against? [15:21:34] sure [15:21:54] I am thinking [15:23:37] let me actually fix 1 small thing that is bugging me first [15:24:24] akosiaris: let's have a go and see, we may need a deployer around as there are changes that were not backported [15:24:33] give me a shout to release the lock [15:24:56] will do [15:25:49] effie: I'm around [15:25:58] Amir1: <3 [15:27:55] restarting kube-apiserver these days takes absurdly long [15:28:10] it's very noticeable in the CPU utilization [15:30:00] running 1 last test. My change is that I put in kube-apiservers the names of the other 2 etcds [15:30:10] akosiaris: lmk if there's anything I should be updating in the timeline :) [15:30:23] (oh, I should say in here too, I've grabbed IC from kamila_) [15:30:42] (that is a very good point, thanks :D) [15:31:27] akosiaris: ? [15:31:32] and done [15:31:51] do you mean there is something that we should have done [15:31:52] ? [15:32:36] no, my test is done [15:33:12] and now I am trying to make sense out of https://grafana.wikimedia.org/goto/giw-BiyIR?orgId=1 [15:33:32] interestingly, the pattern has shifted [15:33:50] not hugely but somewhat [15:34:33] it's more balanced now [15:35:36] akosiaris: I still suggest we finish up at least the 2nd master [15:38:30] host overview btw CPU graph panel has a resolution of 1/3 [15:38:34] it's ... misleading?
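One quick way to confirm which etcd endpoints each apiserver is actually configured with (the --etcd-servers list quoted earlier, and the change just described) is to read it off the running process; a minimal sketch:
    # print the etcd endpoint list of the running kube-apiserver
    ps -o args= -C kube-apiserver | tr ' ' '\n' | grep -- --etcd-servers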
[15:38:42] https://i.imgur.com/Kr4nEt1.png [15:38:51] this is from kubemaster1001 while it was saturating [15:39:00] TCP streams sorted by bytes of traffic [15:40:07] effie: unlock the global scap lock please [15:40:56] done [15:41:46] thanks [15:41:57] Amir1: any chance you wanna finish that backport? [15:42:07] sure [15:42:12] I am around for 1 more hour btw [15:45:57] hnowlan: I got this deploying via scap 15:45:00 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-28-154407-publish (ran as mwdeploy@kubernetes2032.codfw.wmnet) returned [255]: ssh: Could not resolve hostname kubernetes2032.codfw.wmnet: Name or service not known [15:46:28] hnowlan: ^ [15:47:22] where is that list being pulled from? [15:47:30] that rename is complete, shouldn't be anything hanging around [15:47:51] maybe it's still in puppetdb and thus dsh? [15:48:07] it's that hack we do to prepopulate images on nodes? [15:48:25] https://puppetboard.wikimedia.org/node/kubernetes2032.codfw.wmnet [15:48:29] puppet seems to be failing here [15:48:38] deploy1002:/etc/dsh/group$ grep kubernetes2032 kubernetes-workers [15:48:38] kubernetes2032.codfw.wmnet [15:48:39] yup [15:48:47] akosiaris: I am a bit concerned heh https://grafana.wikimedia.org/goto/rI2TPisIg?orgId=1 [15:49:14] ouch [15:49:27] spiking to 100 MByte/s across the whole daemonset [15:49:46] and this is one thing that has changed from last week to the next [15:50:56] indeed [15:51:56] sukhe: I think hnowlan is working on that host (with the rename cookbook), so probably semi-expected [15:52:04] RESOLVED: [4x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@eqiad in state failed - [15:52:11] ok, that's 1 good thing that happened [15:52:13] heh [15:52:41] so, the masters are nodes in our setup currently [15:52:51] so they run those opentelemetry daemonsets [15:53:33] sukhe: I've deactivated that node [15:54:00] thanks [15:54:16] (was just sharing in case it was missed) [15:55:14] Amir1: how is it going? [15:55:18] the newly reimaged wikikube-worker2002 is ready to go - any issues with me pooling/uncordoning it given the current events? [15:55:28] akosiaris: no issues on my side [15:55:34] 85% done [15:55:58] kubemaster1001 TX is still growing lol [15:57:33] akosiaris: any objections to me stopping otelcol in eqiad right now? [15:57:34] akosiaris: chris will undeploy the opentelemetry daemonset [15:57:51] cdanis: none on my side [15:58:25] it's not running btw on kubemaster1001 from what I see [15:58:29] akosiaris: hey [15:58:31] do you remember how [15:58:38] we had to hardcode the master IPs [15:58:42] in the otelcol values file [15:58:50] GUESS WHICH TWO ARE LISTED FOR EQIAD [15:58:57] ahahahaha [15:58:59] # eqiad k8s API endpoint IPs [15:59:01] - ipBlock: [15:59:03] cidr: 10.64.0.117/32 [15:59:05] - ipBlock: [15:59:07] cidr: 10.64.32.116/32 [15:59:09] the ONLY two [15:59:11] that it can talk to [15:59:24] hooooly moooly [15:59:27] ahahahaha [16:00:19] so, maybe that functionality talks to the k8s API too much? [16:00:32] btw, that explains the CPU exhaustion too! [16:00:35] it *shouldn't* [16:00:44] suddenly there are another 200 clients asking for stuff [16:00:55] yes [16:01:02] ok I've undeployed it [16:01:06] DELETED RELEASES: [16:01:09] NAME [16:01:11] main-opentelemetry-collector [16:01:13] ] [16:01:30] ofc I did that once things had already recovered [16:01:36] because I was gawking at that part of the values file [16:01:49] akosiaris: scap run or a rolling restart ? [16:01:58] ?
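The hard-coded API-endpoint ipBlocks quoted above end up in a NetworkPolicy, so they can also be audited from the cluster side; a sketch (the CIDR comes from the paste above; no namespace or object names are assumed):
    # find which NetworkPolicies pin the old kubemaster IPs as allowed egress destinations
    kubectl get networkpolicy --all-namespaces -o yaml | grep -B5 '10.64.32.116/32'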
[16:02:08] effie: Amir is doing a scap deploy [16:02:10] we want to trigger again and see what happens [16:02:12] I think it finished akosiaris [16:02:16] ah perfect [16:10:44] I think it won't conflict with anything so I'll be putting the newly renamed wikikube-worker2002 into service [16:11:50] kubemaster1001 is now saturating on CPU again [16:11:53] so it wasn't *just* otelcol, akosiaris [16:12:00] hnowlan: yeah go ahead, thanks for mentioning [16:12:34] a few tcp retransmits as well (although not at the old value) [16:13:26] effie: https://grafana.wikimedia.org/goto/rI2TPisIg?orgId=1 [16:16:18] and kubemaster1001 saturated on CPU for like a full two minutes during this rollout [16:16:21] which is better than last time [16:33:04] as things stand, will it make things worse/more complicated to add another colocated -ctrl node? [16:34:26] that's the big question -- I know c.danis has some thoughts when they get back from a quick break [16:35:14] "remove all the legacy nodes and hard cut all the way over to colocated ones" has a compelling case for taking all the pressure off, and also a compelling case for being fairly scary to do :) [16:37:33] thinking ahead, there's a train deploy in ~90 minutes so we should plan to be in a semistable state then, if we can -- right now, scap puts load on the kubemasters but not so badly that we either get paged or scap fails, so the situation is, what, livable [16:37:46] and also obviously thinking less far ahead, it's getting late in the day for both you and akosiaris :) [16:39:52] true [16:41:26] I'm definitely still open to making more moves like that, but we should start thinking about what kind of state we're comfortable leaving overnight [16:43:02] my 2 cents says that since the last scap deploy was successful, we leave it like this for now and pick it up again tomorrow EU morning [16:45:49] afk briefly to grab something to eat before my next meeting block starts, but I'll have one eye on IRC [16:50:24] akosiaris: that seems fine, I'm also fine leaving otelcol disabled today, it was responsible for a lot of the load, but maybe half at most [16:50:40] (I only disabled it in eqiad) [16:50:44] do we have a phab task for all this nonsense? [16:51:34] also akosiaris I think I understand why the load was going back and forth between the legacy machines [16:51:46] I'm guessing most k8s api clients that want to watch for changes connect to just one IP and reuse channels [16:52:42] oh hnowlan dumb question -- wikikube-ctrl1003's IP address is 10.64.32.37 and it will be staying that, correct ? [16:52:50] cdanis: yep! [16:52:53] great ty [16:53:27] cdanis: yeah, I am guessing long-lived HTTP connections too [16:53:39] akosiaris: the API call it makes is something like `GetAndWatch` so yeah [16:53:39] especially those using WATCH [16:53:51] so that kinda checks out too [16:53:53] I'll make a ticket for keeping everything together [16:54:11] thanks I'll prep some small patches [16:55:02] tx 264 MB/s 19.9 MB/s 201 kB/s [16:55:12] that's from kubemaster1001 in the last deploy [16:55:31] is that a total, akosiaris ?
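The per-host throughput being compared here comes from the node exporter network counters; a sketch of querying them directly, reusing the expression quoted in the next messages (the Thanos/Prometheus query endpoint URL is an assumption):
    # per-control-plane-host transmit rate in MB/s over the last 5m
    curl -sG 'https://thanos-query.discovery.wmnet/api/v1/query' \
      --data-urlencode 'query=sum by (instance) (rate(node_network_transmit_bytes_total{job="node",cluster="kubernetes",instance=~".*(master|ctrl).*:.*",device!~"lo",site="eqiad"}[5m])) / 1e6'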
[16:55:31] I never thought I 'd see that much traffic from an api server [16:55:41] cdanis: it's from https://grafana.wikimedia.org/d/000000607/cluster-overview?from=now-3h&to=now&var-server=kubemaster1002&var-datasource=thanos&var-cluster=kubernetes&orgId=1&viewPanel=437 [16:55:52] I've been looking at https://grafana.wikimedia.org/goto/0BAO_iyIg?orgId=1 [16:56:20] so, sum(rate(node_network_transmit_bytes_total{job="node",instance=~"$instance:.*",device!~"lo"}[5m])) [16:56:31] yeah that's basically that, plus uses irate instead of rate [16:56:44] sum by (instance) (irate(node_network_receive_bytes_total{job="node",cluster="kubernetes",instance=~".*(master|ctrl).*:.*",device!~"lo",site="eqiad"}[5m])) / 1e6 [16:56:47] yeah, irate gives an even higher max [16:56:49] 272 [16:56:51] lol [16:56:53] mhm [16:57:02] even without otelcol it's still >100 [16:57:25] and the wikikube-ctrl hosts are quite high on tx as well honestly [16:57:29] there's a few layers to this one for sure [16:57:54] yeah, 1 is etcd, another one is the kube-scheduler and kube-controller-manager masters [16:58:09] there's 1 that is the webhooks, which go to pods running in the cluster [16:58:23] https://phabricator.wikimedia.org/T366094 just added some basics [16:58:40] thanks Hugh! [16:59:45] btw, the new colocated etcd/master machines are 1G IIRC [16:59:51] lol [16:59:58] cause they aren't really new [17:00:02] reused hardware [17:00:29] I would love to know how we are sending 200MiB/s of grpc [17:00:32] this is grpc right? [17:00:32] I don't think we expected a network saturation when we said "this 3 mw boxes go for etcd control plane cause memory and CPU" [17:00:40] even if it is json protobufs over http, wtf [17:00:58] some might be grpc but not all of it for sure [17:01:01] there must be something pathological happening here though [17:03:36] standing down as IC for now since it sounds like we're done with the immediate response for the moment, but I'll still be around [17:03:54] I have a few more things to touch up in the doc, which I'll do gradually in between meetings today [17:04:10] other hands on that would be great too :) [17:04:31] I'll try to spruce up things where I can [17:04:40] hnowlan: when last week were the codfw nodes added? [17:04:42] just to clarify - these issues didn't block deploys other than Tim's initial one? [17:04:53] there were two I believe [17:04:55] and I think Lucas_WMDE's later attempt [17:05:00] yeah [17:05:04] we have the logs from scap in a phaste for that one [17:05:11] cdanis: 2024-05-22 between 10:00 and 11:00 [17:05:24] ok so also before the otelcol change [17:05:42] yeah the relevant IPs are in the config in your initial change [17:05:46] yeah :) [17:06:54] wikikube-ctrl1002 was still doing like 80MByte/s of network TX btw [17:07:06] I don't know if that was just etcd or what? [17:08:21] o_o wow [17:08:31] during the deploys? [17:09:54] yeah [17:09:59] https://grafana.wikimedia.org/goto/0BAO_iyIg?orgId=1 [17:10:22] it was actually doing slightly *more* network TX than wikikube-ctrl1001 [17:10:29] during the last two [17:10:36] so I don't know what that means exactly [17:16:30] I can't think of anything else it could be other than etcd if it's cordoned [17:17:17] hnowlan: so it's shown in the output of `kubectl get endpoints kubernetes -oyaml` [17:17:23] interesting ...
80 MB/s is quite a bit more than the etcd grpc metrics would suggest (combined -> client and -> peer tx) [17:17:40] swfrench-wmf: yeah when I looked at the hosts, it was mostly kube-apiserver doing the talking [17:17:49] but I didn't look at wikikube-ctrl1002 specifically [17:17:56] well, I guess I'll have a chance during the scap deploy heh [17:19:12] I guess it could be getting api calls regardless, just not scheduling [17:19:36] which I guess brings us back to the original Then What Did That question [17:20:11] kube-apiserver is running there and has consumed an hour of CPU time since it was started almost five hours ago [17:20:25] so, I think it's safe to say that kube-apiserver on wikikube-ctrl1002 has been getting traffic [17:20:59] I'll get another pcap during the scap [17:23:18] dear people, can I deploy something now? [17:23:54] Amir1: please [17:24:22] hnowlan, cdanis: wikikube-ctrl1002 does show API traffic here https://grafana.wikimedia.org/goto/iSZ3umyIR?orgId=1 [17:24:44] starting at 12:39 and then jumping again at 14:10 [17:27:06] hmm... not a very high rate unless I'm misjudging this graph [17:27:08] oh that's very useful thanks rzl [17:27:24] compared to wikikube-ctrl1001 [17:27:24] no, much lower than the other servers if you remove the instance= to see them all [17:27:26] yeah [17:27:37] I am suspicious of the LIST calls [17:27:45] well, not shown on that graph, sorry [17:27:51] if you also include kubemaster1001 [17:27:56] https://grafana.wikimedia.org/goto/kmF7risIg?orgId=1 [17:31:35] hey I wonder if we have sflow data for the earlier events now [17:37:56] is there something special about kubernetes1048? it shows prominently in apiserver sflow traffic [17:38:58] is 10.67.x.y always a pod IP? [17:40:41] no, but 10.67.128.0/18 is [17:40:47] per https://netbox.wikimedia.org/ipam/prefixes/378/prefixes/ [17:41:04] telling the I/F sub-team member to check netbox 💀 [17:41:18] you can imagine me doing v.olans's voice if it helps any [17:41:24] anyway that all checks out with https://w.wiki/ADNj [17:42:27] I gotta head off for the evening, but I'll be back around in a few hours if needed [17:42:30] gluck [17:42:41] hnowlan: I think we're just not gonna touch anything else until EU tomorrow [17:42:54] thanks for all the work hnowlan, have a good night [17:44:25] rzl: do you have a "rolling restart 10 pods" command handy [17:45:37] I do not, sorry [17:45:53] I can see if I can dig one up but probably not faster than you [18:09:12] a kube-apiserver is really just a way of doing etcd->client fanout huh [18:34:33] rzl: https://docs.google.com/spreadsheets/d/1BDGubGYYSIGOQh6YF42bddKpGiMKS_Uc0ltW8n47Ye8/edit?usp=sharing [18:35:02] (in a meeting but half-looking) [18:35:08] this is wikikube-ctrl1002 traffic during one of the spikes during a rollout [18:35:14] just a random sample of 500k packets [18:35:27] which is about 1.3Gbyte [18:35:36] it's all just like ... normal looking stuff [18:36:54] unfortunately whatever was around on the node IPs themselves is no longer around [18:37:14] at least of the 3 i've spot checked so far [18:42:29] it looks like all of these connect to the API on a node: kubelet, kube-proxy, rsyslogd
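That list of node-local API clients can be confirmed per node by looking at established connections to the apiserver service address seen in the earlier captures; a minimal sketch (the VIP 10.2.2.8 and port 6443 are taken from the pastes above):
    # show which local processes hold TCP connections to the Kubernetes API endpoint
    sudo ss -tnp dst 10.2.2.8:6443
    # count those connections per program name
    sudo ss -tnp dst 10.2.2.8:6443 | awk -F'"' 'NF>1 {print $2}' | sort | uniq -c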