[08:20:21] hnowlan: re-image all done. And homer ran on the network devices so it should be fully operational. I merged the cookbook CRs, so next step is to try the rename, move and re-image in 1 go to save time. jayme prepared kubernetes2032 for that, let me know if it's ok to proceed [09:29:08] XioNoX: yep, looks okay to me. Rename stuff in puppet is all done [09:31:22] sounds good, should I walk you through (or someone from your team) after lunch to make sure it's really self service ? [09:33:24] that'd be great [10:41:10] XioNoX: wikikube-worker2001 is safe to repool etc right? [10:59:04] dear SREs, hugh and I will be adding new control planes to kubernetes in eqiad [11:26:22] stevemunene: mind if I merge your puppet change? [11:26:43] go right ahead, thanks :) [11:26:54] no problem [11:50:06] hnowlan: yep, all yours [11:51:48] nice, done [11:53:00] hnowlan: good to try the cookbooks now? [11:53:56] XioNoX: sure! where should I start? [11:56:18] hnowlan: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging but in step 5 also add --move-vlan [11:57:29] hnowlan: let me know where you're running it from, so I can `tail -f` the logs along [12:02:24] XioNoX: I'm on cumin1002, just merging the puppet change now [12:03:13] cool! [12:06:38] alright, proceeding with the rename cookbook [12:09:32] 👀 [12:12:42] so far this is *amazing*. which vlan should I be moving to with the reimage? [12:13:21] hnowlan: it's automatic, it should move from private1-b-codfw, to private1-b5-codfw [12:13:27] rename cookbook done, success! [12:13:32] awesome ! [12:19:23] got an error on the sre.hosts.move-vlan step unfortunately [12:19:40] XioNoX: want me to paste the full error? [12:19:41] ok, looking [12:19:44] sure [12:21:20] https://phabricator.wikimedia.org/P63430 [12:26:18] hnowlan: easy fix :) https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1036642 [12:27:05] lgtm [12:27:21] thx, waiting for CI [12:44:35] hnowlan: my internet is acting up but the patch is deployed [12:48:26] you can re-run it anytime [12:50:30] dealing with the k8s master incident first [12:59:51] api graphs are looking a lot better [12:59:55] https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&from=now-1h&to=now [13:00:24] still unclear as to what actually happened [13:01:55] I wonder if a surge in updates from adding the new host and a deploy happening on top of it crushed the old masters [13:03:28] I am tempted to say we're resolved [13:04:15] I am tempted to agree [13:04:31] yeah, nothing looks bad on my side (even if late to the party) [13:05:08] looking at logstash k8s events [13:07:14] OK to let deployers deploy? [13:07:23] I think so [13:07:23] yeah, I think so [13:08:05] ok, thanks [13:13:23] failed to sync configmap cache: timed out waiting for the condition [13:13:31] ok, that might be the proper hint [13:14:23] as in, many pods emitted things through the time window of 12:01 to 12:39 UTC. In 2 batches in fact, the former longer [13:14:47] total appears to be ~16k events? [13:15:22] no more, ~16k events in mw-api-int alone.
In all namespaces we were close to ~80k [13:16:12] we also have a still continuing sudden increase in something I don't understand yet: https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&from=1716891187269&to=1716901987269&viewPanel=33 [13:16:41] normally these are warnings and don't impact anything [13:17:35] maybe though (so my current theory) they ended up being unable to get their configmaps for a long time and ended up aborting the deployment [13:17:44] kamila_: I think that's normal, you can see an equivalent increase when we added k8s masters for codfw [13:17:59] what caused the surge in events? [13:18:07] oh, ok, thanks hnowlan [13:19:24] hnowlan: nothing in SAL lines up time wise... [13:22:17] akosiaris: ? [13:26:51] I see a lot of 409s in https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?forceLogin&from=now-30m&orgId=1&to=now [13:26:59] which is HTTP status code for conflict [13:27:33] they aren't unheard of btw [13:27:47] nor do I see a volume that's unlike previous events [13:27:52] https://docs.google.com/document/d/12QY-N1oXRwY4tPHO0fwrvf2osvZnr-2Vjfl_3pAOjE4/edit?usp=sharing [13:28:35] but API latencies are quite higher than usual [13:29:05] are we maxing out etcd again? [13:29:19] we are in the midst of adding new kubemasters [13:29:35] now we are in a state where we have 5 etcd servers [13:29:46] is the etcd cluster status healthy? [13:29:46] the new nodes are stacked, right? [13:30:01] sorry, 4 etcd hosts [13:30:30] apiserver is at 100% [13:30:40] kubemaster1002 that is, and CPU wise looking at others [13:30:55] https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&from=now-6h&to=now&viewPanel=28 [13:32:10] * kamila_ needs a few min to finish lunch, but is here, just eating instead of updating incident doc [13:32:49] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kubemaster1001&var-datasource=thanos&var-cluster=kubernetes [13:33:13] vary it to kubemaster1002 and it's a similar pic [13:33:18] albeit shifted in time [13:34:27] those are the old hosts right akosiaris ? [13:34:40] uh I did a fuckup, I failed to make a copy '^^ [13:35:02] of incident doc [13:35:19] still current and serving traffic, but yes, if you want to distinguish them from the ones added today, yes they are the old ones [13:36:08] actual incident doc https://docs.google.com/document/d/1vTSjrC7Pm4vRhv0tE6xjJGPo5SG-KPEytB2CpASzoVc/edit?usp=sharing , please switch to that one [13:36:11] akosiaris: ^ [13:36:43] akosiaris: can we force the k8s master over to one of the new hosts [13:36:56] they 've just been added, they aren't proven to work ok [13:37:01] in fact, quite the contrary [13:37:27] akosiaris: wikikube-ctrl1001 should be fully functional [13:37:40] we completed this one [13:37:49] well, the incidents indicate that maybe not ? [13:38:02] at least it's not unreasonable to call it suspect [13:38:06] i mean it looks like we're maxing out the CPU of one of the vms during a deploy [13:38:09] is that incorrect?
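The etcd-side check referenced in the next messages can be run read-only from any member; a minimal sketch, assuming the v2-style etcdctl CLI used elsewhere in this log and the etcd client port 2379 seen on these hosts:
    # overall cluster health, as reported by one of the long-standing etcd members
    sudo etcdctl --endpoints https://kubetcd1004.eqiad.wmnet:2379 cluster-health
    # membership list, to confirm the newly added wikikube-ctrl hosts joined with the expected URLs
    sudo etcdctl --endpoints https://kubetcd1004.eqiad.wmnet:2379 member list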
[13:38:30] yeah, but that's something that just happened after adding more kubemasters to the cluster [13:38:41] it wasn't happening a couple of days ago [13:38:54] akosiaris: we are halfway to finishing the second one [13:39:12] I would argue that, after rollback is done, we finish up number 2 [13:39:56] cluster-health on kubetcd1004 says cluster is unhealthy [13:40:05] member 3374c2901e8cf8a9 is unreachable: [https://wikikube-ctrl1001.eqiad.wmnet:2379] are all unreachable [13:40:23] I am gonna say this isn't a sign of the new masters working ok [13:40:30] not at all [13:41:07] kamila_: I am randomly pasting findings in the doc btw, just marking them with the time of the finding [13:41:14] I 'll iron them out later [13:41:23] but please, do ask about anything that doesn't make sense [13:41:29] thank you akosiaris , appreciated <3 [13:41:37] failed to check the health of member 8d6d40beec490cfc on https://wikikube-ctrl1002.eqiad.wmnet:2379: Get "https://wikikube-ctrl1002.eqiad.wmnet:2379/health": dial tcp [2620:0:861:102:10:64:16:93]:2379: i/o timeout [13:41:44] akosiaris: etcd is only listening on v4 [13:41:47] second node isn't well either, per kubetcd [13:41:49] LISTEN 0 1024 10.64.48.45:2379 0.0.0.0:* users:(("etcd",pid=1082374,fd=9)) [13:41:51] LISTEN 0 1024 10.64.48.45:2380 0.0.0.0:* users:(("etcd",pid=1082374,fd=8)) [13:42:17] cdanis: perhaps we need a puppet run + restart ? [13:42:28] I can do it [13:42:34] I am already on the host [13:42:37] no wait [13:42:43] it isn't even on the old host [13:43:03] ? [13:43:21] akosiaris: on kubetcd1004 etc [13:43:50] effie: ok, please put a global scap lock with a proper message while we figure this one out, no point in having deployers go through this [13:44:03] so at least let's save them the trouble [13:44:08] right away yes [13:45:25] akosiaris: the dial tcp error you posted, where is it from? [13:45:48] cdanis: etcdctl cluster-health command from kubetcd1004 [13:46:29] kamila_: scap lock is now in place per effie [13:47:03] cdanis: all 3 of them actually [13:47:05] akosiaris: we were discussing that, we added nodes to etcd using their hostnames [13:47:07] all old 3 of them [13:47:22] and by old I mean old etcd nodes, the dedicated ones [13:48:03] going back to what we already discussed, could it be possible that they are trying to reach the ipv6 addr of the wikikube-ctrl ones ? [13:48:12] while they are listening to v4 ? [13:48:30] akosiaris: I am going to change your times to UTC [13:48:31] In the doc [13:48:35] marostegui: thanks! [13:48:40] forgot about that completely [13:48:44] effie: that's how they are configured [13:49:29] thanks for the scap lock [13:49:49] ok ok [13:50:13] ok first things first, apparently firewall for etcd isn't open on wikikube-ctrl1001 [13:50:56] restart ferm?
(funny I just ran iptables -nL) [13:51:04] 2024-05-28T11:54:15.815634+00:00 wikikube-ctrl1001 puppet-agent[1077856]: (/Stage[main]/Ferm/Service[ferm]) Triggered 'refresh' from 7 events [13:51:10] !log run puppet and restart ferm [13:51:11] so it did get restarted 2 hours ago by puppet [13:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:17] !log run puppet and restart ferm on wikikube-ctrl1001 [13:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:39] I added an action item to review codfw as well and see why this has bitten us here and where codfw differed [13:52:15] so, there is no ferm rule for port 2379 on the new hosts [13:52:39] I suppose that's because apiserver talks directly to localhost, that's my recollection when discussing with Janis [13:52:47] May 28 11:51:31 wikikube-ctrl1001 puppet-agent[1077856]: (/Stage[main]/Profile::Kubernetes::Master/Ferm::Service[apiserver-https]/File[/etc/ferm/conf.d/10_apiserver-https]) Scheduling [13:52:50] refresh of Service[ferm] [13:53:02] but it is biting our ability to debug things so we may just have to add it [13:53:13] &SERVICE(tcp, 6443); [13:53:44] ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:6443 [13:54:07] is [13:54:12] are we not accepting v6 traffic to the api? [13:54:18] !log add manually ferm client rule on wikikube-ctrl1002 and disable puppet [13:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:48] member 8d6d40beec490cfc is healthy: got healthy result from https://wikikube-ctrl1002.eqiad.wmnet:2379 [13:54:56] doing same thing on wikikube-ctrl1001 [13:56:31] ok, now cluster-health reports that everything is fine at least. Those ferm rules might not have even been the issue, but they were interfering with cluster-health command [13:58:07] akosiaris: I think you're right and that was a red herring that was getting in the way of debugging [13:58:59] 💔cdanis@wikikube-ctrl2001.codfw.wmnet ~ 🕙☕ sudo etcdctl --endpoints https://wikikube-ctrl2001.codfw.wmnet:2379 cluster-health [13:59:01] member 12dfea4341e87158 is healthy: got healthy result from https://kubetcd2004.codfw.wmnet:2379 [13:59:03] member 2cab88b3bb2a9741 is healthy: got healthy result from https://wikikube-ctrl2001.codfw.wmnet:2379 [13:59:05] member 5d590352bc3189ab is healthy: got healthy result from https://kubetcd2006.codfw.wmnet:2379 [13:59:07] failed to check the health of member a12e748f437066e6 on https://wikikube-ctrl2002.codfw.wmnet:2379: Get "https://wikikube-ctrl2002.codfw.wmnet:2379/health": dial tcp [2620:0:860:103:10:192:32:105]:2379: i/o timeout [13:59:09] member a12e748f437066e6 is unreachable: [https://wikikube-ctrl2002.codfw.wmnet:2379] are all unreachable [13:59:11] member b6bd69df4873a5a8 is healthy: got healthy result from https://kubetcd2005.codfw.wmnet:2379 [13:59:13] failed to check the health of member d9d70e528a11de5e on https://wikikube-ctrl2003.codfw.wmnet:2379: Get "https://wikikube-ctrl2003.codfw.wmnet:2379/health": dial tcp [2620:0:860:106:10:192:5:10]:2379: i/o timeout [13:59:15] member d9d70e528a11de5e is unreachable: [https://wikikube-ctrl2003.codfw.wmnet:2379] are all unreachable [13:59:17] cluster is degraded [13:59:19] etcdctl cluster-health on codfw reports similarly but it's working [13:59:21] so [14:00:18] I kinda think that the trigger here (aside from the deploy) was just the added load in needing to keep the new hosts synced as well [14:00:52] it's the only thing I've thought of that fits all the data [14:01:41]
yeah, adding nodes is expected to increase load, but I don't know by how much [14:01:46] !log remove wikikube-ctrl1002 from the rotation to test a theory [14:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:08] !log roll restart mw-api-int pods [14:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:32] ok, I am seeing kubemaster1002 at 100% CPU already [14:05:01] kubemaster1001 isn't really breaking a sweat though [14:05:30] akosiaris: I will sort out the codfw side of things [14:05:50] ping me when you are done so I can finish up the stacked masters work because we are right in the middle of it [14:06:03] done... I am just testing a theory [14:06:21] no problem, you reproduced the problem [14:06:44] kubemaster1002 still maxed on CPU lol [14:09:57] it's the apiserver maxing out but kubemaster1002 isn't even the lease holder for `kube-scheduler` or `kube-controller-manager` [14:10:54] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kubemaster1002&var-datasource=thanos&var-cluster=kubernetes&from=now-1h&to=now&viewPanel=31 [14:10:56] tcp retransmits??? [14:10:59] cdanis: isn't that weird? [14:11:09] !log restart kube-apiserver on kubemaster1002 [14:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:13] like 3k retransmits a second on kubemaster1002 [14:11:46] akosiaris: have you added back 1002 ? [14:11:48] that suggests some *other* firewall issue imo [14:12:17] the roll restart finished btw, but in the process I did lower weight of kubemaster1002 to 1 from 10 [14:12:30] can't say I witnessed load shifting though [14:13:01] hmmm [14:13:08] the retransmits are new [14:13:33] so kubemaster1002 is pointed to the 3 old etcd servers. Which ... should be fine ? [14:13:37] akosiaris: do we have reverse DNS somewhere for k8s pod IP addresses? [14:13:48] --etcd-servers=https://kubetcd1004.eqiad.wmnet:2379,https://kubetcd1005.eqiad.wmnet:2379,https://kubetcd1006.eqiad.wmnet:2379 [14:14:17] cdanis: no, we did have a placeholder at some point that said the equivalent of "thisisak8spod" [14:14:28] but nothing more than that [14:14:51] we have ofc delegated a zone to coredns but it has never been a priority [14:15:38] kubemaster1001 had TCP retransmits though the previous week btw, but amount in the 10s, not the 1000s [14:16:05] https://grafana.wikimedia.org/goto/l5RVJiySg?orgId=1 [14:16:07] it does look to be a new problem [14:16:20] https://grafana-rw.wikimedia.org/goto/l5RVJiySg?orgId=1 if that doesn't work [14:16:34] thousands of segments a second is definitely a problem [14:16:42] akosiaris: can you reproduce the issue again? [14:17:06] on kubemaster1002 I have running `sudo tshark -Y tcp.analysis.retransmission` [14:17:58] uhm, I just rebooted it to add another 4 vcpus [14:18:09] but I can try once it's up and running [14:18:22] akosiaris: the TCP retransmits seem to correlate afaict, would be useful to confirm [14:18:43] agreed [14:18:48] https://gerrit.wikimedia.org/r/1036680 akosiaris cdanis [14:19:29] !log add another 4 vcpus to kubemaster1002 [14:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:58] cdanis: host is up [14:21:12] wanna restart tshark and I try to reproduce? [14:21:31] akosiaris: shall I deploy the puppet change first ?
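The retransmission watch mentioned above can also be scoped to the ports involved here; a sketch, assuming eno1 as the capture interface (eno1 and ens13 both show up in later pastes) and the etcd/apiserver ports 2379 and 6443 from this log:
    # live view of TCP retransmissions involving etcd (2379) or the apiserver (6443)
    sudo tshark -i eno1 -f 'tcp port 2379 or tcp port 6443' -Y tcp.analysis.retransmission
    # alternatively, save a pcap for offline analysis instead of printing packets
    sudo tshark -i eno1 -f 'tcp port 2379 or tcp port 6443' -w /tmp/retransmit-debug.pcap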
[14:22:16] effie: go ahead, I think it's unrelated to trying to reproduce this [14:22:33] akosiaris: no argument, just to at least sort one thing at a time [14:22:58] go ahead [14:24:25] cdanis: btw, this is weird: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kubemaster1002&var-datasource=thanos&var-cluster=kubernetes&viewPanel=8 [14:24:37] the VM maxed out the veth [14:24:45] in all 3 occasions [14:24:52] akosiaris: I am enabling puppet on wikikube hosts [14:24:58] effie: thanks! [14:25:04] log it please [14:25:29] looking back 30 days, kubemaster1002 has never maxed out the nic [14:25:39] akosiaris: well I guess that explains the retransmits [14:25:41] cdanis: that might explain the retransmits btw [14:25:57] I can't wait until we have sflow data [14:26:10] same thing for kubemaster1001 earlier today [14:26:25] went up to 160MB/s per https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kubemaster1001&var-datasource=thanos&var-cluster=kubernetes&viewPanel=8&from=now-3h&to=now [14:26:40] which doesn't exactly compute on a 1Gbps link but whatever [14:26:54] akosiaris: kubemaster1001 was also sending a fair bit of traffic while kubemaster1002 was maxing out [14:27:02] I am gonna attribute the retransmits to maxing out the nic [14:27:08] might be cpu bound if it's a VM? would be weird but not unbelievable [14:27:21] yeah kubemaster1002 looked quite cpu bound [14:27:26] so that's not too surprising I guess [14:27:31] anyway, I 'll try and roll_restart mw-api-int again and see what happens with 8vcpus [14:27:33] the question is why ofc :D [14:27:57] ok, kubemaster1001 just got maxed out [14:28:11] cdanis: akosiaris what about I continue and fully add wikikube-ctrl1002 [14:28:12] I am gonna switch it to depooled in pybal [14:29:01] akosiaris: wdyt? [14:29:13] !log depool kubemaster1001, its CPU is saturated after a test roll restart [14:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:23] kubemaster1001 is un-saturating now lol [14:29:37] and kubemaster1002 is saturating now [14:29:38] kubemaster1002 just saturated akosiaris [14:29:46] even at 8 vcpus [14:29:53] what on earth is going on suddenly?
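For reference, the vCPU bumps applied to these VMs are ordinary Ganeti instance modifications; a sketch, assuming the commands are run on the Ganeti cluster's master node and that a full restart of the instance is acceptable:
    # raise the instance to 8 vCPUs; -B sets backend parameters such as vcpus and memory
    sudo gnt-instance modify -B vcpus=8 kubemaster1002.eqiad.wmnet
    # a full (stop/start) reboot so the new size actually takes effect
    sudo gnt-instance reboot --type=full kubemaster1002.eqiad.wmnet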
[14:30:10] 735840 129.174369713 10.64.32.215 → 10.64.32.116 TCP 725 [TCP Retransmission] 2379 → 41422 [PSH, ACK] Seq=32428618 Ack=6189478 Win=1642 Len=659 TSval=3628286519 TSecr=2711688260 [14:30:11] !log repool kubemaster1001, testing something [14:30:13] 735990 129.183597149 10.64.132.27 → 10.2.2.8 TCP 245 [TCP Retransmission] 52464 → 6443 [PSH, ACK] Seq=41565 Ack=722893 Win=1223680 Len=179 TSval=3696948241 TSecr=2712943532 [14:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:15] 736027 129.187059896 10.64.48.195 → 10.2.2.8 TCP 246 [TCP Retransmission] 42014 → 6443 [PSH, ACK] Seq=60442 Ack=744378 Win=1255424 Len=180 TSval=3841928 TSecr=1128001387 [14:30:17] 736210 129.202095669 10.64.16.232 → 10.2.2.8 TCP 101 [TCP Retransmission] 46140 → 6443 [PSH, ACK] Seq=51280 Ack=438328 Win=192000 Len=35 TSval=3489838887 TSecr=2879749280 [14:30:19] 736416 129.234338349 10.64.32.215 → 10.64.32.116 TCP 105 [TCP Retransmission] 2379 → 41534 [PSH, ACK] Seq=81478 Ack=171514 Win=786 Len=39 TSval=3628286579 TSecr=2711688339 [14:30:21] 736530 129.283270854 10.64.16.178 → 10.64.32.116 TCP 287 [TCP Retransmission] 2379 → 57118 [PSH, ACK] Seq=2354558 Ack=214121 Win=881 Len=221 TSval=558139462 TSecr=1485612713 [14:30:23] 736868 129.336803710 10.64.48.225 → 10.2.2.8 TCP 136 [TCP Retransmission] 54154 → 6443 [PSH, ACK] Seq=49145 Ack=757522 Win=1244160 Len=70 TSval=2714440602 TSecr=307014078 [14:30:36] so that looks like traffic to *other* k8s api servers and also from etcd that is being retransmitted [14:30:39] so kinda everything [14:31:23] adding kubemaster1001 to the pool once more and things are a bit more balanced [14:31:39] the roll restart errored out though [14:31:41] I don't have much context here but I wanted to share https://phabricator.wikimedia.org/P63443. can these be related in any way? these changes were done on netbox but not committed [14:31:52] Error: UPGRADE FAILED: release canary failed, and has been rolled back due to atomic being set: unexpected error when reading response body. Please retry. Original error: http2: client connection lost [14:31:58] sukhe: doubt it [14:32:00] sukhe: unrelated [14:32:02] ok [14:32:28] nic of kubemaster1001 maxed out again [14:32:34] sukhe: this is part of the host renaming work on codfw [14:32:44] thank you very much though [14:32:45] noted [14:33:15] so, even without the CPU thing, this is worrying [14:33:44] if we are maxing out nics, that's not good [14:35:25] kubemaster1002 is running on ganeti1010, one of the old nodes which only has a 1G interface, we could migrate it to one of the more recent nodes with 10G? [14:35:36] I 'll add another 4 vcpus to kubemaster1001 just to be consistent [14:36:09] moritzm: I suppose so, although we probably still want to explain why suddenly we went from 40MB/s to 120+MB/s [14:36:19] let me do both and see how we fare [14:37:25] !log repool kubemaster1001 with 8 vcpus for consistency with kubemaster1002. [14:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:32] !log reboot kubemaster1001 with 8 vcpus for consistency with kubemaster1002. [14:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:50] ack.
ganeti1037 is also in row C, has a 10G NIC and sufficient free RAM [14:37:58] I don't know if we are having retransmits because we're saturating the interface or we're saturating the interface with retransmits [14:38:20] moritzm: cool, I was about to ask, thanks for the host names :-) [14:39:05] or ganeti1028 would also work [14:39:31] cdanis: that ^ btw explains why kubemaster1001 was able to reach 160MB/s [14:39:41] it's on ganeti1030 and ganeti1030 has 10Gbps nic [14:40:01] hah [14:40:09] that's right, we did that for the ganeti machines at some point [14:40:14] effie: akosiaris: https://grafana-rw.wikimedia.org/goto/CJ8JxmsIg?orgId=1 [14:40:45] I think forthcoming refresh of Ganeti in eqiad/codfw next FY will replace the remaining legacy servers with 1G [14:40:54] interestingly there have been some similar network spikes in codfw as well? [14:41:10] https://grafana-rw.wikimedia.org/goto/O1_-bisIR?orgId=1 [14:42:10] so I guess the codfw spikes were deploys https://grafana-rw.wikimedia.org/goto/Nx1LxmySg?orgId=1 [14:44:31] !log gnt-instance replace-disks for kubemaster1002, set ganeti1037 as a secondary [14:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:46] ~10m for this one [14:45:33] still, all we are doing is bumping limits up to now [14:45:54] not explaining what has happened and why suddenly we needed 2.5 times more bandwidth [14:45:56] I still don't know whether the retransmits and saturation are causes or effects [14:45:58] or CPU [14:46:08] I think we're just saturating on everything and we were close to saturating before [14:46:18] and adding more hosts to the control plane means we're saturating earlier [14:46:20] akosiaris: still, should we just finish up adding 1002 ? [14:46:30] https://grafana-rw.wikimedia.org/goto/91LjxmsSR?orgId=1 [14:46:39] the high network utilization isn't new, it just usually was lower and/or shorter [14:46:40] they are proper machines and we won't have to play resource whack-a-mole [14:46:55] I think we should validate that the new hosts work ASAP and take the old hosts out of rotation [14:46:56] effie: I 'd rather we kept variables stable while we debug this [14:47:18] at least until we have raised the cap on the net saturation too and see if those 2 were the issues [14:47:32] I think that hypothesis makes sense [14:47:35] akosiaris: but what if we want to validate the hypothesis that having both the old hosts and the new hosts be part of the control plane is the triggering issue [14:47:39] effie: what's left regarding 1002 btw? [14:47:45] and 1003 for that matter? [14:47:54] I see 1003 isn't even an etcd node yet, is that correct? [14:48:11] akosiaris: bgp stuff and calling it a master node [14:48:20] 1003 has puppet disabled [14:48:36] effie: calling it a master node? [14:48:48] akosiaris: labeling it [14:48:59] kubemaster1001 is the one with the 1G NIC in hardware right? [14:49:02] can we just ... turn it off [14:49:06] +1 [14:49:10] cdanis: we can, but I am trying to understand what that validation would prove [14:49:18] cdanis: 1002 and I am already migrating it away [14:49:35] I need like 6m and it will be on a 10G host [14:50:02] Tue May 28 14:49:18 2024 - INFO: - device disk/0: 46.10% done, 6m 6s remaining (estimated) [14:50:39] my main worry is that we rush pushing forward to the new master, find out something's borked and will not have a way back fwiw [14:50:40] akosiaris: I am far from sure that the NIC is the limiting factor [14:51:18] kamila_: it wasn't earlier for sure.
Something has changed and suddenly it is [14:51:28] at least it's the trigger of those retransmits [14:51:34] cause kubemaster1001 doesn't have them [14:51:37] and it's on a 10G host [14:51:45] so at least we rule out 1 thing [14:51:54] "we added more load and hit the limit" seems like a plausible hypothesis to me [14:51:56] now what on earth is the trigger for all of this, I don't know [14:52:14] ok, let's see, thanks akosiaris [14:52:18] we are talking here x2.5 more traffic [14:52:48] and funnily enough, etcd shouldn't be playing a factor here (at least per configuration of apiserver) [14:53:15] cdanis: does your tshark have the IP addresses of wikikube-ctrl100{1,2} in them btw? [14:53:19] akosiaris: yes [14:53:22] especially for port 2379 [14:53:23] 2.5x more traffic and we added new hosts [14:53:23] ? [14:53:26] yes [14:53:59] btw akosiaris CPU saturation during rollouts was an issue before too https://grafana.wikimedia.org/goto/oPVFamsSR?orgId=1 [14:55:16] is someone free to take over keeping the incident doc up to date? my RSI is complaining :-/ [14:55:27] kamila_: yeah I'll take it [14:55:32] thank you <3 [14:55:36] cdanis: yes on kubemaster1002 talking to port 2379/tcp wikikube-ctrl100{1,2} ? [14:56:09] hm [14:56:10] akosiaris: no [14:56:16] 💙cdanis@kubemaster1002.eqiad.wmnet ~ 🕚☕ sudo ss -t | grep -i 2379 | awk '{print $4}' | cut -d: -f1 | sort | uniq -c [14:56:18] 247 10.64.32.116 [14:56:23] Tue May 28 14:55:18 2024 - INFO: Instance kubemaster1002.eqiad.wmnet's disks are in sync [14:56:26] it's all talking to itself (kubemaster1002) [14:56:30] ok moving over and testing again [14:56:58] !log migrate kubemaster1002 to ganeti1037 [14:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:19] Tue May 28 15:00:06 2024 * memory transfer progress: 101.92 % [15:00:23] ok, never seen that before [15:00:28] hah [15:00:36] Tue May 28 15:00:28 2024 * memory transfer progress: 113.26 % [15:00:38] oh that's nice [15:00:44] insert red october meme [15:01:20] I know what this is btw [15:01:30] or at least I think I do [15:02:20] kubemaster1001 just saturated [15:02:23] briefly [15:03:04] all I did was to restart kube-apiserver on kubemaster1002 fwiw [15:03:20] ok so just startup then probably [15:03:21] which adds credence to your theory that we were maxing out on both [15:03:54] VM migration done, I 'll start up the test shortly [15:05:04] akosiaris: uhhh page? [15:05:14] akosiaris: while it is not the problem obviously, we do have etcd nodes constantly talking first to v6 to each other [15:05:37] kubemaster1001 maxed out on CPU [15:05:39] hi, still reading scrollback [15:05:46] and is spewing about 2Gbps of network tx [15:05:58] cdanis: kubemaster1001 CPU saturating. at 8 vcpus [15:06:06] kubemaster1002 is faring ok-ish [15:06:47] akosiaris: wikikube-ctrl1001,2 are each receiving about 500Mbit/s or more [15:07:01] from who? [15:07:07] I'm guessing from kubemaster1001 [15:07:25] how? it's not even configured to use them [15:07:38] it would make sense if it was the other 3 kubetcds [15:07:45] and it was a misconfiguration issue [15:07:48] Acquire Time: 2024-05-28T12:38:18.055302Z [15:07:51] Holder Identity: wikikube-ctrl1001_68459a11-c7dd-407b-9041-02f1c9d6bacc [15:07:53] are you sure [15:08:00] the kube-scheduler lease is owned by wikikube-ctrl1001 [15:08:08] so is kube-controller-manager [15:08:22] ok, that one makes sense [15:08:34] but 1002 shouldn't receive traffic then [15:08:54] it's still part of the etcd cluster isn't it?
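The Acquire Time / Holder Identity output quoted above comes from the leader-election leases; a sketch of checking which control-plane host currently leads, assuming lease-based election with the upstream default object names in kube-system (older setups record this on Endpoints or ConfigMaps instead):
    # which hosts hold the scheduler and controller-manager leader leases
    kubectl -n kube-system get lease kube-scheduler kube-controller-manager
    # full detail, including holderIdentity and acquireTime as pasted above
    kubectl -n kube-system describe lease kube-scheduler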
[15:09:15] I mean receive traffic from kubemasters [15:09:33] but > it would make sense if it was the other 3 kubetcds [15:09:35] definitely applies [15:10:05] akosiaris: can you roll-restart again perhaps [15:10:36] yes [15:10:46] but at least I don't see retransmits this time around [15:10:50] so that's 1 issue solved [15:10:57] heh [15:11:01] so that was network saturation then [15:11:03] another N to go [15:11:05] NIC saturation, rather [15:11:15] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kubemaster1001&var-datasource=thanos&var-cluster=kubernetes&viewPanel=8&from=now-3h&to=now [15:11:29] yeah, kubemaster1001 reached 180MB/s this time around [15:11:32] yeah [15:11:47] close to 1.5Gbps, so at least we solve that weird one [15:11:55] okay, I think I'm caught up, hello [15:11:57] need any more hands? [15:11:59] triggering another rolling restart [15:12:08] good morning rzl [15:12:12] rzl: did you have coffee? [15:12:20] if no, go have one, we are on it [15:12:22] 692 kube /usr/bin/kube-apiserver ens13 65259.266 8493.538 KB/sec [15:12:43] drinking it as we speak <3 appreciate it though [15:12:43] PID USER PROGRAM DEV SENT RECEIVED [15:12:48] 1188264 kube /usr/bin/kube-apiserver eno1 36571.223 584.710 KB/sec [15:12:54] 1082374 etcd /usr/bin/etcd eno1 1226.272 3045.038 KB/sec [15:13:00] that's from wikikube-ctrl1001 [15:13:17] morning rzl o/ [15:14:11] \o [15:16:06] it's literally all of them https://grafana.wikimedia.org/goto/giw-BiyIR?orgId=1 [15:16:10] all of them have a huge TX spike when this happens [15:16:16] it's just hugest on kubemaster1001 [15:18:36] and it was mostly kube-apiserver [15:20:15] so, kubemaster1001 kinda saturated again, but not so badly as previously [15:20:20] cpu wise [15:20:25] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=kubemaster1001&var-datasource=thanos&var-cluster=kubernetes&from=now-1h&to=now [15:20:36] I wanna try a full scap, anyone against? [15:21:34] sure [15:21:54] I am thinking [15:23:37] let me actually fix 1 small thing that is bugging me first [15:24:24] akosiaris: let's have a go and see, we may need a deployer around as there are changes that were not backported [15:24:33] give me a shout to release the lock [15:24:56] will do [15:25:49] effie: I'm around [15:25:58] Amir1: <3 [15:27:55] restarting kube-apiserver these days takes absurdly long [15:28:10] it's very noticeable in the CPU utilization [15:30:00] running 1 last test. My change is that I put in kube-apiservers the names of the other 2 etcds [15:30:10] akosiaris: lmk if there's anything I should be updating in the timeline :) [15:30:23] (oh, I should say in here too, I've grabbed IC from kamila_) [15:30:42] (that is a very good point, thanks :D) [15:31:27] akosiaris: ? [15:31:32] and done [15:31:51] do you mean there is something that we should have done [15:31:52] ? [15:32:36] no, my test is done [15:33:12] and now I am trying to make sense out of https://grafana.wikimedia.org/goto/giw-BiyIR?orgId=1 [15:33:32] interestingly, the pattern has shifted [15:33:50] not hugely but somewhat [15:34:33] it's more balanced now [15:35:36] akosiaris: I still suggest we finish up at least the 2nd master [15:38:30] host overview btw CPU graph panel has a resolution of 1/3 [15:38:34] it's ... misleading?
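One quick way to confirm which etcd endpoints each apiserver is actually configured with (the --etcd-servers list quoted earlier, and the change just described) is to read it off the running process; a minimal sketch:
    # print the etcd endpoint list of the running kube-apiserver
    ps -o args= -C kube-apiserver | tr ' ' '\n' | grep -- --etcd-servers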
[15:38:42] https://i.imgur.com/Kr4nEt1.png [15:38:51] this is from kubemaster1001 while it was saturating [15:39:00] TCP streams sorted by bytes of traffic [15:40:07] effie: unlock the global scap lock please [15:40:56] done [15:41:46] thanks [15:41:57] Amir1: any chance you wanna finish that backport? [15:42:07] sure [15:42:12] I am around for 1 more hour btw [15:45:57] hnowlan: I got this deploying via scap 15:45:00 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-28-154407-publish (ran as mwdeploy@kubernetes2032.codfw.wmnet) returned [255]: ssh: Could not resolve hostname kubernetes2032.codfw.wmnet: Name or service not known [15:46:28] hnowlan: ^ [15:47:22] where is that list being pulled from? [15:47:30] that rename is complete, shouldn't be anything hanging around [15:47:51] maybe it's still in puppetdb and thus dsh? [15:48:07] it's that hack we do to prepopulate images on nodes? [15:48:25] https://puppetboard.wikimedia.org/node/kubernetes2032.codfw.wmnet [15:48:29] puppet seems to be failing here [15:48:38] deploy1002:/etc/dsh/group$ grep kubernetes2032 kubernetes-workers [15:48:38] kubernetes2032.codfw.wmnet [15:48:39] yup [15:48:47] akosiaris: I am a bit concerned heh https://grafana.wikimedia.org/goto/rI2TPisIg?orgId=1 [15:49:14] ouch [15:49:27] spiking to 100 MByte/s across the whole daemonset [15:49:46] and this is one thing that has changed from last week to the next [15:50:56] indeed [15:51:56] sukhe: I think hnowlan is working on that host (with the rename cookbook), so probably semi-expected [15:52:04] RESOLVED: [4x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@eqiad in state failed - [15:52:11] ok, that's 1 good thing that happened [15:52:13] heh [15:52:41] so, the masters are nodes in our setup currently [15:52:51] so they run those opentelemetry daemonsets [15:53:33] sukhe: I've deactivated that node [15:54:00] thanks [15:54:16] (was just sharing in case it was missed) [15:55:14] Amir1: how is it going? [15:55:18] the newly reimaged wikikube-worker2002 is ready to go - any issues with me pooling/uncordoning it given the current events? [15:55:28] akosiaris: no issues on my side [15:55:34] 85% done [15:55:58] kubemaster1001 TX is still growing lol [15:57:33] akosiaris: any objections to me stopping otelcol in eqiad right now? [15:57:34] akosiaris: chris will undeploy the opentelemetry daemonset [15:57:51] cdanis: none on my side [15:58:25] it's not running btw on kubemaster1001 from what I see [15:58:29] akosiaris: hey [15:58:31] do you remember how [15:58:38] we had to hardcode the master IPs [15:58:42] in the otelcol values file [15:58:50] GUESS WHICH TWO ARE LISTED FOR EQIAD [15:58:57] ahahahaha [15:58:59] # eqiad k8s API endpoint IPs [15:59:01] - ipBlock: [15:59:03] cidr: 10.64.0.117/32 [15:59:05] - ipBlock: [15:59:07] cidr: 10.64.32.116/32 [15:59:09] the ONLY two [15:59:11] that it can talk to [15:59:24] hooooly moooly [15:59:27] ahahahaha [16:00:19] so, maybe that functionality talks to the k8s API too much? [16:00:32] btw, that explains the CPU exhaustion too! [16:00:35] it *shouldn't* [16:00:44] suddenly there are another 200 clients asking for stuff [16:00:55] yes [16:01:02] ok I've undeployed it [16:01:06] DELETED RELEASES: [16:01:09] NAME [16:01:11] main-opentelemetry-collector [16:01:13] ] [16:01:30] ofc I did that once things had already recovered [16:01:36] because I was gawking at that part of the values file [16:01:49] akosiaris: scap run or a rolling restart ? [16:01:58] ?
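The hard-coded API-endpoint ipBlocks quoted above end up in a NetworkPolicy, so they can also be audited from the cluster side; a sketch (the CIDR comes from the paste above; no namespace or object names are assumed):
    # find which NetworkPolicies pin the old kubemaster IPs as allowed egress destinations
    kubectl get networkpolicy --all-namespaces -o yaml | grep -B5 '10.64.32.116/32'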
[16:02:08] effie: Amir is doing a scap deploy [16:02:10] we want to trigger again and see what happens [16:02:12] I think it finished akosiaris [16:02:16] ah perfect [16:10:44] I think it won't conflict with anything so I'll be putting the newly renamed wikikube-worker2002 into service [16:11:50] kubemaster1001 is now saturating on CPU again [16:11:53] so it wasn't *just* otelcol, akosiaris [16:12:00] hnowlan: yeah go ahead, thanks for mentioning [16:12:34] a few tcp retransmits as well (although not at the old value) [16:13:26] effie: https://grafana.wikimedia.org/goto/rI2TPisIg?orgId=1 [16:16:18] and kubemaster1001 saturated on CPU for like a full two minutes during this rollout [16:16:21] which is better than last time [16:33:04] as things stand, will it make things worse/more complicated to add another colocated -ctrl node? [16:34:26] that's the big question -- I know c.danis has some thoughts when they get back from a quick break [16:35:14] "remove all the legacy nodes and hard cut all the way over to colocated ones" has a compelling case for taking all the pressure off, and also a compelling case for being fairly scary to do :) [16:37:33] thinking ahead, there's a train deploy in ~90 minutes so we should plan to be in a semistable state then, if we can -- right now, scap puts load on the kubemasters but not so badly that we either get paged or scap fails, so the situation is, what, livable [16:37:46] and also obviously thinking less far ahead, it's getting late in the day for both you and akosiaris :) [16:39:52] true [16:41:26] I'm definitely still open to making more moves like that, but we should start thinking about what kind of state we're comfortable leaving overnight [16:43:02] my 2 cents says that since the last scap deploy was successful, we leave it like this for now and pick it up again tomorrow EU morning [16:45:49] afk briefly to grab something to eat before my next meeting block starts, but I'll have one eye on IRC [16:50:24] akosiaris: that seems fine, I'm also fine leaving otelcol disabled today, it was responsible for a lot of the load, but maybe half at most [16:50:40] (I only disabled it in eqiad) [16:50:44] do we have a phab task for all this nonsense? [16:51:34] also akosiaris I think I understand why the load was going back and forth between the legacy machines [16:51:46] I'm guessing most k8s api clients that want to watch for changes connect to just one IP and reuse channels [16:52:42] oh hnowlan dumb question -- wikikube-ctrl1003's IP address is 10.64.32.37 and it will be staying that, correct ? [16:52:50] cdanis: yep! [16:52:53] great ty [16:53:27] cdanis: yeah, I am guessing long-lived HTTP connections too [16:53:39] akosiaris: the API call it makes is something like `GetAndWatch` so yeah [16:53:39] especially those using WATCH [16:53:51] so that kinda checks out too [16:53:53] I'll make a ticket for keeping everything together [16:54:11] thanks I'll prep some small patches [16:55:02] tx 264 MB/s 19.9 MB/s 201 kB/s [16:55:12] that's from kubemaster1001 in the last deploy [16:55:31] is that a total, akosiaris ?
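The per-host throughput being compared here comes from the node exporter network counters; a sketch of querying them directly, reusing the expression quoted in the next messages (the Thanos/Prometheus query endpoint URL is an assumption):
    # per-control-plane-host transmit rate in MB/s over the last 5m
    curl -sG 'https://thanos-query.discovery.wmnet/api/v1/query' \
      --data-urlencode 'query=sum by (instance) (rate(node_network_transmit_bytes_total{job="node",cluster="kubernetes",instance=~".*(master|ctrl).*:.*",device!~"lo",site="eqiad"}[5m])) / 1e6'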
[16:55:31] I never thought I 'd see that much traffic from an api server [16:55:41] cdanis: it's from https://grafana.wikimedia.org/d/000000607/cluster-overview?from=now-3h&to=now&var-server=kubemaster1002&var-datasource=thanos&var-cluster=kubernetes&orgId=1&viewPanel=437 [16:55:52] I've been looking at https://grafana.wikimedia.org/goto/0BAO_iyIg?orgId=1 [16:56:20] so, sum(rate(node_network_transmit_bytes_total{job="node",instance=~"$instance:.*",device!~"lo"}[5m])) [16:56:31] yeah that's basically that, plus uses irate instead of rate [16:56:44] sum by (instance) (irate(node_network_receive_bytes_total{job="node",cluster="kubernetes",instance=~".*(master|ctrl).*:.*",device!~"lo",site="eqiad"}[5m])) / 1e6 [16:56:47] yeah, irate gives an even higher max [16:56:49] 272 [16:56:51] lol [16:56:53] mhm [16:57:02] even without otelcol it's still >100 [16:57:25] and the wikikube-ctrl hosts are quite high on tx as well honestly [16:57:29] there's a few layers to this one for sure [16:57:54] yeah, 1 is etcd, another one is the kube-scheduler and kube-controller-manager masters [16:58:09] there's 1 that is the webhooks, which go to pods running in the cluster [16:58:23] https://phabricator.wikimedia.org/T366094 just added some basics [16:58:40] thanks Hugh! [16:59:45] btw, the new colocated etcd/master machines are 1G IIRC [16:59:51] lol [16:59:58] cause they aren't really new [17:00:02] reused hardware [17:00:29] I would love to know how we are sending 200MiB/s of grpc [17:00:32] this is grpc right? [17:00:32] I don't think we expected a network saturation when we said "this 3 mw boxes go for etcd control plane cause memory and CPU" [17:00:40] even if it is json protobufs over http, wtf [17:00:58] some might be grpc but not all of it for sure [17:01:01] there must be something pathological happening here though [17:03:36] standing down as IC for now since it sounds like we're done with the immediate response for the moment, but I'll still be around [17:03:54] I have a few more things to touch up in the doc, which I'll do gradually in between meetings today [17:04:10] other hands on that would be great too :) [17:04:31] I'll try to spruce up things where I can [17:04:40] hnowlan: when last week were the codfw nodes added? [17:04:42] just to clarify - these issues didn't block deploys other than Tim's initial one? [17:04:53] there were two I believe [17:04:55] and I think Lucas_WMDE's later attempt [17:05:00] yeah [17:05:04] we have the logs from scap in a phaste for that one [17:05:11] cdanis: 2024-05-22 between 10:00 and 11:00 [17:05:24] ok so also before the otelcol change [17:05:42] yeah the relevant IPs are in the config in your initial change [17:05:46] yeah :) [17:06:54] wikikube-ctrl1002 was still doing like 80MByte/s of network TX btw [17:07:06] I don't know if that was just etcd or what? [17:08:21] o_o wow [17:08:31] during the deploys? [17:09:54] yeah [17:09:59] https://grafana.wikimedia.org/goto/0BAO_iyIg?orgId=1 [17:10:22] it was actually doing slightly *more* network TX than wikikube-ctrl1001 [17:10:29] during the last two [17:10:36] so I don't know what that means exactly [17:16:30] I can't think of anything else it could be other than etcd if it's cordoned [17:17:17] hnowlan: so it's shown in the output of `kubectl get endpoints kubernetes -oyaml` [17:17:23] interesting ...
80 MB/s is quite a bit more than the etcd grpc metrics would suggest (combined -> client and -> peer tx) [17:17:40] swfrench-wmf: yeah when I looked at the hosts, it was mostly kube-apiserver doing the talking [17:17:49] but I didn't look at wikikube-ctrl1002 specifically [17:17:56] well, I guess I'll have a chance during the scap deploy heh [17:19:12] I guess it could be getting api calls regardless, just not scheduling [17:19:36] which I guess brings us back to the original Then What Did That question [17:20:11] kube-apiserver is running there and has consumed an hour of CPU time since it was started almost five hours ago [17:20:25] so, I think it's safe to say that kube-apiserver on wikikube-ctrl1002 has been getting traffic [17:20:59] I'll get another pcap during the scap [17:23:18] dear people, can I deploy something now? [17:23:54] Amir1: please [17:24:22] hnowlan, cdanis: wikikube-ctrl1002 does show API traffic here https://grafana.wikimedia.org/goto/iSZ3umyIR?orgId=1 [17:24:44] starting at 12:39 and then jumping again at 14:10 [17:27:06] hmm... not a very high rate unless I'm misjudging this graph [17:27:08] oh that's very useful thanks rzl [17:27:24] compared to wikikube-ctrl1001 [17:27:24] no, much lower than the other servers if you remove the instance= to see them all [17:27:26] yeah [17:27:37] I am suspicious of the LIST calls [17:27:45] well, not shown on that graph, sorry [17:27:51] if you also include kubemaster1001 [17:27:56] https://grafana.wikimedia.org/goto/kmF7risIg?orgId=1 [17:31:35] hey I wonder if we have sflow data for the earlier events now [17:37:56] is there something special about kubernetes1048? it shows prominently in apiserver sflow traffic [17:38:58] is 10.67.x.y always a pod IP? [17:40:41] no, but 10.67.128.0/18 is [17:40:47] per https://netbox.wikimedia.org/ipam/prefixes/378/prefixes/ [17:41:04] telling the I/F sub-team member to check netbox 💀 [17:41:18] you can imagine me doing v.olans's voice if it helps any [17:41:24] anyway that all checks out with https://w.wiki/ADNj [17:42:27] I gotta head off for the evening, but I'll be back around in a few hours if needed [17:42:30] gluck [17:42:41] hnowlan: I think we're just not gonna touch anything else until EU tomorrow [17:42:54] thanks for all the work hnowlan, have a good night [17:44:25] rzl: do you have a "rolling restart 10 pods" command handy [17:45:37] I do not, sorry [17:45:53] I can see if I can dig one up but probably not faster than you [18:09:12] a kube-apiserver is really just a way of doing etcd->client fanout huh [18:34:33] rzl: https://docs.google.com/spreadsheets/d/1BDGubGYYSIGOQh6YF42bddKpGiMKS_Uc0ltW8n47Ye8/edit?usp=sharing [18:35:02] (in a meeting but half-looking) [18:35:08] this is wikikube-ctrl1002 traffic during one of the spikes during a rollout [18:35:14] just a random sample of 500k packets [18:35:27] which is about 1.3Gbyte [18:35:36] it's all just like ... normal looking stuff [18:36:54] unfortunately whatever was around on the node IPs themselves is no longer around [18:37:14] at least of the 3 i've spot checked so far [18:42:29] it looks like all of these connect to the API on a node: kubelet, kube-proxy, rsyslogd
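That list of node-local API clients can be confirmed per node by looking at established connections to the apiserver service address seen in the earlier captures; a minimal sketch (the VIP 10.2.2.8 and port 6443 are taken from the pastes above):
    # show which local processes hold TCP connections to the Kubernetes API endpoint
    sudo ss -tnp dst 10.2.2.8:6443
    # count those connections per program name
    sudo ss -tnp dst 10.2.2.8:6443 | awk -F'"' 'NF>1 {print $2}' | sort | uniq -c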