[09:01:05] hello folks
[09:01:17] kubestagetcd* nodes migrated to PKI correctly afaics
[09:09:55] serviceops, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes, Patch-For-Review: Switch wikikube-staging (codfw and eqiad) etcd clusters to use PKI - https://phabricator.wikimedia.org/T329717 (elukey) ` elukey@kubestagetcd1004:~$ etcdctl -C https://$(hostname -f):2379 cluster-health mem...
[09:10:18] serviceops, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes, Patch-For-Review: Switch wikikube-staging (codfw and eqiad) etcd clusters to use PKI - https://phabricator.wikimedia.org/T329717 (elukey)
[10:00:03] serviceops, Wikimedia-Developer-Portal, Goal, Patch-For-Review, Service-deployment-requests: New Service Request: developer-portal - https://phabricator.wikimedia.org/T297140 (akosiaris) Open→Resolved a:akosiaris I am gonna resolve this, apparently there isn't likely much telemetr...
[10:01:15] Heyo o/
[10:34:41] o/
[10:43:33] akosiaris: o/ do you recall if there were issues with hosts in row E/F for k8s 1.23?
[10:43:44] I am trying to debug why puppet fails only on dse nodes with 1.23
[10:43:50] and in row E/F
[10:46:06] not that I am aware of
[10:46:13] what do you see?
[10:49:10] so the reimage cookbook fails to run puppet, but I don't see detailed logs because probably the cookbook needs to finish first?
[10:49:29] anyway, an example is https://puppetboard.wikimedia.org/node/dse-k8s-worker1005.eqiad.wmnet
[10:49:45] can't log as root as well in the mgmt console
[10:50:18] I recall Janis saying something about row E/F but I can't recall now
[10:50:40] BGP stuff being different?
[10:51:55] could be but I'd have expected calico to complain in that case
[10:52:13] Fair
[10:52:13] I'll abort one of the reimages to see if I can get logs on cumin1001
[10:59:30] if it is in the installation phase you can get root to the install env
[10:59:50] and get logs from the d-i
[11:00:05] I wonder whether it even gets an IP
[11:00:28] oh wait, you said puppet fails, not the installation
[11:00:34] exactly yes
[11:00:46] but install_console, root login, etc.. don't work
[11:02:09] cumin1001:~$ sudo ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /root/.ssh/new_install root@dse-k8s-worker1005.eqiad.wmnet
[11:02:12] I just got access
[11:02:34] wow I didn't know this trick :D
[11:02:53] running puppet
[11:03:18] Warning: Unable to fetch my node definition, but the agent run will continue:
[11:03:18] Warning: Failed to open TCP connection to puppet:8140 (getaddrinfo: Temporary failure in name resolution)
[11:03:20] there we go
[11:03:25] what on earth...
[11:03:40] lovely...
[11:04:49] can't ping the DNS server
[11:05:22] can't even ping the bastion host
[11:05:30] how on earth did I manage to log in?
[11:05:36] Ok is that new_install ssh key trick documented?
[11:05:39] Because it needs to be
[11:06:11] I'm surprised install-console doesn't work, it runs the exact same command
[11:06:12] it definitely was in the past
[11:06:29] akosiaris: amazingly if I try to ssh from my machine to dse-k8s-worker1005.eqiad.wmnet
[11:06:31] ❯ ssh dse-k8s-worker1005.eqiad.wmnet
[11:06:33] cgoubert@dse-k8s-worker1005.eqiad.wmnet's password:
[11:06:45] yeah, what is going on?
[11:06:52] ¯\_(ツ)_/¯
[11:06:55] icmp/udp not working but SSH working?
[11:07:17] akosiaris: have you checked ferm/iptables?
[11:07:32] ah wait, I am logged in over IPv6
[11:07:40] so.. IPv6 works? but IPv4 doesn't?
[11:07:47] and yes, ping over icmpv6 works
[11:08:00] ok this is a very good clue
[11:08:10] should we ping Arzhel/Cathal?
[11:08:22] claime: I can't even ping the default gw... I doubt firewalling is an issue
[11:08:30] yeah
[11:08:39] akosiaris: yeah it was a reflex question, soz
[11:08:39] this needs network eyes I think
[11:08:53] I'll ping them on the k8s chan :)
[11:09:33] make sure they don't end up thinking it's k8s related, cause it isn't
[11:15:41] claime: iptables -nxvL
[11:15:41] -bash: iptables: command not found
[11:16:00] just making sure to go down that path too, so that we don't meet any surprises
[11:16:03] akosiaris: That's good.
[11:16:06] lol
[11:16:16] and then having me banging my head against a wall for not listening to you
[13:41:20] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Trizek-WMF)
[14:06:53] serviceops, ChangeProp, Content-Transform-Team-WIP, Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (akosiaris) Hey everyone, Sorry for taking so long to respond, other stuff took priority >>! In...
[14:42:42] serviceops, SRE, CommRel-Specialists-Support (Jan-Mar-2023), Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Trizek-WMF) I checked all translations regarding time and date. I had to fix all of them manually, at least the one...
[14:58:17] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: etcd cluster reimage strategies to use with the K8s upgrade cookbook - https://phabricator.wikimedia.org/T330060 (akosiaris) The experience above matches my own, when I have to add/remove no...
[18:26:13] serviceops, Infrastructure-Foundations, SRE, conftool, and 2 others: Scap deploy failed to depool codfw servers - https://phabricator.wikimedia.org/T327041 (Papaul)
[18:42:47] serviceops, Performance-Team, Patch-For-Review: Rewrite mw-warmup.js in Python - https://phabricator.wikimedia.org/T288867 (RLazarus) In progress→Resolved
[23:51:42] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Dzahn)