[08:30:28] hi folks, upgrading DSE to 1.23
[11:09:13] XioNoX: o/
[11:09:36] elukey: YO!
[11:09:39] yo
[11:09:51] we have an interesting issue with reimaging and nodes in row E/F (dse-k8s-worker100[5-8])
[11:10:03] interesting? :)
[11:10:19] Alex checked and it seems that ipv4 doesn't work, but only ipv6 does
[11:10:43] XioNoX: cumin1001:~$ sudo ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /root/.ssh/new_install root@dse-k8s-worker1005.eqiad.wmnet
[11:10:48] and try to ping the ipv4 default gw
[11:10:54] IPv6 seems to work fine up to now
[11:11:09] checking
[11:11:31] it does have an arp entry for the default gw
[11:11:57] ah wait we had a lot of issues with ARP and hosts in E/F
[11:12:11] we did, but this doesn't appear to be arp related
[11:12:19] or at least, not yet
[11:12:45] ip neigh show returns what I would expect in a fully functioning machine
[11:14:29] only the gateway too, other hosts in the same vlan seems fine
[11:17:00] I think I remember something a bit similar
[11:17:03] trying to find the task
[11:18:13] https://phabricator.wikimedia.org/T306421
[11:21:55] yeah that solved it
[11:21:58] shall we try the workaround?
[11:22:01] ah perfect :D
[11:22:11] in theory the re-image cookbook have the workaround implemented
[11:22:30] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hosts/reimage.py#L423
[11:22:45] did the re-image happened differently this time?
[11:23:54] not really, I used the reimage cookbook
[11:24:21] weird
[11:24:30] does it happen before/after the first puppet run?
[11:24:56] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hosts/reimage.py#L574
[11:24:59] IIUC it is after
[11:25:07] so this may explains it
[11:25:26] we'd need to do it (maybe?) before puppet runs?
[11:26:02] (only clear ethernet-switching mac-ip-table I mean)
[11:27:36] possibly, yeah, it should be right after anything DHCP is done
[11:29:59] volans|off: https://phabricator.wikimedia.org/T306421#8643842 wink wink
[11:30:23] elukey: can you take care of the other stuck hosts? I'm about to go to lunch
[11:32:09] XioNoX: sure but no idea what I have to do :D
[11:32:38] ah I see the task okok
[11:32:51] I am going to lunch as well, will kick off the reimages later on
[11:32:56] (the hosts are downtimed)
[11:33:00] cc: btullis: --^
[11:35:03] Ack, following along, thanks.
[14:23:55] cleared all the nodes, restarted the reimages :)
[14:34:04] and they are all failing sigh :(
[14:34:05] https://phabricator.wikimedia.org/T330261#8644349
[14:34:12] XioNoX: --^ (if you have time)
[14:34:30] before starting them I checked via install console that puppet ran correctly
[14:34:38] *starting the reimage for them
[14:37:27] yeah looks similar
[14:39:21] elukey: it's working again now
[14:39:33] I cleared it, etc
[14:42:40] XioNoX: sure but I had done the same just before the reimage
[14:43:23] (both commands that you indicated in the task)
[14:43:31] I can retry if you want but it looks weird
[14:43:33] elukey: yeah it's because it's the DHCP query during the re-image that triggers the bug
[14:43:55] ah ok so until we fix the cookbook we are doomed to fail
[14:44:38] or we need to run the clear commands manually with the good timing
[14:44:52] a bit after dhcp is not needed anymore
[14:45:09] nah let's use the hosts to test the fix when it will be ready
[14:45:25] dunno how the cookbook behaves, does it halt and wait for some user input?
[14:45:30] or just fails?
[14:45:50] it just fails
[14:49:26] ok
[14:51:29] correcting myself - in two cases it failed, in other two cases it re-appered the original bug (so it asked for a retry)
[14:51:58] btullis: I've set 7 days of downtime for dse-k8s-workers100[5-8] and set the task to stalled, but the rest of the cluster works if you want to experiment :)
[14:52:34] elukey: Ack, thanks.
[15:09:08] elukey: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/891848
[15:15:15] XioNoX: I see yes.. so at that stage in theory puppet and facter should be there right? And yes we'd need to figure out if it works or not.. Since it seems only something for row E/F and that the cookbook is already broken for this use case, I'd be inclined to merge/test, but maybe somebody from Infra foundations should +1
[15:15:53] elukey: and it's friday evening :)
[15:16:02] that too
[15:16:06] I think that we can wait for Monday
[15:16:23] fine to merge now if we get more +1s
[15:16:42] at least this works: root@dse-k8s-worker1006:~# facter -p networking.interfaces.eno1np0.mac
[15:16:42] 5c:6f:69:28:81:8a
[16:31:27] heading out, have a good weekend folks!
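
The ordering change discussed above (clear the switch's mac-ip-table right after DHCP is no longer needed, instead of after the first puppet run) can be sketched roughly as follows. This is only an illustration of the reordering idea, not the reimage cookbook's code: the step names and the `move_step_after` helper are hypothetical, and the real change is the Gerrit patch linked in the log.

```python
# Hypothetical step names; the real cookbook's structure is not shown in this log.
CURRENT_ORDER = [
    "dhcp_and_install",     # the DHCP query here is what triggers the bug
    "first_puppet_run",
    "clear_mac_ip_table",   # the workaround currently runs last
]


def move_step_after(steps, step, anchor):
    """Return a new list with `step` placed immediately after `anchor`."""
    remaining = [s for s in steps if s != step]
    idx = remaining.index(anchor)
    return remaining[: idx + 1] + [step] + remaining[idx + 1:]


# Proposed order: clear the table as soon as DHCP is done, before puppet runs.
PROPOSED_ORDER = move_step_after(
    CURRENT_ORDER, "clear_mac_ip_table", "dhcp_and_install"
)
print(PROPOSED_ORDER)
```

Whether the host's MAC is already available at that earlier stage is the open question raised in the log; the `facter -p networking.interfaces.eno1np0.mac` check suggests it is, at least on dse-k8s-worker1006.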