[09:11:05] jbond: o/ [09:11:33] if you have time https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/891848 [09:14:31] * jbond looking [09:16:06] <3 [09:16:29] done [09:16:33] this is basically to test if the workaround can allow reimages of row E/F nodes [09:16:51] super! I'll merge and test, will report asap [09:18:09] sgtm [09:28:03] oh, I'm having issues trying to reimage a host on row E, (cloudcephosd1003) would that fix it? [09:28:54] dcaro: o/ it depends what your issue is! I got some weird problems when running puppet [09:29:03] what I'm seeing is that the host is rebooted using IPMI, then boots idrac, and console goes blank, ending in a timeout on the cookbook side [09:29:04] because network was broken basically [09:29:37] mmm it seems a little different [09:30:21] but it is worth a try :) the code is already merged [09:33:04] nice, I'll try again just in case [09:33:09] thanks! [09:34:03] dcaro: my run just failed :D [09:34:50] jbond: fine if I merge spdx: update spdx new files to ignore files regardless of path (2e341eb5be) on puppet master?
[09:35:33] oof the error is "facter -p networking.interfaces.eno1np0.mac" [09:35:45] so facter is not available (probably) at that stage [09:36:18] tricky one [09:36:25] chicken and egg kinda thing [09:38:37] reverting since it worsens the behavior of the cookbook [09:39:11] oops, spicerack.dhcp.DHCPError: target file ttyS1-115200/cloudcephosd1003.conf exists [09:39:20] might that be related? [09:40:13] nono it is different in theory [09:42:43] dcaro: just rm that file, it means that a previous cookbook run was killed -9 or ctrl+c twice and didn't have a chance to cleanup [09:43:12] volans: o/ if you have suggestions for the issue above I am all ears.. [09:43:26] elukey: do you still have your host there after the reboot after d-i finished? [09:43:44] volans: didn't touch it after the cookbook failed [09:43:46] is that dse-k8s-worker1006.eqiad.wmnet ? [09:43:51] exactly yes [09:44:12] ok let me have a look [09:49:32] elukey: did you change anything on the host? [09:59:11] nice, this time I see something on the console, but it seems to fail to dhcp [10:00:29] I can see the dhcp file was created, maybe there's something needed to boot from cloudsw on E4? [10:00:37] jbond, volans: ^ do you know? [10:02:54] IIRC there was something special about those due to an issue with junos and the dhcp relay setting, but I'll defer to topranks for this [10:03:16] I think he might be on holidays, XioNoX might know though :) [10:03:32] did you check the dhcp logs on the install server and/or tcpdump it to see if the requests came to the install server? [10:04:21] https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#DHCP_issues [10:04:22] nope, I can do [10:07:33] hmm... any easy way to get the mac address?
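The `spicerack.dhcp.DHCPError: target file ... exists` failure above is cleared by removing the leftover snippet by hand; a minimal sketch of that cleanup, assuming the snippet sits under a base directory named after the tty (the path layout is inferred from the error message, not spicerack's real internals):

```python
from pathlib import Path


def clear_stale_dhcp_snippet(base_dir: str, hostname: str, tty: str = "ttyS1-115200") -> bool:
    """Remove a leftover per-host DHCP snippet, e.g. after a cookbook run
    was killed -9 or ctrl+c'd twice and never got a chance to clean up.

    Returns True if a stale file was removed, False if there was nothing to do.
    """
    snippet = Path(base_dir) / tty / f"{hostname}.conf"
    if snippet.is_file():
        snippet.unlink()
        return True
    return False
```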
xd [10:07:45] dhcp seems to be configured using the switch port instead [10:07:50] but the logs log by mac [10:08:46] ah, I see the DHCPACK with the correct ip 10.64.148.16 [10:09:33] but the server fails to get dhcp :/, looking [10:12:19] hey, late to the party [10:13:05] dcaro: I need to catch up on the issue elukey mentions about row e/f hosts, but that likely doesn’t relate to your issue if it’s connected to cloudsw1-e4 (different setup on cloudsw) [10:13:36] dcaro: is DHCP failing on PXEboot or during the Debian installer phase do you know? [10:14:06] topranks: I think it's debian installer this time (I see the fancy ui to configure network) [10:16:27] topranks: o/ [10:17:56] dcaro: most likely reason for something to fail at that stage (having successfully DHCP'd at PXEboot stage), is the NIC firmware [10:18:00] let me have a look [10:18:14] sounds reasonable yep [10:21:09] dcaro: yeah the firmware on the NIC is on 21.40.2, which I believe can cause issues [10:21:25] I think it needs to be upgraded as described here: [10:21:26] https://phabricator.wikimedia.org/T329498#8618752 [10:21:38] Otherwise the kernel in the debian installer fails to initialise the card properly [10:21:45] port is hard down so no DHCP is actually getting sent [10:22:11] jbond nfraison we've had the commits pending to merge for almost an hour now, can we please move forward? [10:22:17] ack, is that something I can do? or should I fw to dcops? [10:22:22] Just a/b steps under item 1. [10:23:06] DC-ops can handle it yep. If there is urgency you can do it too, should just be a matter of running the cookbook. [10:23:15] sudo cookbook sre.hardware.upgrade-firmware -n -c nic [10:23:18] marostegui fine for me I'm just waiting ack from jbond to merge [10:23:29] elukey: hi! [10:23:36] topranks: thanks! I'll try myself first :) [10:23:50] cool :) [10:23:51] nfraison: Yeah, I don't think we can have the repo waiting for one hour.
So I am going to go ahead and merge both, if something breaks I will revert both [10:23:53] nfraison: please do [10:24:11] volans: the work-around for the evpn switches in row e/f has been working well, we've not had reimage issues since then [10:24:16] fyi (and thanks again :)) [10:24:17] topranks: hi! Don't you like to start your week with some row E/F weirdness? :D [10:24:27] absolutely I do :) [10:24:31] aahaha :D [10:24:38] so I am planning to merge https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/892377/ [10:24:41] and test it again [10:24:53] it is a variation of what Arzhel filed on Friday [10:24:55] topranks: yes I was wondering if there was still some special setting needed for the cloudsw, that was my doubt [10:25:05] sorry marostegui nfraison I must have missed the ping [10:25:48] volans: cloudsw is a "layer 3" switch, with routing, so some similarities to the ones in e/f [10:26:08] but it's not running evpn/vxlan, and doesn't suffer from the issue we had there, so no additional command needed on it [10:26:33] elukey: ok yep, looking at it I'm not seeing why there may be an issue in e/f, but I take it there is some problem?
[10:27:23] topranks: basically https://phabricator.wikimedia.org/T306421#8643842, I am trying a reimage again but last week I got stuck only with hosts in row E/F [10:28:14] ah ok yep, and the issue is no mac in facter at that point [10:28:45] I'm not sure we need to run that earlier [10:28:59] to explain quickly the issue is that by default the switch "snoops" on DHCP messages [10:29:24] and caches the returned IP address [10:30:07] if we do the clear later on, puppet fails during its run, ipv4 networking is broken (but ipv6 works) [10:30:20] Due to a bug in this version of JunOS (a 'won't fix') the switch deletes the MAC address from the forwarding table 10 hours after DHCP completes (at the time the DHCP lease expires, by which stage we've changed to a static IP config on the host) [10:30:45] elukey: that makes sense, in that the issue is only with IPv4/DHCP assigned IP [10:31:30] what's confusing me is what has changed; previously when hitting this issue things worked fine for 10 hours, then IPv4 stopped. [10:31:44] In theory we should have 10 hours to issue those commands on the switch, with it working during that time [10:31:46] hmmm [10:34:25] so in terms of the patches I'm not entirely sure we need to run that command any earlier [10:35:19] elukey: I'd be interested to try and work out what happened the times it failed [10:35:35] IPv4 networking was broken right away on the host coming up after OS installation? [10:36:37] elukey: is there one of those hosts I can kick off a re-image of and do some checks to see what's going on? [10:36:42] or what is current status?
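The two clear operations implied above (drop the snooped DHCP binding, then drop the stale mac-ip-table entry) can be sketched as follows; the command strings are paraphrased from this conversation, not exact JunOS syntax, and `run_cmd` stands in for whatever executes CLI commands on the switch:

```python
def clear_snooping_state(run_cmd, ip: str, mac: str) -> None:
    """Clear the switch's cached DHCP/ARP state for a freshly imaged host.

    Validating the MAC first matters: if the binding is cleared but the
    mac-ip-table clear then fails (e.g. a bogus MAC retrieved from facter),
    the host lands prematurely in the broken state described above
    (IPv4 dead, IPv6 fine).
    """
    if mac.count(":") != 5 or len(mac) != 17:
        raise ValueError(f"refusing to run clears with a suspect MAC: {mac!r}")
    run_cmd(f"clear dhcp binding {ip}")          # paraphrased, not real JunOS syntax
    run_cmd(f"clear mac-ip-table entry {mac}")   # paraphrased, not real JunOS syntax
```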
[10:38:27] elukey: topranks: volans: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/892378 [10:38:38] topranks: I kicked off a reimage of dse-k8s-worker1006 but the cookbook is failing to find if the host is in d-i or not, so probably something else happened [10:38:38] I'm pretty sure this attempt will fail [10:39:14] ah lovely [10:39:30] topranks: so ipv4 networking was not working for the first puppet run, d-i worked fine [10:40:41] jbond: ok thanks, not 100% on the requirement for that but take your word on the fix being needed there [10:41:12] topranks: if you want to check, dse-k8s-worker1006 is currently not working (I believe) due to the network issue outlined above [10:41:23] there is no requirement for the additional switch really but -p and --no-custom-facts conflict with each other and make facter refuse to run [10:41:29] it fails to check if the host is in d-i though [10:41:46] I believe that I should have cleared its status on the switch before attempting to reimage it [10:43:55] jbond: ok thanks for that [10:44:50] topranks: I can try to manually clear dse-k8s-worker1006's status on the switch and kick off the reimage cookbook again, so we can check [10:45:18] dse-k8s-worker1006 is giving me a login prompt on the console, but doesn't seem to want to accept root pw [10:46:37] I think it didn't reimage correctly, the cookbook failed to check if it was in d-i or not.. [10:46:43] https://www.irccloud.com/pastebin/c7e6nJXa/ [10:46:44] if you check install_console it doesn't work [10:46:47] topranks: ^ [10:47:33] dcaro: hmm, not an expert on that error, give me a sec I'll have a look [10:47:53] elukey: kicking off the reimage again is a good idea if you can, I'll take care of the switch and check status there [10:49:22] topranks: super, lemme know when the dhcp status is cleared so I'll kick it off again [10:49:25] dcaro: you are running the firmware upgrade cookbook?
I can take a look at that; which cumin host did you use [10:50:01] elukey: fyi that facter change is merged and deployed [10:50:02] jbond: root@cumin1001:~# sudo cookbook sre.hardware.upgrade-firmware -n -c nic cloudcephosd1003 [10:50:08] cumin1001 [10:50:15] ack let me check the error [10:50:33] jbond: <3 [11:01:22] elukey: looking at the switch logs from last week I only see that you and Arzhel ran the 'clear' commands [11:01:42] i.e. I don't see evidence the reimage cookbook successfully ran that command at any point [11:04:11] topranks: correct yes.. I did two reimages this morning: with Arzhel's fix (to move the clear command earlier on), the facter command to retrieve the mac failed (for what reason it is unclear). The second time the issue appeared even before d-i, so the cookbook failed to check if d-i was running or not and asked if it was ok to keep going. [11:04:59] the latter I believe failed due to networking issues, since I tried install_console and it hung [11:05:05] ok... let's retry again now. [11:05:19] all right [11:05:22] The cookbook now has the clear commands running at the same time as before any recent changes right? [11:05:25] just has John's fix? [11:06:42] it also runs earlier [11:07:33] let's change that first, I don't believe we should need to run it earlier [11:07:57] or at least I don't have a proper understanding as to why we might need it, so let's set it back before we confirm any change is needed [11:08:00] ok I already kicked off the reimage :D [11:08:28] topranks: Arzhel's idea was to clear the dhcp/arp/etc..
state before the first puppet run happens [11:08:41] that doesn't sound bad [11:08:50] I can't see why that would be needed though [11:09:04] sure maybe we should do it earlier/later, but I can't see that being a problem as such [11:09:17] in my case last week it was the first puppet run failing [11:09:36] yeah I'd like to know what's going on there [11:09:40] elukey: just to clarify the timeline, the first reimage you ran failed for this issue? [11:09:50] what I do know is the cookbook _did not_ run that command at any stage [11:09:53] early or late [11:09:58] I was wondering if maybe it failed before and the clear was not issued and in cascade caused all the subsequent issues [11:10:05] it'd show in the switch logs, but only thing there is you and Arzhel doing it manually [11:10:32] volans: possibly [11:10:48] The previous issue, to be clear, caused hosts to lose comms 10 hours after reimage [11:10:59] When the original DHCP lease time expired [11:11:14] There were never any problems with comms to the host during reimage or shortly after (during puppet runs etc) [11:11:33] if that's the case we could run the clear also in the rollback [11:11:46] at least to make sure unrelated failures don't cause this issue [11:12:13] volans: hmm yep that's not a bad idea [11:12:34] either unconditionally or if-guarded by a self.rollback_clear variable to be set at the right time [11:12:38] that could cause some edge-case issues if it runs multiple times perhaps [11:14:07] volans: yes correct, it failed for all the dse nodes in row E/F. Then we discovered the issue about ipv4 networking, and Arzhel tried to fix one of them issuing the clear commands on the switch. I did the same for the rest of the nodes, and retried the reimages [11:14:11] that failed again, same reason [11:14:46] topranks: since the reimage is ongoing I'll wait to see if it fails or not, and then revert the clear() patch if needed, would it be ok?
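The rollback idea volans floats above can be sketched like this; the attribute name mirrors the `self.rollback_clear` suggestion from the conversation, everything else is illustrative and not the real cookbook:

```python
class ReimageRunnerSketch:
    """Toy model of guarding the switch 'clear' commands with a rollback flag."""

    def __init__(self) -> None:
        # set to True as soon as d-i has obtained a DHCP lease, i.e. once
        # there is snooped state on the switch worth clearing
        self.rollback_clear = False
        self.clears_issued = 0

    def _clear_dhcp_cache(self) -> None:
        # the real method would issue the two switch clear commands here
        self.clears_issued += 1
        self.rollback_clear = False

    def rollback(self) -> None:
        # run the clears even on unrelated failures, so a killed or failed
        # cookbook run can't leave stale snooping state behind; the guard
        # keeps the clears from running twice (the edge case noted above)
        if self.rollback_clear:
            self._clear_dhcp_cache()
```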
[11:14:55] yes let's just see what state it lands in [11:15:20] my understanding is the timing of that shouldn't matter, not a big deal if it happens early [11:20:22] Ok so dse-k8s-worker1006 OS install completed, and it's now rebooted into debian [11:20:27] pingable from the switch [11:20:54] \o/ [11:22:13] again I can't login to console, rejects root pw? [11:22:43] SSH asks me for pw, but I assume that's expected, puppet will add user SSH keys? [11:23:14] and now I can't ping... hmm. [11:23:37] Oh I responded too early, sorry. [11:24:17] btullis: yeah, unfortunately [11:24:33] topranks: so the facter command seems to have failed [11:24:37] so it does seem this mac-ip table mismatch has happened, and a lot sooner than I'd have expected... [11:24:42] elukey: ok [11:24:58] so the cookbook failed as well :( [11:25:53] topranks: you can ssh with the install console keyfile from a cumin host [11:25:54] volans: no output that I can see from the facter command [11:26:23] topranks: sudo ssh -i /root/.ssh/new_install -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no dse-k8s-worker1006.eqiad.wmnet [11:26:36] the install-console wrapper forces -4, so doesn't work [11:26:43] elukey: ack [11:27:58] elukey: we should set print_progress_bars=False, print_output=False [11:28:02] volans: looking on install1004 my reading is the DHCP lease time is 10 mins? [11:28:03] to true to see what's happening [11:28:25] topranks: looking, the automation didn't touch that [11:28:30] does that sound right? I can get a pcap to confirm. 10 mins is maybe short enough to trigger this during install [11:28:50] I see these lines, seem default as you say, but unsure what to expect: [11:28:55] dhcpd.conf.dpkg-dist:default-lease-time 600; [11:28:55] dhcpd.conf.dpkg-dist:max-lease-time 7200; [11:29:08] that file is not used [11:30:10] but is probably the default value and AFAICT we don't set it [11:30:19] topranks: but when should that happen?
[11:30:24] once d-i gets the IP [11:30:27] ok, don't see it elsewhere so probably is [11:30:27] it then staticizes it [11:30:43] I'll try to confirm (we'll likely need to try this again :( ) [11:31:21] volans: yep once d-i gets its IP the switch will cache the DHCP offer, when that expires it triggers this edge case of clearing another table and causing our problem [11:31:34] I *thought* last time it was a lot longer, but maybe I am remembering that incorrectly [11:31:41] elukey: the weird part is the 255 from cumin, it means ssh failed to connect, if you retry I'd set on line 419 print_output=True [11:32:06] we can totally increase the lease time if that helps [11:32:10] I don't see any drawback [11:32:30] it could be a race-condition based on how long the installer then takes [11:32:56] The only complication is that if the reimage cookbook issues the two clear commands correctly the problem should not occur, or at least should be corrected when they run [11:33:03] STOP [11:33:16] the _clear_dhcp_cache is written to work AFTER the puppet run [11:33:18] can't work as is [11:33:48] need to use self.remote_installer instead of self.remote_host [11:33:56] to connect with the installer ssh key [11:33:58] elukey: ^^^ [11:34:11] let me send a patch [11:34:14] volans: ok, hmm, I'm guessing it's AFTER, cos we want the MAC address for the 'clear' command [11:34:30] to summarize: [11:34:53] - if the _clear_dhcp_cache is called as before *after* puppet, we can revert all changes [11:35:08] - if the _clear_dhcp_cache needs to run *before* puppet, it needs to use self.remote_installer instead of self.remote_host [11:35:22] depending on when you want to run it we can patch it accordingly [11:35:35] I'd be in favor of trying self.remote_installer [11:36:04] I wasn't aware of the distinction [11:36:04] * volans afk for ~5m [11:36:34] I need to run afk for a couple of hours, please reimage 1006 or the other nodes if you need to test, otherwise I'll pick it up later on :) [11:36:39] 
thanks all for the moment! [11:36:57] From the network point of view it has to run before the DHCP lease-time expires [11:37:07] But can be run at any time after the DHCP offer is issued during d-i [11:37:16] So no harm doing it earlier [11:40:06] topranks: do you mean that we could run it even before d-i finishes, as long as it started? [11:40:40] No it needs to run after d-i issues the DHCP discover and gets the offer/ack back. [11:40:53] So during d-i, but only after it brings up the network [11:42:03] Increasing the lease-time may give us more flexibility here, if indeed the lease is only 10 min [11:44:45] elukey: Can I re-run the reimage cookbook to get a pcap of the DHCP messages? [11:52:34] I'm running now btw :P [11:52:51] hmmm [11:53:06] did I miss something? DNS lookup of upload.wm.o in drmrs is sending me to eqiad [11:57:29] volans: so the lease time is 12 hours, shouldn't complicate / add a race condition we need to worry about [12:00:05] ack [12:05:38] vgutierrez: so, the hosts with a public IP in drmrs resolve to drmrs, the ones with private IPs resolve to eqiad [12:05:57] volans: yeah.. I was testing that on cp6001 [12:06:01] (see cumin 'A:drmrs' 'dig +short upload.wikimedia.org' ) [12:06:12] I was used to being able to curl upload.wikimedia.org there and go to the localhost [12:06:46] esams is weirder, it's a mix [12:07:37] ulsfo and eqsin are the same as drmrs [12:11:01] * volans lunch [12:48:55] topranks: the reimage cookbook now failed when running '/usr/bin/facter --no-custom-facts --no-external-facts networking.interfaces.ens2f0np0.mac', shouldn't it be using the newer interface names instead? (should I change the interface names in netbox manually?)
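For reference, the ISC dhcpd knobs discussed above look like this in `dhcpd.conf`; the values here are illustrative, not the deployed config (the unused dpkg-dist file quoted earlier had 600/7200 seconds, while the lease actually in effect turned out to be 12 hours, which matches ISC dhcpd's own built-in default when the options are left unset):

```
# Global lease settings in dhcpd.conf. When these are not set anywhere,
# ISC dhcpd falls back to its compiled-in defaults, which are believed
# to be default-lease-time 43200 (12h) and max-lease-time 86400 (24h).
default-lease-time 43200;   # 12 hours
max-lease-time 86400;
```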
[12:50:14] dcaro: ooh, that might be a factor in why we can't run facter for the other issue too :) [12:50:18] hadn't considered that [12:50:33] xd [12:51:18] now in your case, on cloudsw, we don't need to run that _clear_dhcp_cache() at all [12:51:38] but I expect it's running because the logic is applied based on anything in row E or row F [12:52:01] and it's harmless to do, except now with the interface rename it's causing a problem [12:52:28] let's wait a few mins till v.olans is back from lunch [12:52:54] I think we may want to change when that runs until after puppet, which should re-import names from host to Netbox [12:53:47] I think changing it in Netbox will solve your issue, but we should aim for something less manual [13:01:33] ack, I can wait :), np [13:01:39] elukey: were you updating these hosts from buster to bullseye? [13:13:29] out of curiosity did you find `facter --no-custom-facts --no-external-facts networking.mac` to be unstable? that should give the correct mac address (whichever interface currently has the default gw on it) cc volans [13:15:18] jbond: ah ok that may indeed help, if the interface renaming is complicating things (which I think it is in David's case) [13:20:49] sent the following CR but let's wait for volans before merging. However you should be able to test from cumin2002 with the following command [13:20:51] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/892438 [13:20:57] dcaro: fyi ^^ [13:21:05] wrong command [13:21:07] sudo cookbook -vvvv -c /home/jbond/cookbook.yaml sre.hosts.reimage [13:21:41] jbond: nice! let me try [13:22:36] jbond: nice, thanks.
[13:23:15] volans: on the overall issue I'd like to move the running of the clear_dhcp_cache() back to after puppet runs [13:23:47] it's not an issue to run earlier - we just need to change to self.remote_installer() - but I'd rather understand what is going on properly [13:29:54] * volans back [13:31:09] is the theory that we should not rely on netbox and pick whatever facter returns? IIRC we had issues with the primary iface being wrongly returned by facter [13:31:17] in some corner cases [13:32:16] what's the current theory of why it failed in the first place for luca? [13:35:21] volans: when you say we had issues with facter returning the primary interface do you mean recently or at some point in the past? if the latter then I think this is fixed (both upstream and with local patches) [13:36:04] it's fixed until the next corner case :D [13:36:25] I meant more in the past and more than once, hence my limited trust ;) [13:37:26] volans: facter used to just pick the first interface in the list. it now uses the routing table to pick the primary interface so unless there are two default routes it should work [13:38:03] for the record most of puppet uses networking.ip/ip6 or ipaddress ipaddress6 which use the same logic so I think if it was still a general issue we would have seen problems by now [13:39:45] ack [13:42:10] jbond: worked! \o/ [13:42:13] reimage done [13:42:26] dcaro: woot!
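The routing-table logic jbond describes (newer facter picking the primary interface as the one carrying the default route, rather than the first interface in the list) boils down to something like this; the data shape is illustrative, not facter's real internals:

```python
def primary_interface(routes):
    """Return the interface holding the default route, or None.

    With exactly one default route this is unambiguous; with two default
    routes the answer depends on ordering, which is exactly the corner
    case volans worries about above.
    """
    for route in routes:
        if route.get("destination") in ("default", "0.0.0.0/0"):
            return route.get("interface")
    return None
```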
[13:42:53] wait no xd [13:43:07] I tailed the wrong log [13:44:29] I think where we are currently running that function we'd need to use self.remote_installer() instead of self.remote_host(), based on volans' previous explanation [13:44:50] without that it will never work [13:44:57] However I'd like to merge this and move the execution of this to after puppet runs: [13:44:58] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/892442 [13:45:04] https://www.irccloud.com/pastebin/qlFELX4p/ [13:45:12] ^it failed in clear_dhcp_cache [13:45:13] but if there is no reason to run it earlier, better later after puppet, seems safer [13:45:39] * jbond can't find the facter PR that fixed this but the initial bug is https://tickets.puppetlabs.com/browse/FACT-380 [13:45:45] volans: Yeah. I'd like to identify the reason it needs to be run earlier, if indeed it does [13:48:13] hmm... I don't see any logs for my run :/ [13:51:01] topranks: so let's go with after puppet and not picking the iface from netbox but using networking.mac instead? [13:52:30] volans: sgtm, let me update my patch to include the updated facter command [13:57:45] dcaro: logs for your run will be in /home/jbond/logs as you used my custom config [13:57:55] oooohhh, true, looking [13:58:09] volans: can you review this: [13:58:09] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/892442 [13:58:11] clear_dhcp_cache is currently broken [13:58:13] it can't work [13:58:18] topranks: already left a comment [14:03:09] jbond: it failed on '/usr/bin/facter --no-custom-facts --no-external-facts networking.mac' also :/ [14:03:21] rc=255 [14:03:31] (probably did not find facter?) [14:03:58] wait, no, that would be 127 [14:04:32] volans: thanks :) [14:04:44] good spot on the command, I've uploaded a new patchset now [14:05:55] claime: re: switchover, master is what is currently deployed in production, right? (e.g. 
https://gerrit.wikimedia.org/g/operations/cookbooks/%2B/master/cookbooks/sre/switchdc/mediawiki/02-set-readonly.py ) [14:06:50] dcaro: as I said above, that couldn't work [14:07:47] volans: okok, missed that comment (thought it was unrelated), let me know when/if I can retry [14:08:38] topranks: +1ed [14:08:45] super thanks :) [14:09:10] I'll merge and try to reimage dse-k8s-worker1006 again see what happens [14:09:38] ack [14:12:04] jynus: yes [14:25:09] topranks: o/ [14:25:33] elukey: hey [14:25:46] I'm just reimaging that host again, it's past d-i and rebooting now [14:26:03] ah super! [14:26:07] lemme know if I can help [14:26:27] np yep... I want to see what happens now (situation is kind of like it was before any changes last week) [14:27:37] ok so host is up with newly installed OS and pingable [14:27:58] being added to puppet, will see how it goes [14:28:28] I'd expect it to fail during the first puppet run otherwise this would be really weird [14:29:37] yeah exactly, I'm running a few traces on the switch to try and work out what makes the comms break if it does [14:50:13] elukey: well it seemed to work fine [14:50:16] ¯\_(ツ)_/¯ [14:50:51] Reimage completed successfully, comms was not interrupted, the switch clear commands got issued at the end to prevent the issue happening in 12 hours on DHCP lease expiry [14:51:02] I can't really explain, I guess we should try the others? [14:51:45] topranks: sigh :( [14:51:52] I mean happy that it worked, but I am really puzzled [14:52:02] Only change we made was to the command to get the main interface MAC, but that should not have had any influence on puppet failing [14:52:03] +1 for proceed, should we clear the status of the switch? [14:52:10] yeah [14:52:35] No I think you can go ahead, switch is healthy (there are no dhcp bindings there now) [14:53:38] all right proceeding!
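The `self.remote_installer` / `self.remote_host` distinction volans explained above can be sketched like this; the two attribute names are quoted from the conversation, the function itself is illustrative, not the real spicerack API:

```python
def pick_remote(first_puppet_run_done: bool, remote_host, remote_installer):
    """Choose which remote a reimage step should run its command through.

    Until the first puppet run has installed production credentials, the
    host only answers to the installer SSH key, so anything executed
    earlier (like a pre-puppet _clear_dhcp_cache variant) must go through
    the installer remote, or every SSH attempt fails (the rc=255 seen
    from cumin above).
    """
    return remote_host if first_puppet_run_done else remote_installer
```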
[14:53:43] thanks a lot for the help, will report back [14:54:07] What would cause the comms issue to occur earlier than DHCP expiry is if the first command the cookbook runs was successful [14:54:13] (clear dhcp binding) [14:54:16] But the second failed [14:54:21] (clear mac-ip-table ) [14:54:35] yep yep makes sense [14:54:48] If for some odd reason the retrieved mac was wrong, then the second command would not work, and put us prematurely in the scenario we had. [14:54:58] But as all that runs after puppet and reboot, I'm still scratching my head [14:55:08] let's see what happens on another one [14:56:32] I also have other nodes in row E/F to reimage (ml-serve nodes) during the next days, so we'll battle-test this in any case :) [14:59:26] yeah, it is odd, and that they all failed means it wasn't some random thing [15:02:51] I guess dcaro should be able to retry now topranks ? [15:03:03] 👀 [15:03:16] dcaro: yes indeed, sorry should have said [15:03:24] \o/, retrying [15:03:26] you can give that another try now, let's hope it's ok :) [15:03:51] cumin1001, regular code? (not jbond's fork) [15:04:49] topranks: ^ [15:05:40] dcaro: yes, prod code [15:05:45] 👍 [15:06:00] dcaro: just to be sure [15:06:03] run puppet first [15:06:13] on the cumin host, not sure if cathal forced the deploy on both hosts [15:06:17] or just cumin1001 [15:06:29] oops, too late, ran it on cumin1001 [15:06:41] that one should be up-to-date [15:06:44] I did it on cumin1001 so you should be good [15:06:46] awesome :) [15:14:25] topranks: on dse-k8s-worker1007 the first puppet run failed :( [15:14:39] hmm ok [15:14:44] I guess that's good [15:14:46] did this happen on friday too? [15:14:59] it happened on all nodes [15:15:19] because now if you wait and retry in more than 10h it will fail if I understood it correctly [15:15:40] I'm less sure what is supposed to happen if you retry right away [15:15:54] elukey: did puppet fail because of the v4 issue or unrelated?
[15:15:56] volans: correct, and I mis-spoke earlier, cos switch e3 was clean, but f1 may not be :( [15:16:57] elukey: my bad - ARP issue is present on F1 [15:16:58] volans: yes same issue [15:17:46] FWIW this is how it manifests: https://phabricator.wikimedia.org/P44879 [15:17:56] Problem is entries in "mac-ip-table" but not in ARP table [15:18:00] elukey no hurry but when you have time, may wanna rejoin #wikimedia-k8s-sig . I have a non-urgent request for help there [15:18:32] inflatador: ah yes I forgot to rejoin! [15:18:51] topranks: it passed the point where it failed the previous runs :), promising! [15:18:53] elukey: I've cleared all that down, if you want to retry when you have a moment [15:19:03] dcaro: hopefully - I could use some good news! [15:19:09] topranks: running puppet now [15:21:19] topranks: did those switches change junos version since the last time a reimage was run there? [15:21:26] maybe the syntax of the commands changed? [15:21:52] volans: no definitely not [15:39:37] topranks: 1007 looks good, I'd need to reimage 1005 and 1008 now.. they are in E1 and F3, are those switches good to go? [15:39:52] let me see [15:41:29] topranks: \o/ reimaged!! [15:41:32] elukey: cleared them both now, F3 looked ok [15:41:35] dcaro: woot! [15:42:39] elukey: I'm still confused btw. I can see how the state we were in made 1007 fail for you [15:42:53] but I can't explain how you got into that state, i.e. why the initial reimages failed last week [15:44:52] topranks: I am not sure, I recall that I also cleared the switches carefully with Arzhel's commands, and they failed anyway [15:44:57] so I have probably forgotten some steps.. [15:45:23] I launched the reimages for 1005 and 1008, I'll report when they finish :) [16:24:59] topranks: I think that 1005 and 1008 are still showing the issue [16:29:08] elukey: sigh yeah I see that, same thing on 1005 [16:29:20] do you want to give it another shot and I'll monitor more closely?
[17:07:53] topranks: sorry I was afk for an errand, I can retry to run puppet if you want [17:08:07] or did you mean an entirely new run? (in case we can do it tomorrow) [17:08:37] (trying with 1008) [17:27:38] (updated the task, I manually cleared the dhcp+arp table on f3 for 1008 and puppet worked after a retry) [18:15:00] elukey: sorry I was distracted by the issue we had [18:15:25] I gather from the task it's all now working. Will need to keep an eye on it to try and find out what exactly is happening [18:15:58] but good you got them all completed :)