[10:30:44] Are we seeing any networking issues? ms-fe1012 (upgraded to bullseye yesterday; currently depooled) is marked as down on icinga (though I am ssh'd in OK), and seems to be having some problems with hostname resolution &c...
[10:31:07] (the icinga view of it has plenty of timeouts too)
[10:33:04] ip r s / ip a s output looks similar enough to ms-fe1011 (stretch), with perhaps the exception of
[10:33:15] default via fe80::a6e1:1a04:781:3a80 dev enp101s0f0np0 proto ra metric 1024 expires 597sec hoplimit 64 pref medium
[10:33:18] vs
[10:33:25] default via fe80::1 dev enp101s0f0 proto ra metric 1024 expires 586sec hoplimit 64 pref medium
[10:33:33] (default route from ip -6 r s)
[10:34:54] from ms-fe1012, ping6 ms-fe1011 says 'ping6: ms-fe1011: Temporary failure in name resolution'
[10:35:09] whereas on ms-fe1011, ping6 ms-fe1010 just works
[10:35:25] (likewise to ms-fe1012 from ms-fe1011)
[10:36:15] does look like (at least) DNS resolution isn't working
[10:36:49] ah, indeed ms-fe1012 cannot ping 10.3.0.1 (the nameserver per resolv.conf) whereas ms-fe1011 can.
[10:37:06] Any suggestions? I'm guessing this was working yesterday when it was reimaged...
[10:37:20] [since a bunch of stuff that was working yesterday isn't today]
[10:38:56] if icinga is right, things stopped working about 8 1/2 hours ago, so 02:00 UTC
[10:40:59] first sad syslog entry:
[10:41:08] Apr 12 02:14:54 ms-fe1012 rsyslogd: unexpected GnuTLS error -53 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.2102.0 try https://www.rsyslog.com/e/2078 ]
[10:41:26] and thereafter a bunch of others
[10:41:44] Krinkle: you definitely want to ask Amir1 or marostegui instead. :)
[10:56:02] Emperor: We seem to have hit our ARP-cache bug on the Juniper QFX5120 switch again :(
[10:56:21] I've cleared it now so you should be ok
[10:56:56] This pattern is similar to what's happened before: we see it happen after a reimage, but not immediately.
[10:57:24] I spent half of last week reimaging things trying to get it to happen (to take a debug trace for Juniper), and of course it did not
[10:58:32] In terms of this host, it hasn't so far re-occurred on any host once the connection was reset post-reimage, so I think ms-fe1012 should be ok from here on out, but we do need to keep an eye on it.
[10:59:00] topranks: thanks!
[10:59:16] yes, networking looks back to normal again now
[10:59:21] Krinkle: actually explicit joins are much better and if anything, it should make it better
[11:00:12] topranks: is there a thing I can try on the host to unwedge things? [I'm going to be reimaging a lot of nodes this quarter...]
[11:00:34] Yes, there are a few things.
[11:01:18] firstly, if you are doing any and can ping me, please do; I'll increase the debug level to try and get the data we need for Juniper TAC
[11:03:02] To clear the issue you can do an "ifdown <interface> && ifup <interface>" on the host.
[11:04:05] I've typically done that via the iDRAC serial console though; there is a danger of cutting off the branch you are standing on if SSH'd on to the primary IP
[11:04:27] topranks: does a reboot help? I'm asking because the reimage cookbook already does a final reboot before completion, so I was wondering how/why that reboot doesn't help
[11:04:57] The reboot should also have the same effect, yes.
[11:05:23] TBH I'm not sure of the exact sequence that triggers the fault condition; I've done a bunch of reimages and haven't managed to reproduce it.
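For reference, a minimal shell sketch of the checks walked through above, using the host names, the 10.3.0.1 resolver from /etc/resolv.conf, and the enp101s0f0np0 interface name seen on ms-fe1012 in this conversation (adjust for the host at hand):

```
# Inspect the IPv4/IPv6 default routes the host currently has.
ip -4 route show default
ip -6 route show default

# Is the recursive resolver from /etc/resolv.conf reachable at all?
ping -c 3 10.3.0.1

# Does name resolution work, both via NSS and against the resolver directly?
getent hosts ms-fe1011
host ms-fe1011 10.3.0.1

# IPv6 reachability to a known-good neighbour (the failing case above).
ping6 -c 3 ms-fe1011

# Neighbour (ARP/NDP) state on the primary interface, since the root cause
# here turned out to be a switch-side MAC/IP table problem.
ip neigh show dev enp101s0f0np0
```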
[11:05:42] branch> presumably doing it in a screen would also be good enough
[11:05:53] Once the switch gets into that state it needs the mac/ip forwarding table cleared:
[11:06:00] clear ethernet-switching mac-ip-table
[11:06:25] Bouncing the port up/down has the same effect, as it clears any reference for the MAC relating to that port.
[11:06:49] But I'm guessing in the case of a re-image the issue triggers after that final reboot somehow.
[11:07:08] I'll keep trying to catch it "when it happens", which is what I need to get the trace for Juniper.
[11:08:05] Emperor: yes, probably a screen would cover it. We can always get back in via the serial console, so the risk isn't huge.
[11:09:37] topranks: 02:15 UTC probably not that useful a time then...
[11:09:59] thanks, though, both for fixing it and for the pointers if I see more instances; I've a note to ping you as well
[11:10:08] I'm not quite A.mir, but I don't keep such regular hours... you can always try!
[11:10:51] :)
[12:41:11] Thanks Ami.r1
[15:54:30] Hi all, looking for somebody with partman knowledge to weigh in: I depooled clouddb1013.eqiad.wmnet and started to reimage it, but it was missing the netboot configuration so the installer stopped at disk partitioning to ask me what to do.
[15:54:30] My plan is to merge and deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/779488 and re-run the reimage; thoughts?
[15:59:11] razzi: how was it installed in the first place without a netboot entry?
[16:00:07] volans: I am not sure, somebody else did it; I was surprised that there was nothing in netboot
[16:01:27] razzi: be careful, that would only work if it has the same partitioning schema as production
[16:04:57] razzi: it was removed as a protective measure before the existence of the reuse recipes, see commit 115f0290fbae02570706a9781d4fe4fbf05e415c
[16:05:28] they were using partman/custom/db.cfg AFAICT; I'm not sure what the reuse-aware equivalent of that one is nowadays
[16:05:34] and what their hardware specs are
[16:05:52] also, re-introducing a globbing entry will expose the non-reimaged ones to the same potential issue
[16:06:07] (although nowadays we also have a protection at the DHCP layer)
[16:06:47] hm ok thanks for that context volans
[16:07:00] so I'd rather go for an entry that lists all specific hosts that get reimaged for now
[16:07:07] and turn on the globbing only once all are done
[16:07:14] and all have the reuse recipe
[16:07:52] (or all are verified to reimage fine with the reuse recipe and no data will be lost; no need to wait to reimage all hosts)
[16:10:27] ok cool, I made the patch only target clouddb1013: https://gerrit.wikimedia.org/r/c/operations/puppet/+/779488
[16:18:08] razzi: ack, but I don't know if custom/reuse-db.cfg is the equivalent of the old custom/db.cfg
[16:19:43] volans: I upgraded dbstore100* last week using that recipe; both hosts are mysql replicas that put their data in /srv
[16:22:36] sure, but the partitioning of the rest of the host might be different, based on the hardware
[16:23:18] replied in the CR
[16:32:59] Thanks volans; clouddb1017, which is up and pooled, serves as a mirror of clouddb1013, so the service remains available and we can restore if necessary
[16:40:53] hm volans I see `Host clouddb1013.eqiad.wmnet was not found in PuppetDB but --new was not set` now; is it appropriate to pass `--new` even though it's not exactly a new host?
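If another host gets wedged the same way, the recovery described above amounts to something like the following sketch. The interface name is the one from ms-fe1012 and is only an example; ifdown/ifup assume an ifupdown-managed interface, and the Juniper command runs on the switch, not the host:

```
# Run inside screen (or via the iDRAC serial console) so losing the primary
# IP mid-bounce doesn't strand the session.
screen -S unwedge

# Bounce the primary interface so the switch relearns the MAC/IP entry.
ifdown enp101s0f0np0 && ifup enp101s0f0np0

# Switch-side alternative (network operators only), on the QFX5120:
#   clear ethernet-switching mac-ip-table
```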
[16:42:26] razzi: yes, as it was removed from puppetdb in your first attempt
[16:42:44] cool thanks volans
[19:14:03] razzi: hey, are you working on clouddb1013?
[19:14:08] (/me got paged)
[19:16:35] razzi: nm, andrewbogott told me you are on it :), thanks!
[21:20:57] anyone know of an example box with nvme drives?
[21:24:12] jhathaway: CP hosts, for example see https://puppetboard.wikimedia.org/fact/blockdevice_nvme0n1_size
[21:24:38] volans: thanks!
[21:24:43] and similarly named facts
[21:24:51] from https://puppetboard.wikimedia.org/facts
[21:25:10] at least that's the quickest way I could think of to look, but there might be others :)
[21:27:33] not that it's perfect; I just needed a box to test out the lsblk -S/--scsi switch, which indeed does not show nvme drives
[21:28:33] I did search for nvme in my inbox and got a comment from a gerrit patch:
[21:28:34] # lsblk will report "nvme0n1", but smartctl wants the base "nvme0" device
[21:28:39] not sure if that can be useful :D
[21:34:54] oh, it is, that is where I am in the smart-mon code ;)
[21:35:41] that was https://gerrit.wikimedia.org/r/c/operations/puppet/+/588515 fwiw
[21:49:52] This is probably old news by now, but it looks like Puppet got sold: https://www.zdnet.com/article/perforce-acquires-devops-power-puppet/
[21:52:21] * bd808 assumes that no ownership changes could make Puppet more annoying to work with and goes on about his business
[21:53:58] optimist!
[21:57:50] I notice clouddb1020 is marked as "failed" on netbox, but in reality it's running fine. Is it ok to edit it to "active" manually in the web ui, or is there a different process for this?
[21:58:11] razzi: go for it if it's actually active IRL
[21:58:24] you can see the changelog
[21:58:29] to see when/who did the change
[21:58:31] for context
[21:59:00] razzi: https://netbox.wikimedia.org/extras/changelog/66303/
[21:59:34] that seems related to T291961
[21:59:35] T291961: clouddb1020 crash - https://phabricator.wikimedia.org/T291961
[22:04:33] volans: Cool, good to know about the history of that status. The host has been up since the month after that incident; I'll update netbox
[22:04:56] thx
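On the lsblk/smartctl point above: following the quoted gerrit comment ("lsblk will report nvme0n1, but smartctl wants the base nvme0 device"), a hedged sketch of how one might derive that base device in shell; it is illustrative only, not the actual smart-mon code:

```
# lsblk -S/--scsi lists only SCSI devices, so NVMe drives won't appear there;
# list whole disks instead and pick out the NVMe ones.
for dev in $(lsblk -d -n -o NAME | grep '^nvme'); do
    # Strip the namespace suffix: "nvme0n1" -> "nvme0".
    base="/dev/${dev%n*}"
    echo "block device /dev/$dev -> smartctl device $base"
    # e.g. smartctl -H "$base"
done
```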