[10:30:44] Are we seeing any networking issues? ms-fe1012 (upgraded to bullseye yesterday; currently depooled) is marked as down on icinga (though I am ssh'd in OK), and seems to be having some problems with hostname resolution &c...
[10:31:07] (the icinga view of it has plenty of timeouts too)
[10:33:04] ip r s / ip a s output looks similar enough to ms-fe1011 (stretch), with perhaps the exception of
[10:33:15] default via fe80::a6e1:1a04:781:3a80 dev enp101s0f0np0 proto ra metric 1024 expires 597sec hoplimit 64 pref medium
[10:33:18] vs
[10:33:25] default via fe80::1 dev enp101s0f0 proto ra metric 1024 expires 586sec hoplimit 64 pref medium
[10:33:33] (default route from ip -6 r s)
[10:34:54] from ms-fe1012, ping6 ms-fe1011 says 'ping6: ms-fe1011: Temporary failure in name resolution'
[10:35:09] whereas on ms-fe1011, ping6 ms-fe1010 just works
[10:35:25] (likewise to ms-fe1012 from ms-fe1011)
[10:36:15] does look like (at least) DNS resolution isn't working
[10:36:49] ah, indeed ms-fe1012 cannot ping 10.3.0.1 (the nameserver per resolv.conf) whereas ms-fe1011 can.
[10:37:06] Any suggestions? I'm guessing this was working yesterday when it was reimaged...
[10:37:20] [since a bunch of stuff that was working yesterday isn't today]
[10:38:56] if icinga is right, things stopped working about 8 1/2 hours ago, so 02:00 UTC
[10:40:59] first sad syslog entry:
[10:41:08] Apr 12 02:14:54 ms-fe1012 rsyslogd: unexpected GnuTLS error -53 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.2102.0 try https://www.rsyslog.com/e/2078 ]
[10:41:26] and thereafter a bunch of others
[10:41:44] Krinkle: you definitely want to ask Amir1 or marostegui instead. :)
[10:56:02] Emperor: We seem to have hit our ARP-cache bug on the Juniper QFX5120 switch again :(
[10:56:21] I've cleared it now so you should be ok
[10:56:56] This pattern is similar to what's happened before: we see it happen after a reimage, but not immediately.
[10:57:24] I spent half of last week reimaging things trying to get it to happen (to take a debug trace for Juniper), and of course it did not
[10:58:32] In terms of this host, it hasn't so far re-occurred on any host once the connection was reset post-reimage, so I think ms-fe1012 should be ok from here on out, but we do need to keep an eye on it.
[10:59:00] topranks: thanks!
[10:59:16] yes, networking looks back to normal again now
[10:59:21] Krinkle: actually explicit joins are much better and if anything, it should make it better
[11:00:12] topranks: is there a thing I can try on the host to unwedge things? [I'm going to be reimaging a lot of nodes this quarter...]
[11:00:34] Yes, there are a few things.
[11:01:18] firstly, if you are doing any and can ping me, please do; I'll increase the debug level to try and get the data we need for Juniper TAC
[11:03:02] To clear the issue you can do an "ifdown <interface> && ifup <interface>" on the host.
[11:04:05] I've typically done that via the iDRAC serial console though; there is a danger of cutting off the branch you are standing on if SSH'd on to the primary IP
[11:04:27] topranks: does a reboot help? I'm asking because the reimage cookbook already does a final reboot before completion, so I was wondering how/why that reboot doesn't help
[11:04:57] The reboot should also have the same effect, yes.
[11:05:23] TBH I'm not sure of the exact sequence that triggers the fault condition; I've done a bunch of reimages and haven't managed to reproduce it.
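For reference, a minimal shell sketch of the checks walked through above, using the host names, the 10.3.0.1 resolver from /etc/resolv.conf, and the enp101s0f0np0 interface name seen on ms-fe1012 in this conversation (adjust for the host at hand):

```
# Inspect the IPv4/IPv6 default routes the host currently has.
ip -4 route show default
ip -6 route show default

# Is the recursive resolver from /etc/resolv.conf reachable at all?
ping -c 3 10.3.0.1

# Does name resolution work, both via NSS and against the resolver directly?
getent hosts ms-fe1011
host ms-fe1011 10.3.0.1

# IPv6 reachability to a known-good neighbour (the failing case above).
ping6 -c 3 ms-fe1011

# Neighbour (ARP/NDP) state on the primary interface, since the root cause
# here turned out to be a switch-side MAC/IP table problem.
ip neigh show dev enp101s0f0np0
```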
[11:05:42] branch> presumably doing it in a screen would also be good enough
[11:05:53] Once the switch gets into that state it needs the mac/ip forwarding table cleared:
[11:06:00] clear ethernet-switching mac-ip-table
[11:06:25] Bouncing the port up/down has the same effect, as it clears any reference for the MAC relating to that port.
[11:06:49] But I'm guessing in the case of a re-image the issue triggers after that final reboot somehow.
[11:07:08] I'll keep trying to catch it "when it happens", which is what I need to get the trace for Juniper.
[11:08:05] Emperor: yes, probably a screen would cover it. We can always get back in via the serial console, so the risk isn't huge.
[11:09:37] topranks: 02:15 UTC probably not that useful a time then...
[11:09:59] thanks, though, both for fixing it and for the pointers if I see more instances; I've a note to ping you as well
[11:10:08] I'm not quite A.mir, but I don't keep such regular hours... you can always try!
[11:10:51] :)
[12:41:11] Thanks Ami.r1
[15:54:30] Hi all, looking for somebody with partman knowledge to weigh in: I depooled clouddb1013.eqiad.wmnet and started to reimage it, but it was missing the netboot configuration so the installer stopped at disk partitioning to ask me what to do.
[15:54:30] My plan is to merge and deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/779488 and re-run the reimage; thoughts?
[15:59:11] razzi: how was it installed in the first place without a netboot entry?
[16:00:07] volans: I am not sure, somebody else did it; I was surprised that there was nothing in netboot
[16:01:27] razzi: be careful, that would only work if it has the same partitioning schema as production
[16:04:57] razzi: it was removed as a protective measure before the existence of the reuse recipes, see commit 115f0290fbae02570706a9781d4fe4fbf05e415c
[16:05:28] they were using partman/custom/db.cfg AFAICT; I'm not sure what the reuse-aware equivalent of that one is nowadays
[16:05:34] and what their hardware specs are
[16:05:52] also, re-introducing a globbing entry will expose the non-reimaged ones to the same potential issue
[16:06:07] (although nowadays we also have a protection at the DHCP layer)
[16:06:47] hm ok thanks for that context volans
[16:07:00] so I'd rather go for an entry that lists all specific hosts that get reimaged for now
[16:07:07] and turn on the globbing only once all are done
[16:07:14] and all have the reuse recipe
[16:07:52] (or all are verified to reimage fine with the reuse recipe and no data will be lost; no need to wait to reimage all hosts)
[16:10:27] ok cool, I made the patch only target clouddb1013: https://gerrit.wikimedia.org/r/c/operations/puppet/+/779488
[16:18:08] razzi: ack, but I don't know if custom/reuse-db.cfg is the equivalent of the old custom/db.cfg
[16:19:43] volans: I upgraded dbstore100* last week using that recipe; both hosts are mysql replicas that put their data in /srv
[16:22:36] sure, but the partitioning of the rest of the host might be different, based on the hardware
[16:23:18] replied in the CR
[16:32:59] Thanks volans; clouddb1017, which is up and pooled, serves as a mirror of clouddb1013, so the service remains available and we can restore if necessary
[16:40:53] hm volans I see `Host clouddb1013.eqiad.wmnet was not found in PuppetDB but --new was not set` now; is it appropriate to pass `--new` even though it's not exactly a new host?
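If another host gets wedged the same way, the recovery described above amounts to something like the following sketch. The interface name is the one from ms-fe1012 and is only an example; ifdown/ifup assume an ifupdown-managed interface, and the Juniper command runs on the switch, not the host:

```
# Run inside screen (or via the iDRAC serial console) so losing the primary
# IP mid-bounce doesn't strand the session.
screen -S unwedge

# Bounce the primary interface so the switch relearns the MAC/IP entry.
ifdown enp101s0f0np0 && ifup enp101s0f0np0

# Switch-side alternative (network operators only), on the QFX5120:
#   clear ethernet-switching mac-ip-table
```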
[16:42:26] razzi: yes, as it was removed from puppetdb in your first attempt
[16:42:44] cool thanks volans
[19:14:03] razzi: hey, are you working on clouddb1013?
[19:14:08] (/me got paged)
[19:16:35] razzi: nm, andrewbogott told me you are on it :), thanks!
[21:20:57] anyone know of an example box with nvme drives?
[21:24:12] jhathaway: CP hosts, for example see https://puppetboard.wikimedia.org/fact/blockdevice_nvme0n1_size
[21:24:38] volans: thanks!
[21:24:43] and similarly named facts
[21:24:51] from https://puppetboard.wikimedia.org/facts
[21:25:10] at least that's the quickest way I could think of to look, but there might be others :)
[21:27:33] not that it's perfect; I just needed a box to test out the lsblk -S/--scsi switch, which indeed does not show nvme drives
[21:28:33] I did search for nvme in my inbox and got a comment from a gerrit patch:
[21:28:34] # lsblk will report "nvme0n1", but smartctl wants the base "nvme0" device
[21:28:39] not sure if that can be useful :D
[21:34:54] oh, it is, that is where I am in the smart-mon code ;)
[21:35:41] that was https://gerrit.wikimedia.org/r/c/operations/puppet/+/588515 fwiw
[21:49:52] This is probably old news by now, but it looks like Puppet got sold: https://www.zdnet.com/article/perforce-acquires-devops-power-puppet/
[21:52:21] * bd808 assumes that no ownership changes could make Puppet more annoying to work with and goes on about his business
[21:53:58] optimist!
[21:57:50] I notice clouddb1020 is marked as "failed" on netbox, but in reality it's running fine. Is it ok to edit it to "active" manually in the web ui, or is there a different process for this?
[21:58:11] razzi: go for it if it's actually active IRL
[21:58:24] you can see the changelog
[21:58:29] to see when/who did the change
[21:58:31] for context
[21:59:00] razzi: https://netbox.wikimedia.org/extras/changelog/66303/
[21:59:34] that seems related to T291961
[21:59:35] T291961: clouddb1020 crash - https://phabricator.wikimedia.org/T291961
[22:04:33] volans: Cool, good to know about the history of that status. The host has been up since the month after that incident; I'll update netbox
[22:04:56] thx
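On the lsblk/smartctl point above: following the quoted gerrit comment ("lsblk will report nvme0n1, but smartctl wants the base nvme0 device"), a hedged sketch of how one might derive that base device in shell; it is illustrative only, not the actual smart-mon code:

```
# lsblk -S/--scsi lists only SCSI devices, so NVMe drives won't appear there;
# list whole disks instead and pick out the NVMe ones.
for dev in $(lsblk -d -n -o NAME | grep '^nvme'); do
    # Strip the namespace suffix: "nvme0n1" -> "nvme0".
    base="/dev/${dev%n*}"
    echo "block device /dev/$dev -> smartctl device $base"
    # e.g. smartctl -H "$base"
done
```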