[09:06:48] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452 (ayounsi) I tested the DNS generation script by: Manually creating `/etc/netbox/dns.cfg` Adding `127.0.0.1 netbox-next.wikimedia.org` to `/etc/hosts` Then running ` netbox-de...
[09:30:06] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452 (ayounsi) On ganeti-netbox-sync, not sure if the bug I found is relevant to 3.2. I manually added the following to `/etc/netbox/ganeti-sync.cfg` ` [profile:ulsfo] api=https://ganet...
[09:41:15] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (akosiaris) >>! In T306649#7916593, @cmooney wrote: >> If there is any kind of anycast with the k8s prefixes (same prefix adverti...
[11:35:06] XioNoX, topranks: could one of you please run "sudo keyholder arm" on cumin2002? I don't have access to the homer passphrase in pwstore
[11:36:47] moritzm: yep let me give it a try
[11:37:36] thx, it's in pwstore under homer-key-passphrase
[11:37:46] hmm... seems I may not have access to it either, gpg failing when I try with "decryption failed: No secret key"
[11:38:14] that sounds like a failing ssh agent, maybe kill/restart it so that it freshly queries your passphrase?
[11:38:39] I'm not sure it is, decrypting the 'network-root' to try immediately afterwards it queried the passphrase again
[11:39:07] I'm doing this here on my local machine directly, no ssh involved.
[11:39:42] ah, ok
[11:40:07] let's wait for Arzhel then, maybe the ACL (which I can't check either) applies to his UID rather than the netops group
[11:40:15] looking
[11:40:51] access: @netops
[11:41:21] strange, Cathal is in the netops group
[11:41:40] could you please rearm the keyholder on cumin2002 in the mean time?
[11:41:47] (done)
[11:42:15] moritzm: maybe the file didn't get re-encrypted since Cathal joined?
[11:44:31] oh indeed, that's the issue. I re-encrypt all files whenever someone new starts, but ofc I cannot re-encrypt files I don't have access to myself
[11:45:01] can you run "pws rc homer-key-passphrase" and commit the updated file?
[11:45:04] then it should work
[11:45:18] moritzm: done
[11:45:35] topranks: can you pull the new file and try again?
[11:46:01] super yep that's working now!
[11:46:17] Is it possible to do this also with the "snmp-community" file ?
[11:46:36] I don't have access to that, but when I've needed it I just copy from plaintext on router cli ;)
[11:46:39] let me gather a list of other files which might need the same
[11:47:13] I believe I've the other stuff needed, network-root, management-scs
[11:47:18] ./pws ed snmp-community
[11:47:18] snmp-community is probably not readable
[11:47:23] Snmp one is the only one I had an issue with previously
[11:47:25] yeah I never used that file
[11:47:36] ok no probs, it's never caused me any issue like I said I can get it elsewhere
[11:49:56] ARIN, gtt, management-scs, network-root, snmp-community and Wikimedia-ARIN-RPKI.key are the files I can't re-encrypt
[11:50:59] moritzm: thanks
[11:51:21] I've access to ARIN and GTT portals, but the RPKI key probably I may need sometime, and I guess in general best to have all of them
[11:51:27] (not knowing exactly what they contain)
[11:52:32] I pushed a re-encrypt of all of those files (except snmp-community)
[11:53:31] great thanks... all now working :)
[11:55:10] awesome!
[12:18:47] I'm trying to reimage ganeti4001 to bullseye, but it's not getting a DHCP response in d-i, there might be something off with the setup of the dhcp 82 stuff? not sure if we have done OS reimages in ulsfo since that was enabled?
[12:19:55] moritzm: VMs don't use option 82
[12:20:32] er, nevermind
[12:21:03] yeah in theory it should just work
[12:23:15] let me know when you're re-running it so I can check what's going on
[12:23:51] I can re-run it now, I'm on the serial console and can attempt network config again
[12:24:02] moritzm: sounds good!
[12:24:11] it's trying now
[12:26:01] moritzm: did the cookbook say which DHCP server it's configuring?
[12:26:29] I don't see the option 82 snippet on install4001
[12:26:53] let me check whether I can see that in the logs
[12:27:19] I don't see anything on tcpdump either
[12:30:28] per the logs it added a config to /etc/dhcp/automation/ttyS1-115200/ganeti4001.conf on install4001
[12:30:36] but that file is no longer around it seems
[12:30:55] I can retry the cookbook, maybe there was a race of sorts?
[12:31:00] sure
[12:31:15] ack, I'm trying that now
[12:33:28] let me know when the file is created (or should have)
[12:34:34] (I have to step away in 5min for an errand)
[12:47:51] ack, it was now created on install4001, following along over the serial console if the installation now works fine
[12:53:17] yeah, so the dhcp config file is now present on install4001, but DHCP network config is still failing in d-i, let's catch up when you're back
[13:11:57] moritzm: I'm back
[13:13:43] moritzm: it fails to get DHCP in D-I or in PXE?
[13:13:51] d-i
[13:14:47] for some reason /etc/dhcp/automation/ttyS1-115200/ganeti4001.conf is now gone from install4001, maybe it got expired...
[13:16:05] moritzm: the cookbook creates it and removes it
[13:16:23] so that works fine, option 82 (and that file) are only used for PXE
[13:18:01] https://www.irccloud.com/pastebin/mf8kWiTD/
[13:18:08] ok, trying to check d-i if there's some other cue
[13:19:58] moritzm: you're still in d-i? can you re-run a dhcp request?
[13:20:24] sure, triggering one now
[13:20:38] it's requesting
[13:20:55] not seeing any dhcp traffic on install4001
[13:22:04] xe-1/0/9 up down ganeti4001 {#1052}
[13:22:10] nic is down?
[13:22:16] driver issue maybe?
[13:22:35] having a look at dmesg
[13:25:19] tg3 seems to be loading/loaded just fine, can also rmmod and modprobe it again
[13:26:59] also doesn't appear to be some sort of firmware issue, the logs state "no missing firmware in loaded kernel modules"
[13:28:49] can you check the NIC's status?
[13:29:09] could it try to boot on a different nic?
[13:31:54] yeah, I'll try configuring the designated IP address by hand next
[13:39:03] if the NIC (with MAC b0:26:28:3a:f2:a0) is down, it won't help to set a manual IP
[13:45:16] * volans catching up backlog
[13:45:19] an "ip link set dev enp175s0f0np0 up" (which is the interface for b0:26:28:3a:f2:a0) works fine, does the state on the switch change with that?
[13:48:13] moritzm: yeah it's up now
[13:48:28] can you re-run dhcp?
[13:48:42] weird that the file is not there anymore, as the reimage cookbook removes it only after d-i has completed
[13:48:47] (the dhcp one)
[13:49:03] volans: or times out?
[13:49:35] sure, if the cookbook fails it gets GCed
[13:52:47] DHCP triggered from d-i still fails, but in the mean time the 120 retries were attempted in the cookbook, so that made it roll back
[13:52:58] I'll try to re-run the cookbook
[13:53:35] hm, I'm still not seeing anything on install4001
[13:53:58] so for some reason dhcp packets are not making it out of d-i
[13:54:25] or are lost in juniper-land :D
[13:55:07] I've kicked off a new cookbook run
[13:56:18] volans: let's see, I started a packet capture on the router (dhcp relay) too
[13:56:19] is "asw2-ulsfo:xe-1/0/9.0:private1-ulsfo" correct?
[13:56:28] yep
[13:56:37] switch hostname : port : vlans (for context for others)
[13:56:41] volans: PXE works fine as it gets to d-i
[13:56:46] so option 82 works fine
[13:56:49] right
[13:58:49] moritzm: let me know when it gets into d-i
[13:58:53] looks like PXE completed
[13:59:22] ack
[13:59:43] (meeting but will keep an eye on it)
[14:01:59] it's now in d-i and DHCP is failing there
[14:02:15] moritzm: from install_console?
[14:02:41] volans: via serial console, it's still in d-i
[14:02:53] ok, not seeing any DHCP traffic on the router interface either
[14:03:13] sorry, very stupid question... if d-i is failing dhcp... couldn't possibly be install_console LOL
[14:03:36] is the NIC shown as down on the switch again, shall I re-run the "ip link set dev enp175s0f0np0 up" ?
[14:03:37] so I'd bet something is wrong with the NIC
[14:04:12] xe-1/0/9 up down ganeti4001 {#1052}
[14:04:13] yep
[14:04:33] enp175s0f0np0... what a wonderful name
[14:04:43] I'll set it up and then trigger a DHCP re-run
[14:04:51] Elon Musk's next kid?
[14:04:58] lol
[14:06:42] still failing, maybe the NIC is in fact botched
[14:07:00] nic still down
[14:07:07] there's a fourth new Ganeti node which I had been meaning to add to the cluster as part of the OS update anyway
[14:07:21] so I'll add that and then re-attempt to reimage ganeti4002
[14:07:32] and defer 4001 to dc ops
[14:07:57] sgtm!
[14:08:02] oh, it's under warranty for four more days :-)
[14:08:10] Purchase date 2019-05-15
[14:08:23] might be worth upgrading firmwares, checking bios etc too
[14:08:37] I'll ping Rob to hurry with a possible warranty case :-)
[14:08:59] yeah, that's also a good idea, will ask to include that
[14:21:27] lol for the 4 days warranty
[14:26:39] Someone didn't do their job right and it broke 4 days early :-)
[14:34:49] SRE-tools, DC-Ops, Infrastructure-Foundations: sre.hosts.reimage: wait reboot time timeout on aqs nodes - https://phabricator.wikimedia.org/T307260 (Papaul) @Volans the only reason i see is the size of the disks and number of disks. We are using software RAID on 8x ~2TB disks
[14:44:09] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452 (Volans) >>! In T296452#7920080, @ayounsi wrote: > `/srv/netbox-exports/dns.git` doesn't exist as expected, and the DNS generation went fine. Yes, but you don't know if the generat...
[14:46:31] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452 (Volans) >>! In T296452#7920155, @ayounsi wrote: > Then manually ran the "import from puppetDB Netbox script for bast4003 (not sure if that should have been automated or not). That...
[15:04:45] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (cmooney) > Even in the legacy setup (pre row e/f) adding new nodes requires manual error-prone gerrit changes like this one 35b0...
[15:24:12] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452 (ayounsi) >>! In T296452#7921268, @Volans wrote: >>>! In T296452#7920080, @ayounsi wrote: >> `/srv/netbox-exports/dns.git` doesn't exist as expected, and the DNS generation went fin...