[11:20:08] hey ho infra team... I hit on a bit of an odd one here
[11:20:26] so I was re-testing the puppetdb import to netbox script
[11:20:34] and trying it on netbox-next with ganeti5005
[11:21:17] prior to running, the device showed the initial state after debian installer, before puppet changed the network setup
[11:21:21] https://usercontent.irccloud-cdn.com/file/6w7eVcd6/image.png
[11:21:53] the script worked as expected, and pulled in the data from puppetdb, moved the primary IPv4 to the 'private' bridge device matching what is set up on the host
[11:22:11] https://usercontent.irccloud-cdn.com/file/GGvSgPPe/image.png
[11:22:37] The thing you'll notice, however, is there is still a v6 IP on the physical interface
[11:23:03] The actual server does not have this IP configured on the 'private' bridge, which is why the script hasn't pulled it in from puppetdb and moved it to that
[11:23:16] So I guess the question is
[11:23:40] our server provisioning assigned this IP, 2001:df2:e500:101:10:132:0:8/64
[11:23:54] but the server isn't using it and it's not configured anywhere
[11:24:02] so what should we do for it in Netbox?
[11:24:20] do we record it's assigned to the device somehow? do we delete it???
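(Editorial aside: the address being asked about looks like it is built from the host's IPv4 address, since the DNS output later in the log shows ganeti5005 at 10.132.0.8 and the last four groups of the v6 address read 10:132:0:8. Below is a minimal Python sketch of that pattern, assuming the decimal octets are simply written into the low four groups of a /64 prefix. The function name `mapped_ipv6` is hypothetical and is not taken from the provisioning or puppet code.)

```python
import ipaddress


def mapped_ipv6(ipv4: str, v6_prefix: str) -> ipaddress.IPv6Interface:
    """Hypothetical helper: embed the IPv4 octets in the low 64 bits of a /64.

    Assumption (not confirmed by the log): each decimal octet is written
    unchanged into one hex group, so 10.132.0.8 becomes ...:10:132:0:8.
    """
    ipaddress.IPv4Address(ipv4)  # raises ValueError if the input is not valid IPv4
    network = ipaddress.IPv6Network(v6_prefix)
    if network.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")

    # First four groups come from the prefix, last four from the IPv4 octets.
    prefix_groups = network.network_address.exploded.split(":")[:4]
    host_groups = ipv4.split(".")
    return ipaddress.IPv6Interface(":".join(prefix_groups + host_groups) + "/64")


if __name__ == "__main__":
    print(mapped_ipv6("10.132.0.8", "2001:df2:e500:101::/64"))
```

With the prefix and IPv4 address from this log, the sketch yields the same address that provisioning assigned, which is consistent with the "mapped" explanation given in the replies.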
[11:37:02] interesting question
[11:38:17] yeah - it kind of cuts to the whole "source of truth" thing
[11:38:48] and is a symptom of the wider issue that our host interface config isn't fully pushed from netbox
[11:39:05] so we always issue dual stack v4+v6
[11:39:11] topranks: that's the address the server gets if we set `profile::base::production::enable_ip6_mapped: true`
[11:39:14] even if the host doesn't use it, d-i does assign it to the host
[11:39:27] sorry post-d-i
[11:39:27] ideally we'd set up the interfaces on the host appropriately at the "server provision" stage, and the host config would get everything it needed from netbox
[11:40:05] jbond: ok, so I'm guessing we've probably a bunch of servers with that set to false
[11:40:21] and hence no IPv6 configured, but we do have one listed for the int in netbox
[11:40:21] right now it has a SLAAC address based on the MAC, so it works with v6, and in theory it should be safe to set that to true
[11:40:26] hieradata/role/common/ganeti.yaml:profile::base::production::enable_ip6_mapped: false
[11:40:29] hieradata/role/common/ganeti_test.yaml:profile::base::production::enable_ip6_mapped: false
[11:40:46] there is a task; there may be some notes about why ganeti doesn't have it
[11:41:05] volans: yeah we always assign
[11:41:10] and I guess some hosts don't use / configure it
[11:41:27] but complicated here cos ganeti creates the bridge and the IPs get moved to that
[11:42:05] it's not really a disaster if we leave it working like this - with the orphan IP on the physical
[11:42:23] with a view to changing it down the road when/if we move to fully drive host networking from netbox
[11:42:30] topranks: we should try to move everything to use the mapped address, but it's not always been the case, so there is some fear with some services
[11:42:38] topranks: there is another layer, the AAAA records
[11:42:39] however it seems ganeti used to have it at one stage
https://gerrit.wikimedia.org/r/c/operations/puppet/+/531227/2/manifests/site.pp
[11:42:45] some services don't have it because they are not v6 ready
[11:42:50] volans: the reverses?
[11:42:57] no the direct too
[11:42:57] ok yep
[11:43:19] see T253173
[11:43:19] T253173: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK) - https://phabricator.wikimedia.org/T253173
[11:43:46] volans: ok, but that doesn't usually include the interface right?
[11:44:03] so dns wouldn't change whether IP was associated with the correct int or not
[11:44:27] it obviously *would* be something to consider in terms of whether we should have the IP allocated in the first place if not used
[11:46:30] topranks: see here for why it's disabled on ganeti T233906
[11:46:31] T233906: Broken network connection on ganeti2001 after reboot - https://phabricator.wikimedia.org/T233906
[11:47:31] * jbond just noticed there's a comment in the hiera file with that info, doh!
[11:49:08] jbond: haven't dug into it fully, but seems related to T320429
[11:49:10] T320429: Bug in bridge-utils breaks IPv6 on interface if its not part of a bridge but vlan sub-int of it is - https://phabricator.wikimedia.org/T320429
[11:49:36] I guess my real query here isn't about why IPv6 might be disabled, but what netbox should show in that case
[11:49:42] I'd say we should allocate it in netbox.
in an ideal world we would toipindeed looks similar
[11:49:52] current precedent seems to be to leave assigned, and I think that does make sense
[11:49:58] sorry forgot i was halfway through a sentence ignore that
[11:50:05] but yes T320429 seems like the same issue
[11:50:25] yes i think leave assigned as ideally we will solve this issue and it will get an ipv6
[11:50:38] *will get *that* ipv6 address
[11:51:01] if we enabled v6 on ganeti the IP would be on the 'private' int, and I suppose what I'm getting at here is that when it's disabled it shows on the physical int
[11:51:34] thinking about it / with this discussion I'm inclined not to try to add any logic, and just leave the _assigned_ IP on the physical, even though if it's used it will appear on the bridge
[11:51:58] sgtm
[11:51:59] doing anything else is not really "importing from puppetdb"
[11:52:17] and removing because it's not in puppetdb would remove the allocation/IP completely in netbox
[12:33:46] there is an issue there though:
[12:33:46] ganeti5005.eqsin.wmnet has address 10.132.0.8
[12:33:46] ganeti5005.eqsin.wmnet has IPv6 address 2001:df2:e500:101:10:132:0:8
[12:34:04] but the v6 IP isn't live
[12:34:38] so that should probably be fixed by removing the dns name from the IP
[12:34:50] and ideally setting the IP as "reserved"
[12:35:26] another issue is that it still has a v6 SLAAC IP on the private vlan:
[12:35:26] private: mtu 1500 state UP qlen 1000
[12:35:26] inet6 2001:df2:e500:101:b49b:91ff:fe9f:92ca/64
[12:37:34] so we should go fully one way (remove v6) or the other (fix it)
[12:46:47] SRE-tools, Infrastructure-Foundations: sre.hosts.provision cookbook: check for both default and wmf password - https://phabricator.wikimedia.org/T333554 (ayounsi) p:Triage→Low
[12:51:07] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (ayounsi) How does this compare to
taking iBGP down between LEAF1 to SPINE2 if the link goes down?
[12:51:13] XioNoX: In general fully agree on that
[12:51:32] might be a scope creep compared to what you're trying to do though :)
[12:51:44] I'm not sure detecting those discrepancies and making the changes should be part of the puppet import script though
[12:52:13] yeah, for right now I won't include it for the sake of getting the current task completed
[12:52:23] happy to look at it after though no probs
[12:52:31] sgtm!
[12:53:51] one idea - which might not be too complicated in terms of where the logic exists
[12:54:09] is to go through all the IPs attached to device interfaces in netbox when running the puppetdb import
[12:54:18] and set any that aren't in puppetdb to "reserved" ?
[12:54:32] and maybe remove dns?
[12:55:38] could be yeah, if it's not too time consuming you can do a one off run to see what the differences would be infra wide
[12:56:06] I'll leave it for now, the code would go into the 'ip address' part of the import script, which isn't something I've touched
[12:56:08] I wouldn't be surprised if there are some edge cases, but I can't think of any
[12:56:20] so I think a separate patch to review would make more sense anyway
[12:56:34] but yep I'll have a look at it when the parent int stuff is done
[13:54:05] hi folks
[13:54:33] we recently reimaged an LVS box to bullseye and ran into the interface name limit restriction, so we disabled the legacy vlan naming and the new ones look like: https://netbox.wikimedia.org/dcim/devices/4474/interfaces/
[13:54:53] in this case of the reimage though, I had to manually set the vlan1201 interface as virtual and set the parent to enp175s0f0np0
[13:55:15] we will be reimaging the same host to bullseye, so I was wondering if I should delete these interfaces first and then do the reimage or should the reimage pick it up?
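(Editorial aside on the 12:53-12:56 idea: walking the IPs attached to a device in Netbox and marking any that puppetdb does not report as "reserved", with the DNS name removed, could be sketched roughly as below. This is plain Python over in-memory records, not the import script's code and not a Netbox API client. The `NetboxIP` record and the `reconcile` function are hypothetical names, and the IPv4 prefix length in the example is a placeholder.)

```python
import ipaddress
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NetboxIP:
    """Hypothetical stand-in for an IP address record attached to an interface."""
    address: str   # address with prefix length, e.g. "2001:db8::1/64"
    status: str    # e.g. "active" or "reserved"
    dns_name: str


def reconcile(netbox_ips, puppetdb_addresses):
    """Return updated records: IPs not seen in puppetdb become reserved, DNS name cleared."""
    live = {ipaddress.ip_address(a) for a in puppetdb_addresses}
    updated = []
    for ip in netbox_ips:
        host = ipaddress.ip_interface(ip.address).ip
        if host in live:
            updated.append(ip)  # configured on the host, leave untouched
        else:
            updated.append(replace(ip, status="reserved", dns_name=""))
    return updated


if __name__ == "__main__":
    records = [
        # The /22 here is a placeholder; the log does not give the v4 prefix length.
        NetboxIP("10.132.0.8/22", "active", "ganeti5005.eqsin.wmnet"),
        NetboxIP("2001:df2:e500:101:10:132:0:8/64", "active", "ganeti5005.eqsin.wmnet"),
    ]
    for record in reconcile(records, ["10.132.0.8"]):
        print(record)
```

Returning new records instead of writing changes directly would make the suggested one-off, infra-wide dry run straightforward: diff the input and output lists and review the differences before applying anything.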
[13:55:23] (I've already replied that in theory it should work, but these are LVSes, so better double check, and top.ranks might be more informed on this)
[14:03:32] I guess the easiest way to try out is to delete the interfaces and let the cookbook populate them
[14:03:43] but I have come to the realization that with Netbox, deleting is a one-way street so :)
[14:37:03] fwiw, I am attempting a reimage with the deleted virtual interface and will report back
[14:37:32] if it doesn't work, I have a pinned tab with the old information that I will just use to recreate the interface if the cookbook doesn't
[14:39:51] ack
[14:49:28] so I had deleted the virtual interface prior to running the cookbook
[14:49:31] after, it created one but two things:
[14:49:32] https://netbox.wikimedia.org/dcim/devices/4474/interfaces/
[14:49:51] 1) it didn't set it as virtual, as it usually does *I think* or maybe toprank.s did that manually, I don't think so though
[14:49:59] 2) it didn't set the parent for the virtual
[14:50:02] other than that, no issues
[14:52:47] fwiw, in the link above, I manually added it, it wasn't there before (just in case someone opens the link and sees that it is there :)
[14:55:14] lol
[14:55:30] thx for reporting back
[14:57:51] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (cmooney) >>! In T332781#8741660, @ayounsi wrote: > How does this compare to taking iBGP down between LEAF1 to SPINE2 if the link g...
[15:35:18] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (aborrero) Sounds good to me. This is what we need to do with cloudcontrol2004-dev: * figure out how to...
[15:35:36] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) Third batch |Host|U space|Existing port|New port| |cloudcephosd2001-dev|3|asw-b1-codfw ge-1/0/...
[15:46:10] sukhe: I've a patch to change the puppetdb import script to work properly in your scenario
[15:46:27] https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/822439/
[15:46:41] so hopefully by the time you do next one it won't be an issue
[15:49:35] oh wow! thanks topranks <3
[15:50:30] it was a simple fix manually but of course this is better
[15:50:33] I ran the import on netbox-next where that code is live for testing, seems to be ok:
[15:50:33] https://netbox-next.wikimedia.org/dcim/devices/4474/interfaces/
[15:50:35] much appreciated!
[15:51:12] the good thing is with that data we can report on any switch-side interfaces that don't have the right vlans
[15:51:28] instead of you guys finding out randomly after some time like the last ones b.black hit
[15:52:52] :)
[15:53:09] I remember you mentioned this script once but I guess it's been a while since we talked about LVS
[15:53:19] this only came up yesterday as we started the bullseye reimage
[16:43:23] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) All remaining (non public-vlan) hosts have been moved and look good to me (reachable, MAC addr...
[20:48:18] SRE-tools, netbox, DNS, Infrastructure-Foundations, Traffic: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (BCornwall)
[21:33:50] SRE-tools, netbox, DNS, Infrastructure-Foundations, Traffic: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (Volans) Correct, and we've already the first validators in netbox-next that will be released to prod shortly so this can b...