[10:57:59] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (cmooney)
[10:59:28] netops, Infrastructure-Foundations, SRE, ops-codfw: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (cmooney) >>! In T331470#8674700, @Jhancock.wm wrote: > I've made the patches with some changes. Port 46 on cloudsw1-b1-codfw is already configured...
[11:49:05] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney)
[11:49:14] netops, Infrastructure-Foundations, SRE, ops-codfw: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (cmooney) Resolved→Open @JHancock.wm my apologies errors abound on this one. I just realised that on the QFX5120 platform we can't mix and...
[11:51:29] Hi folks - going through the vendor maintenance mailbox (clinic duty), there's a mail about Equinix Portal Login Changes (also went to noc@); is that already on your radar, or would you like a phab ticket or similar about it?
[11:57:18] Emperor: let me check the mail
[11:57:22] thanks for the heads up
[11:59:12] NP :)
[12:07:11] Emperor: don't think we need a task on it. I'll forward the NOC mail to the others who have accounts to let them know.
[12:07:12] cheers
[12:08:10] thanks, I'll mark the mail completed
[12:35:39] is there a known issue/regression with sre.ganeti.reimage? It worked for me 1-2 weeks ago, but currently it's failing on urldownloader1003, before I have a closer look, was wondering whether it's a known thing
[12:42:36] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[12:43:45] moritzm: is that T331478 by any chance?
cc slyngs
[12:43:50] T331478: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478
[12:50:03] no, this seems to be a different error, I'll have a closer look in a bit
[12:53:02] I did merge a trivial change today
[12:53:05] so might be my fault
[12:53:29] I can look at logs in a few minutes
[12:54:06] moritzm: What's the error ?
[12:58:10] AFAICT from the logs, the installation completes, the boot order is setup back to disk and then it's polling in vain for "cat /proc/uptime"
[13:29:18] * volans back
[13:29:22] checking logs
[13:31:58] moritzm: AFAICT it fails to get an uptime after in d-i, so basically waiting for the new system to boot up
[13:32:32] did you check the ganeti console?
[13:37:48] gnt-instance console was failing to open the connection via socat, this might be an issue with the earlier VM creation, will have a closer look later
[13:59:06] volans: Just updated the patch for T331478, let me know if you prefer just the sleep and none of the fancy stuff I felt like writing
[13:59:07] T331478: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478
[14:01:25] Aaah, normal sleep it is, CI check does not like recursion
[14:01:51] slyngs: if you want to retry we do have the @retry decorator ;)
[14:02:06] but I fear that might not be the issue
[14:02:34] because even if data in netbox is temporarily out of sync for the VM
[14:02:45] Seems a little weird that it would be the problem, given that data should already be in Netbox
[14:03:06] you get the hostname of the physical ganeti host from the ganeti RAPI and then check its connection
[14:03:14] so I don't see why it should not be there or be none...
[14:03:57] * volans should not have replied yesterday late night to the task probably ;)
[14:04:30] Is this related to the name "connected_endpoint"... does that actually check if the device is connected?
[14:04:49] No, it's the ganeti node, not the VM
[14:05:14] the other option is to temporarily add some debugging logs, like:
[14:06:21] logger.debug('node=%s, ip=%s, iface=%s, sw_iface=%s, switch=%s', node, node.primary_ip, node.primary_ip.assigned_object, node.primary_ip.assigned_object.connected_endpoint, node.primary_ip.assigned_object.connected_endpoint.device)
[14:06:44] ah no, that would fail, you need one logger per line
[14:08:10] Hmm, I feel like netbox sync and a retry would work
[14:08:26] >>> node.primary_ip.assigned_object
[14:08:26] enp94s0f0np0
[14:08:37] that's the iface of the ganeti host
[14:08:42] so that can't be out of sync
[14:08:48] the sync syncs only VMs data
[14:09:04] yeah and even if a VM moves, at worst it should get the connected_endpoint of the wrong ganeti host
[14:09:41] Then is it a bug in netbox, that causes the connected_endpoint to show up as None
[14:09:59] Or is it temporarily None in Netbox
[14:11:39] The thing that fails is getting the connected_endpoint for the Ganeti host, not the VM, or am I way off?
[14:12:03] yes, in node.primary_ip.assigned_object.connected_endpoint.device
[14:12:09] node is the ganeti host
[14:12:16] assigned object is the iface on the ganeti host
[14:12:24] connected endpoint is the iface on the switch
[14:12:27] device is the switch
[14:13:41] Good, the assigned_object is the physical hardware NIC, so that's not going anywhere
[14:14:18] yep, I dunno why I thought it could have been the netbox sync yesterday
[14:14:21] sorry
[14:14:51] I say we slap a retry on the _clear_dhcp_cache
[14:16:48] give me 5, I'm checking one last thing
[14:17:01] 👍
[14:18:40] I seem to be able to replicate:
[14:18:42] https://phabricator.wikimedia.org/P45512
[14:20:03] actually I hadn't filtered for active
[14:20:20] edited paste, it does happen on some it seems
[14:20:40] topranks: That's in the Netbox shell ?
[14:21:37] yes
[14:21:57] I think the issue is that the primary IP is connected to a bridge device called "private" on some of the hosts
[14:22:02] which has no connection
[14:22:03] topranks: print just primary_ip.assigned_object.connected_endpoint
[14:22:06] so we see which one is nne
[14:22:08] *none
[14:22:15] ganeti6001 - private
[14:23:19] also I can't explain why it worked for me yesterday
[14:23:20] https://phabricator.wikimedia.org/T331478#8674505
[14:23:42] ahhh got it, the bug reports ncredir5001 but the error is actually for ncredir6001
[14:23:45] :/
[14:24:45] volans: The good news is that WE didn't break it :-)
[14:24:55] slyngs: no patch needed probably
[14:24:59] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (Volans) >>! In T331478#8674505, @Volans wrote: > From a quick look the current data is correct and doesn't error out: > ` >>>> node.pr...
[14:25:20] Nope
[14:25:21] Seems only to be an issue in drmrs
[14:25:22] https://phabricator.wikimedia.org/P45515
[14:25:43] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (Volans)
[14:25:48] It's another reason I plan to hassle everyone to agree on T234207 :)
[14:25:50] T234207: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207
[14:26:13] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (ssingh) >>! In T331478#8676530, @Volans wrote: >>>! In T331478#8674505, @Volans wrote: >> From a quick look the current data is correc...
[14:27:02] Sorry meant T296832 / https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/822439/
[14:27:03] T296832: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832
[14:28:08] I did also re-run the puppetdb import script
[14:28:09] just in case
[14:29:06] it's a noop
[14:29:32] topranks: so the interesting part
[14:29:42] if I run in dry-run mode the netbox script to import from puppetdb
[14:29:46] I get a lot of changes
[14:29:48] For me the answer is to merge the patch I have above, then we can change the reimage script to say
[14:29:54] that I think means that the way we configure things now is different
[14:30:01] and drmrs was setup *after* we changed it
[14:30:14] if type = bridge, find member ports, get connected_endpoint of that
[14:30:34] volans: yeah I don't know why drmrs is different than the others here
[14:30:57] I think it was just setup after, while the others never got updated, we run the import from puppetdb only on reimage
[14:31:23] https://netbox.wikimedia.org/dcim/interfaces/23889/
[14:31:31] private's type is 1000BASE-T (1GE)
[14:31:40] no related interfaces
[14:33:03] actually, comparing ganeti6001 and ganeti5001
[14:33:12] both have same network configuration on the host
[14:33:29] Netbox matches the host NIC naming/IPs for ganeti6001
[14:33:33] yes, but we never re-imported data into netbox from puppetdb for $reasons
[14:33:38] Netbox is outdated it seems for ganeti5001
[14:33:53] indeed yeah seems like it
[14:34:15] so we have 2 problems
[14:34:34] 1) netbox out of sync from puppetdb, unclear why (changes applied via puppet and not during reimage?)
[14:34:53] 2) the way the data is stored in netbox doesn't allow us to get the ganeti host connection
[14:35:00] We either merge my patch, re-import all host interfaces from puppet, and write clean code to deal with this situation, using the new vars to get the physical link from the 'private' bridge
[14:35:08] Or we come up with some wonky hack :)
[14:35:36] On 1) above, yes the additional interfaces/bridges are created by puppet
[14:35:46] for Ganeti hosts the primary IP is moved from the physical device
[14:35:56] yes but were they created after some changes in puppet during normal operations?
[14:36:04] a bridge is created ('private'), the IP is moved to it, and the physical int is made a member of the bridge
[14:36:06] because at last reimage the data should have been imported into netbox
[14:36:13] unless the reimage failed and people finished them manually
[14:36:18] skipping the netbox step
[14:36:54] moritzm may be able to comment
[14:36:55] so either we changed the way we setup this and applied it without a reimage, or last reimage failed and not all steps were done
[14:37:22] I think there is another complication - in that the switch needs to be manually changed from access port (just on private vlan), to trunk for the ganeti hosts
[14:37:50] puppet checks for this and fails until it is done, usually requiring a manual change on the switch in netbox
[14:38:21] SRE-tools, Infrastructure-Foundations: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (Volans) We do have an understanding of the issue, we're discussing how to fix it. It's basically inconsistent data in netbox.
[14:38:25] I'm not 100% sure if it's related, but perhaps it's why the puppet-changes to the NICs didn't happen at first, and the data was imported to netbox with the original device setup as it was after reimage
[14:38:47] that could explain it
[14:39:00] meaning our current puppetization is broken as it doesn't setup the host in one go
[14:39:15] (even if it's because of an external limitation)
[14:39:29] yeah, invariably moritz pings me or Arzhel to change the switch port to trunk and run homer when he reimages one
[14:39:51] can we do this from the cookbook itself like we do the configure switch port?
[14:40:02] only needed for new ones, though
[14:40:03] ultimately it all comes down to how we drive the NIC configuration from puppet, and we don't want to update the switch config automatically based on what we import back to Netbox
[14:40:21] for reimages of existing hosts the existing switch config persists
[14:41:27] I am planning to adjust the Netbox 'provision server' script this quarter so the switch port can be created when DC-ops run that as a trunk with the correct vlans
[14:41:55] it came up only last night when b.black had an issue with missing trunk setup for LVS hosts
[14:41:56] so we need a short term fix for now I guess
[14:42:19] is this only for the DHCP clear ??
[14:42:28] the short-term fix is don't run it in the ganeti reimage cookbook
[14:42:37] technically yes, and we could detect it in another way
[14:42:37] we have no Ganeti hosts connected where it would be needed
[14:43:06] we will down the road (codfw) - but it's likely we'll have a better solution to the whole dhcp clear thing by then anyway
[14:45:44] slyngs: let's go with not doing the clear dhcp cache for VMs on one side
[14:46:12] topranks: I'd like to re-run the puppet import in netbox for all ganeti hosts, so that at least we have consistent data
[14:46:58] thoughts?
[14:47:00] yeah that's definitely not a bad idea.
[14:47:06] I'd rather merge my patch https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/822439/
[14:47:17] I can have a look at that too
[14:47:20] and then do it, but we can do it again after that also
[14:47:30] I think we maybe need to discuss, perhaps when Arzhel is back
[14:47:57] but no harm in running the import script for now, if we adjust the VM reimage so it doesn't try to get the connected_device
[14:48:33] if we merge my patch we can adjust the logic if primary_ip is on a 'bridge' type interface, and get the member device that has a real link instead
[14:48:42] but we can discuss again
[14:48:49] +1
[15:00:30] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (SLyngshede-WMF) @BCornwall / @ssingh we've removed the clear dhcp cache part of the cookbook. It's technically not need at this point,...
[15:00:38] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (SLyngshede-WMF) Open→Resolved
[15:25:59] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (cmooney)
[15:30:45] SRE-tools, Infrastructure-Foundations: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (ssingh) >>! In T331478#8676676, @SLyngshede-WMF wrote: > @BCornwall / @ssingh we've removed the clear dhcp cache part of the cookbook. It's technically not...
[15:55:26] SRE-tools, Infrastructure-Foundations: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (cmooney) >>! In T331478#8676676, @SLyngshede-WMF wrote: > @BCornwall / @ssingh we've removed the clear dhcp cache part of the cookbook. It's technically not...
[15:57:54] SRE-tools, Infrastructure-Foundations: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (cmooney)
[22:58:55] SRE-tools, DNS, Infrastructure-Foundations, Traffic, Patch-For-Review: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (BCornwall) @BBlack / @Vgutierrez is https://gerrit.wikimedia.org/r/c/operations/dns/+/793728 something that you're am...
[22:59:01] SRE-tools, DNS, Infrastructure-Foundations, Traffic, Patch-For-Review: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (BCornwall) Open→Stalled
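Note on the bridge idea floated at 14:30:14 and 14:48:33 ("if type = bridge, find member ports, get connected_endpoint of that"): a minimal sketch of that lookup, not the cookbook's actual code. It uses plain dataclasses as stand-ins for Netbox objects. The `Interface` and `Node` classes, the `bridge` back-reference, the `"bridge"` type string and the host/switch names are all assumptions made for illustration; the real data model depends on how the netbox-extras patch ends up representing bridge membership.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Interface:
    """Hypothetical stand-in for a Netbox interface record."""
    name: str
    type: str = "1000base-t"
    device: Optional[str] = None  # name of the device that owns this interface
    connected_endpoint: Optional["Interface"] = None  # far end of the cable, if any
    bridge: Optional["Interface"] = None  # assumed: bridge this interface is a member of


@dataclass
class Node:
    """Hypothetical stand-in for a Ganeti host as seen in Netbox."""
    name: str
    primary_iface: Interface  # interface holding the primary IP
    interfaces: List[Interface] = field(default_factory=list)


def find_switch(node: Node) -> Optional[str]:
    """Return the name of the switch the node is cabled to, or None if unknown."""
    iface = node.primary_iface
    # Simple case: the primary IP sits on the physical NIC, which has a cable.
    if iface.connected_endpoint is not None:
        return iface.connected_endpoint.device
    # Bridge case: the primary IP sits on a bridge with no cable of its own,
    # so look for a member interface that does have a physical link.
    if iface.type == "bridge":
        for member in node.interfaces:
            if member.bridge is iface and member.connected_endpoint is not None:
                return member.connected_endpoint.device
    # Unknown layout: report None instead of raising AttributeError on None.
    return None
```

With this shape, a host whose primary IP lives on a 'private' bridge resolves through the bridge's member NIC, and a host whose 'private' interface is still recorded as a plain unconnected interface yields None, which the caller can treat as "skip the DHCP cache clear" rather than crashing.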