[13:05:21] volans, jbond: should we fail over netbox to codfw for the network maintenance?
[13:05:53] I'm not up on all the details but I know there were some problems when we ran from codfw recently, right?
[13:08:41] do we actually need to? it seems fine to simply not use Netbox and related cookbooks during the window
[13:11:53] AFAICS the only missing prep part for IF is to stop Puppet (for puppetmaster1001)
[13:13:08] moritzm: yeah I'm inclined to agree, just not using it ought to be fine
[13:13:35] wanted to double check
[13:15:16] moritzm: puppetmaster1001 is in row B?
[13:15:17] puppetmaster1004 is in row A so will be affected
[13:15:41] I don't have a strong preference for this, if anyone would rather fail over, by all means do it :-)
[13:15:59] oh, sorry mental typo since I use it so often to puppet-merge things :-) I meant puppetmaster1004 ofc
[13:20:46] topranks: yes, it was super slow, had never been tested, and some hiera config should have been migrated too and was not
[13:21:03] so we decided to roll it back and there is a task open to evaluate how to do HA for netbox in general
[13:21:23] ok thanks, so no point trying to fail over for this one
[13:24:30] moritzm: for puppetmaster1004 what's the best way to proceed?
[13:24:45] do we set it to offline in hieradata/common/puppetmaster.yaml?
[13:26:14] topranks: I guess netbox will be offline during the upgrade?
[13:26:35] do you need to run any cookbook that gets data from netbox and/or modifies any data in netbox *during* the upgrade?
[13:26:47] volans: yep it will be offline
[13:27:03] but no, we don't run anything that needs to talk to it during the upgrade, it's essentially a straight reboot
[13:28:05] ok, so no homer involved
[13:28:20] topranks: for the previous switch maintenance windows where puppet masters were involved we simply stopped puppet on the affected sites
[13:28:41] volans: correct
[13:28:54] moritzm: ack thanks
[13:29:59] sudo cumin 'A:eqiad or A:drmrs or A:esams' 'disable-puppet "Switch reboot: T329073"'
[13:30:00] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073
[13:30:10] I'll take care of it in 20m
[13:33:19] moritzm: super thanks :)
[13:39:17] topranks: at this point I'd say when you announce the maintenance start in the various IRC channels you could also mention that netbox will be down, so any cookbook runs that require it should be postponed until the end of the maintenance
[13:39:46] volans: yep will do
[13:39:52] thanks!
[13:51:32] I've disabled Puppet
[13:54:19] moritzm: thanks!
[13:59:50] topranks: fyi pki is failed over
[14:00:03] jbond: great thanks :)
[15:19:25] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (cmooney)
[15:35:14] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul)
[15:38:41] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul)
[15:50:42] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul)
[15:51:56] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul) @cmooney I updated the table with lengths between all the racks.
[16:06:58] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul)
[16:20:15] SRE-tools, Infrastructure-Foundations: cookbooks: sre.hosts.reboot-single update to support disabled puppet - https://phabricator.wikimedia.org/T325153 (jbond) Open→Resolved reboot single cookbook now updated
[18:59:55] netops, Infrastructure-Foundations, SRE, ops-codfw: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (cmooney) p:Triage→Low
[19:00:23] netops, Infrastructure-Foundations, SRE, ops-codfw: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (cmooney)
[19:00:30] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney)
[19:03:09] netops, Infrastructure-Foundations, SRE, ops-codfw: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (cmooney)
[19:30:23] netops, Infrastructure-Foundations, SRE, ops-codfw: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (Peachey88)
[21:05:41] SRE-tools, Infrastructure-Foundations: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (Volans) p:Triage→High a:SLyngshede-WMF Interesting, thanks for the report @BCornwall @SLyngshede-WMF could you have a look please? From a quick...
[22:01:40] SRE-tools, Infrastructure-Foundations: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (ssingh) First of all, thanks so much for the Ganeti cookbook -- it's a lifesaver. I can't imagine reimaging these hosts without the cookbook and all the man...
[22:21:57] netops, Infrastructure-Foundations, SRE, ops-codfw: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (Papaul) Open→Resolved a:Papaul @Jhancock.wm thank you we can resolve this task
[22:22:04] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul)