[00:44:23] Puppet, Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (dancy)
[06:11:04] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) This opened {T314998} automatically. Please sync up with Netops before doing the work as live traffic is using the port.
[07:15:11] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ayounsi) I went through the useful https://apps.juniper.net/feature-explorer/select-software.html?typ=1&swName=Junos%20OS&rel=21.2R3&sid=1211&platform=MX204&pi...
[07:48:46] netops, Infrastructure-Foundations: Enable LLDP on SRX facing interfaces - https://phabricator.wikimedia.org/T320229 (ayounsi) p:Triage→Low
[07:49:55] Puppet, Infrastructure-Foundations: Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T319681 (hashar)
[08:07:51] netops, Infrastructure-Foundations: Use Junos BGP graceful-shutdown and shutdown features - https://phabricator.wikimedia.org/T320230 (ayounsi) p:Triage→Low
[09:02:19] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (aborrero)
[09:11:19] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (jbond) p:Lowest→Medium
[09:12:14] Puppet, netops, Infrastructure-Foundations, SRE, User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (jbond) I changed the priority to medium.
The lack of a proper solution for network management causes periodic problems eno...
[09:23:22] netops, Infrastructure-Foundations, SRE, Patch-For-Review: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (cmooney) Open→Resolved Change applied across all routers now, so hopefully the last we see this kind of issue.
[09:29:10] moritzm: when the buster-installer510 is ready lmk so we can merge the cookbook patch and try a reimage of sretest1001
[09:30:38] yeah, I'm currently working on it
[09:30:56] but you can go ahead and merge the cookbook patch already
[10:00:51] netops, Infrastructure-Foundations, SRE, ops-eqiad, Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi)
[10:23:35] volans: I've created a buster+5.10 netboot image with matching firmware and added it to volatile, cookbook patch can be merged (or I can do it as well)
[10:26:01] nice, can I try it with sretest1001? or do you want to? I was planning to reimage with 5.10 and then again with the default one to check that the cookbook is not broken for standard cases
[10:26:05] too
[10:27:24] * volans merging in the meanwhile
[10:28:12] I'm fine either way
[10:28:34] spoke to John earlier and he said sretest1001 and sretest1002 are currently not used for firmware tests
[10:28:49] so we can also do both in parallel
[10:29:33] same for me
[10:31:19] cookbook deployed on the cumin hosts
[10:32:22] ack, I'll run it on sre1001 in a few
[10:34:42] actually wait
[10:34:47] I ran puppet before jenkins merged
[10:34:52] my bad
[10:34:54] re-running
[10:35:59] moritzm: now you can go :)
[10:50:57] note to self, add the media type to the !log message when non-standard
[10:51:30] following at the serial console, it kicked off the PXE boot now
[10:56:40] I'm not getting any output on the serial console except a continuously blinking cursor
[11:00:00] and then after a bit/some timeout it drops back into the installed system
[11:00:43] like it wasn't able to boot via pxe and went back to normal boot?
[11:01:19] the output I saw was "Enumerating boot options", I suppose so
[11:01:49] I'll try forcing a PXE boot over the serial console next
[11:02:08] Oct 7 10:52:12 install1003 atftpd[3760]: Serving lpxelinux.0 to 10.64.48.138:2071
[11:02:18] that's the sretest1001 ip
[11:02:35] too bad the logs don't say which one is serving :/
[11:04:27] moritzm: if you want to test it manually
[11:04:32] I need to modify the dhcp cookbook
[11:04:41] with the same patch basically
[11:05:14] do we have a way to pass a kernel option via the current cookbook/DHCP integration
[11:05:19] via "append"
[11:05:44] not right now, but doesn't seem complex to add
[11:05:45] I think it defaults to "quiet", so if we're able to pass "noquiet" we should be able to see what it attempts to do there
[11:05:59] indeed
[11:06:26] for right now if you want I can set up the dhcp manually and let you play with it at will
[11:08:00] in the meanwhile patch is https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/840103
[11:10:43] moritzm: on install1003
[11:10:46] modify at will /etc/dhcp/automation/ttyS1-115200/sretest1001.conf
[11:11:10] then, once done
[11:11:36] run /usr/local/sbin/dhcpincludes -r commit
[11:14:37] and that will persist against the changes done by the cookbook?
[11:14:59] no, will allow you to trigger a manual pxe reboot
[11:15:07] and watch it at your will
[11:15:09] actually
[11:15:43] yeah it's the same as merging that patch, then modifying manually the DHCP config file, so yeah I'd say not worth going that path
[11:16:05] as the dhcp cookbook doesn't do anything PXE wise
[11:16:37] actually the persistence of that file will make any run of the reimage/dhcp cookbook fail because the file already exists
[11:17:08] so after the commands above, just force a pxe boot and reboot either via ipmi or racadm
[11:23:58] can you revert that /etc/dhcp/automation/ttyS1-115200/sretest1001.conf ? after poking at nginx logs I think I found the error, the wrong base OS was passed. I'd like to re-run the stock reimage cookbook to re-test
[11:25:09] sure
[11:25:22] I just rm-ed the file
[11:25:23] go ahead
[11:25:50] --os buster --pxe-media installer510
[11:25:53] should be the args
[11:26:13] ah right you called it with bullseye
[11:26:28] * volans stepping afk for lunch, bbiab, ping if I'm needed, not far from laptop
[11:28:42] yeah, my subconscious typed bullseye as I had been adding the 5.10 kernel to the image before...
[12:29:28] moritzm: how did it go?
[12:31:36] need to doublecheck a few things in a second run, but seems to be going in the right direction
[12:36:01] topranks: I see that https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/822439 got merged, should we re-run ImportPuppetDB on all hosts (after a dry run) to make sure everything matches reality?
[12:36:17] great
[12:39:39] XioNoX: em that's a good idea I think yeah, I've done a bunch of random tests on different machines on netbox-next, and seems to be ok, but yes probably worth doing
[12:40:03] the change isn't merged yet though?
[12:41:03] if it's merged I think it's better to deploy it right away to not forget
[12:41:18] it's not merged
[12:41:32] I might have caused confusion there, I accidentally gave it a +2 earlier, but I removed it right after
[12:41:43] (had meant to +2 a different change and opened wrong tab)
[12:41:55] ohhh
[12:42:19] I see yeah, because of gate and submit I see the +2 as automatic merge
[12:42:34] and got the gerrit emails, etc
[12:42:39] nevermind then :)
[12:43:24] sry yeah my bad :)
[12:48:19] netops, Infrastructure-Foundations: Junos: send syslog through mgmt_junos - https://phabricator.wikimedia.org/T320244 (ayounsi) p:Triage→Low
[12:53:16] netops, Infrastructure-Foundations: Junos: use mgmt_junos for syslog and ntp - https://phabricator.wikimedia.org/T320244 (ayounsi)
[12:54:48] netops, Infrastructure-Foundations, SRE: Junos: resolve DNS through mgmt_junos - https://phabricator.wikimedia.org/T317175 (ayounsi)
[13:03:13] volans: there's a remaining glitch and some things to puppetise, but I've managed to make the buster+5.10 d-i work in general. going to update the Phab task in a bit with details and next steps
[13:04:19] and the cookbooks side works fine?
[13:08:14] yeah, all fine. I'm going to run a standard buster reimage of sretest1002 next, but don't expect any issue
[13:08:33] moritzm: nice and thank you!
[13:10:10] and there goes the secret plan to trick the Traffic team into an immediate Bullseye migration :-)
[13:10:35] :D
[13:10:48] immediate is unlikely but we are going to work on it this quarter for sure
[13:11:07] we won't push out the changes because it's a short quarter and all that but yeah, at least getting the packages ready
[14:36:11] so the reimage of sretest1002 failed, but it's definitely unrelated to the install change (can't find the keytab)
[14:37:59] k
[14:42:20] netops, Infrastructure-Foundations: Junos: investigate BGP rib sharding - https://phabricator.wikimedia.org/T320264 (ayounsi) p:Triage→Low
[14:48:10] XioNoX: As a starting point, Juniper published a 41-page "day one" document on the topic.
[14:48:20] nice weekend reading :P
[14:48:21] :)
[14:49:50] it's funny it's an optional feature.. at least to an extent
[14:50:11] all the big vendors have been improving their internal BGP processing, data structures etc. for decades
[14:50:25] but normally they just feed it into the mainline release at some stage when it's stable
[14:50:39] don't think I've seen something like this where it's an optional thing.
[14:50:55] I think it's a significant change
[14:51:05] like full re-write of the BGP stack kind of thing
[14:51:49] I'd guess it will eventually be the default
[14:52:39] yeah definitely looks like a big change
[14:53:23] topranks: can you link or forward to me?
[14:53:45] https://www.juniper.net/documentation/en_US/day-one-books/DO_BGPSharding.pdf
[14:53:58] Phab task is linked just above
[14:54:30] ah thanks!
[16:25:28] netops, Infrastructure-Foundations, SRE: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (cmooney) p:Triage→Low