[10:28:44] topranks: is the homer config check failure against asw-b-codfw something known or to look at?
[10:34:09] SRE-tools, Infrastructure-Foundations: firmware-upgrade cookbook fails after successful upgrade - https://phabricator.wikimedia.org/T331135 (jbond) p:Triage→Medium
[10:41:10] volans: The switch seems healthy enough overall
[10:41:37] there are some CPU spikes but nothing alarming
[10:41:50] did you already try a homer run?
[10:42:09] the "commit check" is timing out, manually it's taking ~35 seconds to run on the switch
[10:42:15] Homer is timing out after 30 seconds
[10:42:39] interesting...
[10:42:54] 1) how much slower is that than the others? if much.... why? :)
[10:43:08] 2) if "expected" we can increase the timeout, either globally or per-device
[10:44:11] 1) Totally unscientific answer - not by very much - it always takes a while on them (runs against all 8 in the VC)
[10:44:15] 2) May be a good idea
[10:45:36] Since the recent upgrade the CPU usage is higher (newer Junos might just be doing more things in the background perhaps)
[10:46:11] There are regular spikes every 2 days though which are a little odd
[10:46:12] https://librenms.wikimedia.org/graphs/to=1678099200/type=device_processor/from=1675420800/legend=no/lazy_w=808/device=96/
[10:46:45] although we get the timeout outside of those times
[10:47:12] those don't look too nice :/
[10:47:39] asw-a-codfw exhibits the same pattern after the upgrade
[10:48:45] I'll take a closer look at them and see if I can work anything out
[10:49:21] overall I'm not sure there is an issue, my guess is the spikes are probably some scheduled job the newer OS is doing
[10:49:28] the 6h and 24h graphs show more, smaller spikes from the FPCs of the VC, the librenms legend doesn't really help to tell exactly which one in all cases
[10:52:59] topranks: memory also increased by 40%+, from ~30-40% to ~50-75%
[10:53:10] https://librenms.wikimedia.org/graphs/to=1678099800/type=device_mempool/from=1675421400/lazy_w=808/device=96/%3F_token=9PUXQ6I8RZhW9a3NDzu23PPIbB1UGaalTB0LjCmR/_token=9PUXQ6I8RZhW9a3NDzu23PPIbB1UGaalTB0LjCmR/
[10:54:44] yeah
[10:55:03] could well be normal/expected but I'll try to validate that
[10:55:38] In terms of the timeout I don't think it's unreasonable for us to increase it from 30 to 60, possibly. Perhaps wait till Arzhel is back to get his input.
[10:57:23] let's increase it for now and revisit later, better to have people be able to run homer to make changes than not :)
[22:53:07] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (cmooney)
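
Note on the 10:42-10:57 timeout discussion: the "globally or per-device" option could look roughly like the sketch below. This is only an illustration under stated assumptions. The names `DEFAULT_TIMEOUT`, `resolve_timeout`, `fits_within` and the `overrides` mapping are hypothetical and are not Homer's actual configuration or API. Only the 30 second default, the proposed 60 second value, the ~35 second measured run and the device name come from the log.

```python
# Hypothetical sketch: none of these names are taken from Homer itself.

DEFAULT_TIMEOUT = 30  # seconds; the global value that was timing out in the log


def resolve_timeout(device, overrides, default=DEFAULT_TIMEOUT):
    """Return the per-device timeout if one is configured, else the global default."""
    return overrides.get(device, default)


def fits_within(duration, timeout):
    """True if an operation taking `duration` seconds finishes before `timeout`."""
    return duration <= timeout


# Per-device override for the slow virtual chassis; other devices keep the default.
overrides = {"asw-b-codfw": 60}

measured_commit_check = 35  # seconds, as measured manually on the switch

for device in ("asw-b-codfw", "some-other-switch"):
    timeout = resolve_timeout(device, overrides)
    print(device, timeout, fits_within(measured_commit_check, timeout))
```

A per-device override keeps the stricter default for every other device, while raising the global default is simpler and also covers asw-a-codfw, which shows the same pattern after the upgrade.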