[07:54:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:59:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:37] XioNoX || topranks: I'm currently running a bunch of decoms for k8s workers and I realise that we don't remove the BGP config (even when running decom with --run-homer) during the process. Is that an issue or does that happen at a later point when the hosts get removed from netbox?
[08:17:09] jayme: there are 2 things worth improving: the BGP flag stayed set, so if the host is re-purposed it tries to set up BGP for it
[08:17:17] jayme: what do you mean you don’t remove the bgp config? you mean the field in Netbox?
[08:17:50] XioNoX: heh I’d not thought of that repurposed host issue, yeah that could happen
[08:17:59] then, we should have the decom cookbook run homer on the core routers if the BGP sessions were established there. That will become a NOOP as devices peer with their ToR
[08:18:02] I mean the field in netbox, yes... and as a consequence some config on the switches
[08:18:10] topranks: it happened a couple of times, not a big deal
[08:18:30] jayme: I guess outside of that, unless the host status is “active” in Netbox we won’t generate the BGP config with Homer, regardless of what the BGP flag says
[08:19:19] and last, by default, for devices connected to Juniper switches, the decom cookbook will just update the port info on their ToR but not touch the BGP peering settings; this will be "fixed" once we set the cookbook to always run homer by default
[08:20:01] I was running the cookbook with --run-homer since I thought it's required to remove the BGP peering settings from ToR/CR
[08:20:17] XioNoX: I guess the decom cookbook should set bgp to false ideally?
[08:20:19] but to my untrained eye it does not seem like it does that
[08:21:37] jayme: using --run-homer will do the right thing if the host is peering with its ToR, but won't if it's peering with the core routers
[08:21:45] topranks: yeah
[08:22:21] XioNoX: okidoke, I'll run homer for cr* then when the cookbook has completed
[08:22:58] jayme: thx!
[08:26:02] jayme, topranks: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1275806
[08:27:18] new pki discovery2026 up and running!
[08:28:04] lovely...more expiry dates! :D
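
An editorial aside on the decom/BGP thread above: the "set bgp to false" idea amounts to clearing a flag on the host's device record in Netbox at decom time. A minimal sketch of that with pynetbox, the Python client for the NetBox API; the URL, token handling, hostname, and the `bgp` custom-field name are placeholder assumptions for illustration, not the actual cookbook code (see the Gerrit change linked above for that):

    import pynetbox

    # Placeholder endpoint and token; the real values come from cookbook config.
    nb = pynetbox.api("https://netbox.example.org", token="REDACTED")

    device = nb.dcim.devices.get(name="kubernetes1001")  # hypothetical host
    if device and device.custom_fields.get("bgp"):
        # Clear the flag so a re-purposed host doesn't get BGP set up again.
        device.custom_fields["bgp"] = False
        device.save()
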
[08:28:07] the next step is to migrate some low-impact clients to it and verify that everything looks as expected
[08:28:15] so let's start with wikikube
[08:28:16] :D
[08:28:34] elukey: you can go for wikikube-staging in codfw
[08:28:57] jayme: too easy, I want prod or nothing
[08:28:58] and test out the force-refresh-cert mechanism of your choosing there
[08:30:25] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:28] I would suggest trying to be a bit clever about which certs to force-refresh so we don't refresh all at the same time
[08:31:52] during the upgrade of wikikube I saw a couple of errors bubbling up from the PKI since it was a bit overloaded by all the cert requests in a short timeframe (I think)
[08:32:21] it settles after some time since cert-manager/cfssl-issuer retries
[08:35:25] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:38:26] XioNoX: related Q: when I get "JunOS config commit failed - see above - device may need Homer run", how do I know which switch to run homer for?
[08:38:47] jayme: can you share the full output?
[08:38:48] "see above" is a bit ambiguous when running decom for 20 hosts :)
[08:39:01] or the relevant part of the output :)
[08:39:33] detecting what's relevant is the issue here I suppose :)
[08:39:53] jayme: send me as much as possible and I'll have a look
[08:40:19] jayme: I don't think I've implemented it thinking that people would decom many servers at once, so I might need to add some logging
[08:40:50] yeah, that's what I was aiming for :)
[08:40:52] https://paste.debian.net/hidden/91171253
[08:41:02] so it's probably asw2-a-eqiad.mgmt.eqiad.wmnet
[08:41:33] jayme: yeah, also not using --homer?
[08:41:39] I am
[08:41:50] (using --run-homer)
[08:42:15] jayme: I think it's just `--homer` - parser.add_argument('--homer', action='store_true', help='Use Homer to configure the switches')
[08:42:24] oh...yeah. that then
[08:43:16] since the message ends up in a phab comment it would be nice if it contained instructions on what to do exactly, so we don't have to try to parse that from the extended log
[08:44:07] especially since I have 7 of 44 failing with that error
[08:44:09] tbh it shouldn't really fail, this is just a generic error message
[08:44:35] XioNoX: do you know why it's tripping up here? the commands should work, right?
[08:44:57] /var/log/spicerack/sre/hosts/decommission-extended.log on cumin1003 if you're very curious
[08:44:59] topranks: no idea, but I know the fix :)
[08:45:12] what's that, running homer?
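
Stepping back to the force-refresh discussion above: with cert-manager, deleting a Certificate's backing Secret makes the controller reissue it, so one way to avoid hammering the PKI is to roll the secrets in small batches with a pause in between. A rough sketch using the kubernetes Python client; the secret names, namespace, and batch/pause values are invented for illustration:

    import time
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    # Hypothetical cert-manager-managed TLS secrets to refresh.
    secrets = ["svc-foo-tls", "svc-bar-tls", "svc-baz-tls", "svc-qux-tls"]
    BATCH, PAUSE = 2, 300  # two at a time, five minutes between batches

    for i in range(0, len(secrets), BATCH):
        for name in secrets[i:i + BATCH]:
            # cert-manager notices the missing Secret and requests a new cert.
            v1.delete_namespaced_secret(name=name, namespace="example-ns")
        time.sleep(PAUSE)

cert-manager's cmctl also offers a `renew` subcommand that flags Certificates for renewal without deleting anything; either way the point is to spread the requests out over time.
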
[08:45:21] but be aware that there are 3 parallel decom runs since the cookbook is limited to 20 hosts
[08:45:54] topranks: yeah
[08:46:12] topranks: but I forgot if my implementation supports VC or not
[08:47:29] yeah it makes no sense to me, looking at the homer diff for asw2-a-eqiad those commands should result in the same thing, there is nothing else being done it seems
[08:51:30] topranks: yeah, no VC support for that current code path - https://github.com/wikimedia/operations-software-spicerack/blob/master/spicerack/netbox.py#L695
[08:52:11] ah ok gotcha
[08:52:22] I feel like at that point it's best to just wait for the last 4 VCs to be gone (eqiad A/B, ulsfo, eqsin), which should be a few months max now
[08:52:56] what is a VC in this context?
[08:53:21] jayme: virtual chassis, it's a virtual switch made of many real switches
[08:53:34] ack
[08:53:39] so the management endpoint is not the same hostname as the real switches
[08:54:33] jayme: right now I think defaulting to using the `--homer` flag is probably the best advice. we've been phasing out those VCs; we only have row A/B in eqiad and the two switches in Singapore left
[08:54:34] I ran homer for asw2-b-eqiad.mgmt.eqiad.wmnet and there is no diff for asw2-a-eqiad.mgmt.eqiad.wmnet ... so I would assume I'm good?
[08:55:01] jayme: yep, sorry I should have said I ran it against asw2-a-eqiad just now trying to see what was going on (hence no diff)
[08:55:13] ack
[08:55:21] topranks: but --homer won't work with the VC :( as it uses that code path to figure out which switch to run homer on
[08:55:25] FIRING: [2x] SystemdUnitFailed: cfssl-ocspserve@discovery2026.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:55:35] and yes, I've updated our docs to use --homer and run homer manually on cr*DC after
[08:55:55] hey elukey, you broke a thing :D
[08:55:59] jayme: yeah sorry, but --homer won't run on eqiad rows A/B :(
[08:56:33] so better to not use --homer (for now), and I'll look at implementing something to clear the BGP
[08:56:42] ack
[08:56:51] XioNoX: oh yeah ffs
[08:57:41] XioNoX: in case you do, I assume you will make it so it runs homer on cr* or ToR depending on the topology, right?
[08:57:51] so we don't need to clean up manually
[08:58:04] jayme: that's the idea
[08:58:10] wonderful idea
[08:58:12] love it!
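
For readers following the VC problem: the crux is which device name to hand to Homer once you find the switch a decommissioned host uplinks to, and for a virtual-chassis member that is the VC's management endpoint rather than the physical member switch. A hand-wavy sketch of that lookup with pynetbox (this is not the spicerack code linked above, and attribute names such as connected_endpoints vary a bit across NetBox versions):

    import pynetbox

    nb = pynetbox.api("https://netbox.example.org", token="REDACTED")  # placeholder

    def homer_target(hostname: str) -> str:
        """Best-effort lookup of the switch (or VC) Homer should run against."""
        host = nb.dcim.devices.get(name=hostname)
        for iface in nb.dcim.interfaces.filter(device_id=host.id):
            for peer in iface.connected_endpoints or []:
                # Re-fetch the switch to get its full record, VC info included.
                switch = nb.dcim.devices.get(peer.device.id)
                if switch.virtual_chassis:
                    # VC members share one management endpoint named after the
                    # chassis (e.g. asw2-a-eqiad), not the member switch.
                    return str(switch.virtual_chassis.name)
                return str(switch.name)
        raise ValueError(f"no switch uplink found for {hostname}")
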
[08:58:14] :) [08:58:22] jayme: I was going to say skipping this might be best, it'll come in to us as a diff to fix up and thus back on our plate as it's kind of our fault :P [08:58:23] jayme: but it will only work on full moon days [08:58:50] but not if Saturn and Jupiter are aligned [08:59:07] XioNoX: thats fine, just make it the cookbook check the calendar please and make it fail if it's not full moon [08:59:13] or schedule via at [08:59:32] :) [08:59:45] now I wonder if at supports "full moon" [08:59:54] that would be a fun easter egg [09:00:25] FIRING: [3x] SystemdUnitFailed: cfssl-ocspserve@discovery2026.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:01:22] jayme: yeah https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275818 [09:01:34] there was the zuul intermediate down below already using the port, missed it [09:02:26] ah, sneaky...it comes after the 200X ones [09:02:54] elukey: maybe move zuul up after puppet_rsa? [09:03:26] to make it less confusing next time [09:04:41] sure [09:05:29] jayme: another qq - IIUC the auth secret used by cfssl issuer is related to a specific cluster, so I can duplicate it for discovery2026 right? [09:05:43] in puppet private's hieradata/role/common/deployment_server/kubernetes.yaml [09:07:16] (patch updated) [09:10:02] elukey: yeah, I think each cluster has it's own key [09:12:16] but I would think you need to copy them in private puppet so that the existing secrets in k8s are valid for discovery and discovery2026 [09:12:55] then you just need to change the label in helmfile.d/admin_ng/cert-manager/cfssl-issuer-values.yaml to discovery2026 [09:17:20] yep yep my plan is to copy discovery -> discovery2026's settings in private, then change the yaml for public [09:18:22] jayme, topranks: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1275821 of course not tested [09:18:57] but the theory is that it should to all the right things [09:19:39] elukey: about https://phabricator.wikimedia.org/T418899#11842934 the last comment in the thread is what's exposed on a lot of dells and we need to see if it's there on idrac 10 too or not [09:23:40] lol, I read LEGACY_VOLANS on line 32 :D [09:25:18] hahahah [09:25:25] FIRING: [3x] SystemdUnitFailed: cfssl-ocspserve@discovery2026.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:26:08] XioNoX: sorry I may have pasted the wrong link, what I thought I found was that on idrac 10 there seems no oem-like way to set the broadcom's lldp settings [09:26:19] the only option seems to be for the BMC's nic [09:26:32] I think they may have removed it [09:26:48] ah, ok :( [09:27:22] I have dumped the scp settings on a file, I'll ask Jenn to look around, if there is something we may get lucky [09:30:25] FIRING: [3x] SystemdUnitFailed: cfssl-ocspserve@discovery2026.service on pki1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:32:47] jayme: thx for the review, would you be down to test the gerrit CR? [09:33:10] XioNoX: I would but I don't have any more hosts to decom rn [09:33:26] jayme: sounds good, let me know when you do! 
[09:34:54] I'll go and find some that ideally peer with cr* and note it on the task
[09:39:00] XioNoX: I fear this is not going to happen soon :/
[09:39:36] jayme: no pb, it can wait
[09:42:43] there is not even anything planned for refresh that peers with cr* ... so it will probably get forgotten by then
[09:43:43] it will be in Q3 ..
[09:45:16] jayme: any decom would help to test the change
[09:46:04] well, any I can do earlier :)
[09:50:25] RESOLVED: SystemdUnitFailed: cfssl-ocspserve@discovery2026.service on pki2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:05:47] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11843281 (FCeratto-WMF) In zarcillo we have the relation `host <-> role <-> rack` and we can label replicas and candidates as depoolable (but not primary/DC masters). We can u...
[10:28:26] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11843361 (elukey) On iDRAC 10 I see the following: {F77111523}
[11:06:09] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11843463 (elukey) Applied and took the SCP dump, diffed with its previous config, nothing stands out. It seems that we are not able anymore to disable LLDP v...
[12:52:09] I think I finally got the cfssl issuer config for k8s staging: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1275812
[14:11:05] jhathaway: I wonder what's the way forward with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1266205. I still don't understand why exactly that is failing in the way it is
[14:12:27] yeah, it's mysterious, I can take a go at reproducing it taavi
[14:43:11] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#11844857 (Papaul)
[15:29:34] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Standardize management routers interfaces - https://phabricator.wikimedia.org/T421674#11845364 (ayounsi)
[18:26:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:26:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed