[00:13:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:50:33] <wikibugs>	 10netbox, 10Infrastructure-Foundations: Netbox report test_mgmt_dns_hostname - rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843 (10ayounsi)
[11:50:37] <wikibugs>	 10netbox, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Netbox to 3.6.x - https://phabricator.wikimedia.org/T336275 (10ayounsi)
[11:52:42] <wikibugs>	 10netbox, 10Infrastructure-Foundations: Netbox report test_mgmt_dns_hostname - rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843 (10ayounsi) Thanks for having a look! I boldly made it a dependency of the Netbox 3.6 upgrade as we're running an ancient Netbox version. We can revisit if...
[12:03:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:09:41] <XioNoX>	 topranks: Juniper doesn't have recommended versions anymore. Now it's "Suggested Releases to Consider and Evaluate"
[12:13:17] <topranks>	 XioNoX: heh yeah I seen that alright, some wordplay from their legal dept. I assume 
[12:14:00] <XioNoX>	 for sure
[12:18:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:45:03] <wikibugs>	 10netbox, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Netbox to 3.6.x - https://phabricator.wikimedia.org/T336275 (10ayounsi)
[12:45:16] <wikibugs>	 10netbox, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Netbox to 3.6.x - https://phabricator.wikimedia.org/T336275 (10ayounsi)
[13:13:47] <sukhe>	 XioNoX: our favourite puppet ordering bugs are back for anycast stuff :]
[13:13:50] <sukhe>	 https://puppetboard.wikimedia.org/report/durum4001.ulsfo.wmnet/b355d6f82522880bcfd616448325973b836dd1e0
[13:13:57] <sukhe>	 Could not set 'directory' on ensure: Could not find user bird
[13:14:08] <XioNoX>	 uh
[13:14:10] <XioNoX>	 annoying
[13:14:14] <sukhe>	 yeah, really
[13:14:25] <sukhe>	 especially since that nothing changed but these affect both bullseye and bookworm
[13:14:31] <XioNoX>	 we don't have the proper dependency chain in puppet?
[13:14:38] <XioNoX>	 or is it related to the package itself?
[13:14:53] <sukhe>	 no, dependency chain seems to be missing, we thought we fixed it and that was the end of it but yeah, clearly not
[13:14:59] <sukhe>	 change from 'absent' to 'directory' failed: Could not set 'directory' on ensure: Could not find user bird
[13:15:10] <sukhe>	 bird2 package creates the bird user, so that needs to happen first
[13:15:13] <sukhe>	 will patch it soon
[13:15:49] <topranks>	 yeah package install first I guess is the way.  you could also create the user from puppet beforehand if needed.
[13:16:33] <sukhe>	 topranks: in theory the package install should be happening first but yeah, we need more strict ordering here
[13:31:53] <XioNoX>	 topranks: https://gerrit.wikimedia.org/r/c/operations/software/homer/+/947352 :) I figured it could help for the new deployment (even if just local cherry-pick). volans I couldn't figure out how to fix the tests though
[13:34:47] <wikibugs>	 10SRE-tools, 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Improve Homer output when Juniper device rejects config - https://phabricator.wikimedia.org/T328747 (10ayounsi) a:03ayounsi
[13:34:52] <topranks>	 XioNoX:  awesome!  
[13:35:16] <topranks>	 yeah that is definitely a real pain point when working on templates and other homer changes 
[13:36:09] <XioNoX>	 agreed
[13:42:19] <wikibugs>	 10netbox, 10netops, 10Infrastructure-Foundations, 10SRE: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) I sent a new email to Juniper yesterday to ask again about the best next steps here.
[13:44:43] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (10ayounsi) 05Resolved→03Declined
[13:44:51] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10ayounsi)
[13:44:59] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10ayounsi)
[13:45:13] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (10ayounsi) 05Stalled→03Resolved a:03ayounsi Boldly closing this as Katran will solve some if not all those limitations.
[13:48:51] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10ayounsi) @Vgutierrez do you know how the future L4LB will handle ICMP PTB packets? Can it route it to the proper source host?
[14:17:18] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10Vgutierrez) >>! In T253732#9080504, @ayounsi wrote: > @Vgutierrez do you know how the future L4LB will handle ICMP PTB packets? Can...
[14:19:28] <wikibugs>	 10SRE-tools, 10Cloud-VPS, 10Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (10fnegri)
[14:19:37] <wikibugs>	 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (10fnegri) p:05Triage→03Low
[14:36:06] <sukhe>	 let's work on bringing nsa.wikimedia.org to prod :)
[14:38:28] <XioNoX>	 sukhe: I forgot about that but yeah, if removing ns2 is more complex than changing the IP we can change it to that IP, but ideally we remove then re-add
[14:38:49] <XioNoX>	 topranks: when we depool the site we can also remove any anycast prefixes from esams
[14:38:54] <XioNoX>	 esams/knams
[14:40:12] <topranks>	 yep makes sense 
[14:40:42] <topranks>	 just remove them from bgp_out in sites.yaml
[14:42:32] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10ayounsi) 05Open→03Declined Thanks, then like {T253666} I'm going to boldly close this task.
[14:42:48] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10ayounsi)
[14:43:44] <wikibugs>	 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team (FY2023/2024-Q1): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (10fnegri) 05In progress→03Resolved > I'll leave the last page to you.  That last page was https://wikitech.wikimedia.org/wiki/Wikim...
[14:43:50] <wikibugs>	 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri)
[14:44:02] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE: Detect IP address collisions - https://phabricator.wikimedia.org/T189522 (10ayounsi) 05Open→03Resolved a:03ayounsi We have a working solution for the mgmt network (until it's time to split mgmt into smaller subnets). And for production, automation and per...
[14:47:32] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: wmcs.spicerack: Setup a host to run cookbooks from prod network - https://phabricator.wikimedia.org/T276440 (10fnegri)
[14:47:59] <wikibugs>	 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Epic, and 2 others: Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri) 05In progress→03Resolved
[14:51:44] <XioNoX>	 volans: if current esams is unreachable how would automation behave to cleanly decom servers/VMs from monitoring/netbox, etc?
[15:26:15] <topranks>	 that's a very good question 
[15:27:56] <XioNoX>	 I guess worse case it's remove from puppet site.pp (and other repos as usual) then manually edit netbox
[15:29:52] <XioNoX>	 I was wondering if we shouldn't start to decom servers (like netflow, doh, etc) on friday too
[15:30:04] <XioNoX>	 (after the final depool)
[15:30:06] <topranks>	 yeah exactly 
[15:30:26] <topranks>	 I guess we didn't really want to do that "in case" we had to roll back for some reason 
[15:30:40] <sukhe>	 in theory, we can wait for those right? because they won't be serving any traffic, so maybe we can set the site up first and do all these removals later
[15:30:45] <sukhe>	 just thinking on how tmake it easy
[15:30:48] <topranks>	 but if they'll be unreachable post-Sunday we probably best 
[15:31:05] <topranks>	 We *could* take cr2-esams and cr3-esams down Sunday 
[15:31:19] <XioNoX>	 the issue is cr2-esams's linecard
[15:31:30] <topranks>	 And reconfigure cr3-esams ae1 as the gateway for the esams private/public vlans 
[15:32:07] <XioNoX>	 sounds like a lot of changes for a sunday :)
[15:32:10] <topranks>	 i.e. add ae1.100/ae1.103 interfaces to it, and add those vlans to the trunk from esams asws
[15:32:12] <topranks>	 yeah 
[15:32:19] <volans>	 sorry in a meeting will answer after
[15:32:47] <XioNoX>	 Right now I'm assuming cr2's linecard won't come back up
[15:33:18] <sukhe>	 topranks: appreciate the clear email writeup!
[15:33:19] <topranks>	 #if the idea behind that is to keep esams vlans reachable so we don't have to decom in advance, in case we need to roll-back 
[15:33:43] <topranks>	 then that roll back gets very messy if we've disconnected cr2 already 
[15:34:35] <topranks>	 XioNoX: same - wouldn't be surprised if it comes back, but operating on the assumption it won't
[15:35:25] <topranks>	 XioNoX: sry (as if above wasn't confusing enough) - what I was suggesting is make cr3-knams gateway for the current esams vlans 
[15:35:46] <topranks>	 I'm not persuaded it's a good idea but it's an option maybe 
[15:36:31] <XioNoX>	 yeah that's what I had in mind in case we needed to keep both sites running in parallel
[15:36:56] <topranks>	 the fact the WAN transports land on the esams asw's makes it not too hard to pull off if needed 
[15:37:06] <XioNoX>	 but probably better to decom esams at the end, as long as it doesn't cause issues with provisionning new-esams
[15:38:20] <topranks>	 yeah it probably is better to not decom those servers before we provision new-esams 
[15:38:47] <topranks>	 let's see what volans says, but I suspect you're right that they'll need to be reachable to decom 
[15:39:54] <topranks>	 making cr3-knams the gateway for them on Sunday might not be so bad 
[15:43:46] <sukhe>	 topranks: XioNoX: rob is working today so we were able to get the email to markmonitor out already
[15:44:02] <topranks>	 awesome!
[16:11:02] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10ayounsi) Opened high priority case 2023-0809-747283 asking for a RMA.
[16:18:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:22:33] <volans>	 sorry, I'm back, so what's the problem?
[16:22:38] <volans>	 reading backlog
[16:23:01] <XioNoX>	 volans: tl;dr; old-esams might be fully down starting sunday
[16:23:14] <volans>	 this sunday?
[16:23:18] <XioNoX>	 volans: ya
[16:23:23] <volans>	 because of a core router issue?
[16:23:29] <XioNoX>	 volans: yep
[16:23:35] <volans>	 great timining :D
[16:23:57] <volans>	 so, if the host is not reachable the partition table is not wiped
[16:24:10] <volans>	 beside from that the decom works also if the host is on fire
[16:24:16] <XioNoX>	 cool
[16:24:22] <XioNoX>	 volans: same for vms?
[16:24:42] <volans>	 that's untested, because in that case means that the ganeti master is unreachable right?
[16:26:25] <volans>	 let me check the code
[16:27:18] <XioNoX>	 yep
[16:29:06] <volans>	 so yeah how can you delete a ganeti vm if the ganeti cluster is unrechable? :D
[16:29:36] <volans>	 XioNoX: and wouldn't they become reachable via knams?
[16:30:41] <XioNoX>	 volans: nah
[16:30:47] <XioNoX>	 last time we see them maybe
[16:31:01] <volans>	 say goodbay to them from me too
[16:31:01] <volans>	 :D
[16:31:46] <volans>	 topranks: one thing to do if they go down, remove install3xxx from puppetdb so that the reimage will fallback to drmrs
[16:31:52] <volans>	 or will try to use it and fail
[16:33:05] <topranks>	 volans: ok noted thanks 
[16:33:23] <topranks>	 I will pass on your warm regards :)
[16:33:52] <volans>	 :D
[19:33:27] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10cmooney) RMA in progress, Juniper happy with address for replacement and staff at destination are aware of delivery.  I will decom the existing faulty card on Sunday when on site and prep...
[20:18:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed