[00:13:30] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:30] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:30] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:50:33] netbox, Infrastructure-Foundations: Netbox report test_mgmt_dns_hostname - rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843 (ayounsi)
[11:50:37] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.6.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[11:52:42] netbox, Infrastructure-Foundations: Netbox report test_mgmt_dns_hostname - rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843 (ayounsi) Thanks for having a look! I boldly made it a dependency of the Netbox 3.6 upgrade as we're running an ancient Netbox version. We can revisit if...
[12:03:30] (SystemdUnitFailed) firing: (5) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:09:41] topranks: Juniper doesn't have recommended versions anymore.
Now it's "Suggested Releases to Consider and Evaluate"
[12:13:17] XioNoX: heh yeah I seen that alright, some wordplay from their legal dept. I assume
[12:14:00] for sure
[12:18:30] (SystemdUnitFailed) firing: (5) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:45:03] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.6.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[12:45:16] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.6.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[13:13:47] XioNoX: our favourite puppet ordering bugs are back for anycast stuff :]
[13:13:50] https://puppetboard.wikimedia.org/report/durum4001.ulsfo.wmnet/b355d6f82522880bcfd616448325973b836dd1e0
[13:13:57] Could not set 'directory' on ensure: Could not find user bird
[13:14:08] uh
[13:14:10] annoying
[13:14:14] yeah, really
[13:14:25] especially since nothing changed, and it affects both bullseye and bookworm
[13:14:31] we don't have the proper dependency chain in puppet?
[13:14:38] or is it related to the package itself?
[13:14:53] no, the dependency chain seems to be missing; we thought we fixed it and that was the end of it, but yeah, clearly not
[13:14:59] change from 'absent' to 'directory' failed: Could not set 'directory' on ensure: Could not find user bird
[13:15:10] the bird2 package creates the bird user, so that needs to happen first
[13:15:13] will patch it soon
[13:15:49] yeah, package install first I guess is the way. you could also create the user from puppet beforehand if needed.
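(The ordering fix discussed above boils down to a `require` on the package. A minimal sketch, assuming hypothetical resource names rather than the actual profile code: the bird2 package creates the `bird` user, so any file resource owned by that user must be ordered after the package install.)

```puppet
# Hypothetical sketch, not the real manifest: force bird2 (and hence the
# 'bird' user it creates) to be installed before the directory is managed.
package { 'bird2':
    ensure => present,
}

file { '/run/bird':
    ensure  => directory,
    owner   => 'bird',
    group   => 'bird',
    require => Package['bird2'],  # package install (creates user) first
}
```

Without the `require`, Puppet is free to evaluate the `file` resource first, which reproduces the "Could not find user bird" failure seen on durum4001.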
[13:16:33] topranks: in theory the package install should be happening first but yeah, we need more strict ordering here
[13:31:53] topranks: https://gerrit.wikimedia.org/r/c/operations/software/homer/+/947352 :) I figured it could help for the new deployment (even if just local cherry-pick). volans: I couldn't figure out how to fix the tests though
[13:34:47] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Improve Homer output when Juniper device rejects config - https://phabricator.wikimedia.org/T328747 (ayounsi) a: ayounsi
[13:34:52] XioNoX: awesome!
[13:35:16] yeah that is definitely a real pain point when working on templates and other homer changes
[13:36:09] agreed
[13:42:19] netbox, netops, Infrastructure-Foundations, SRE: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (ayounsi) I sent a new email to Juniper yesterday to ask again about the best next steps here.
[13:44:43] netops, Infrastructure-Foundations, SRE, Traffic-Icebox: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (ayounsi) Resolved→Declined
[13:44:51] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (ayounsi)
[13:44:59] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (ayounsi)
[13:45:13] netops, Infrastructure-Foundations, SRE, Traffic-Icebox: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (ayounsi) Stalled→Resolved a: ayounsi Boldly closing this as Katran will solve some if not all those limitations.
[13:48:51] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (ayounsi) @Vgutierrez do you know how the future L4LB will handle ICMP PTB packets? Can it route it to the proper source host?
[14:17:18] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (Vgutierrez) >>! In T253732#9080504, @ayounsi wrote: > @Vgutierrez do you know how the future L4LB will handle ICMP PTB packets? Can...
[14:19:28] SRE-tools, Cloud-VPS, Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (fnegri)
[14:19:37] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (fnegri) p: Triage→Low
[14:36:06] let's work on bringing nsa.wikimedia.org to prod :)
[14:38:28] sukhe: I forgot about that, but yeah, if removing ns2 is more complex than changing the IP we can change it to that IP, but ideally we remove then re-add
[14:38:49] topranks: when we depool the site we can also remove any anycast prefixes from esams
[14:38:54] esams/knams
[14:40:12] yep makes sense
[14:40:42] just remove them from bgp_out in sites.yaml
[14:42:32] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (ayounsi) Open→Declined Thanks, then like {T253666} I'm going to boldly close this task.
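(The `bgp_out` change mentioned at 14:40:42 is just a list edit in the site's config. A hypothetical sketch of what that shape might look like — the real sites.yaml schema and prefixes may differ; the prefix below is an RFC 5737 documentation placeholder:)

```yaml
# Hypothetical sites.yaml fragment. Removing an anycast prefix from a
# depooled site means deleting its entry from that site's bgp_out list.
esams:
  bgp_out:
    - 192.0.2.0/24   # placeholder anycast prefix; delete this line to stop announcing it
```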
[14:42:48] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (ayounsi)
[14:43:44] SRE-tools, Cloud-VPS, Infrastructure-Foundations, cloud-services-team (FY2023/2024-Q1): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (fnegri) In progress→Resolved > I'll leave the last page to you. That last page was https://wikitech.wikimedia.org/wiki/Wikim...
[14:43:50] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Goal, cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (fnegri)
[14:44:02] netops, Infrastructure-Foundations, SRE: Detect IP address collisions - https://phabricator.wikimedia.org/T189522 (ayounsi) Open→Resolved a: ayounsi We have a working solution for the mgmt network (until it's time to split mgmt into smaller subnets). And for production, automation and per...
[14:47:32] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team: wmcs.spicerack: Setup a host to run cookbooks from prod network - https://phabricator.wikimedia.org/T276440 (fnegri)
[14:47:59] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Epic, and 2 others: Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (fnegri) In progress→Resolved
[14:51:44] volans: if current esams is unreachable, how would automation behave to cleanly decom servers/VMs from monitoring/netbox, etc.?
[15:26:15] that's a very good question
[15:27:56] I guess worst case it's remove from puppet site.pp (and other repos as usual), then manually edit netbox
[15:29:52] I was wondering if we shouldn't start to decom servers (like netflow, doh, etc.) on Friday too
[15:30:04] (after the final depool)
[15:30:06] yeah exactly
[15:30:26] I guess we didn't really want to do that "in case" we had to roll back for some reason
[15:30:40] in theory we can wait for those, right? because they won't be serving any traffic, so maybe we can set the site up first and do all these removals later
[15:30:45] just thinking on how to make it easy
[15:30:48] but if they'll be unreachable post-Sunday we're probably best doing it before
[15:31:05] We *could* take cr2-esams and cr3-esams down Sunday
[15:31:19] the issue is cr2-esams's linecard
[15:31:30] And reconfigure cr3-esams ae1 as the gateway for the esams private/public vlans
[15:32:07] sounds like a lot of changes for a Sunday :)
[15:32:10] i.e. add ae1.100/ae1.103 interfaces to it, and add those vlans to the trunk from the esams asws
[15:32:12] yeah
[15:32:19] sorry, in a meeting, will answer after
[15:32:47] Right now I'm assuming cr2's linecard won't come back up
[15:33:18] topranks: appreciate the clear email writeup!
[15:33:19] if the idea behind that is to keep esams vlans reachable so we don't have to decom in advance, in case we need to roll back
[15:33:43] then that rollback gets very messy if we've disconnected cr2 already
[15:34:35] XioNoX: same - wouldn't be surprised if it comes back, but operating on the assumption it won't
[15:35:25] XioNoX: sry (as if above wasn't confusing enough) - what I was suggesting is make cr3-knams the gateway for the current esams vlans
[15:35:46] I'm not persuaded it's a good idea but it's an option maybe
[15:36:31] yeah that's what I had in mind in case we needed to keep both sites running in parallel
[15:36:56] the fact the WAN transports land on the esams asws makes it not too hard to pull off if needed
[15:37:06] but probably better to decom esams at the end, as long as it doesn't cause issues with provisioning new-esams
[15:38:20] yeah it probably is better to not decom those servers before we provision new-esams
[15:38:47] let's see what volans says, but I suspect you're right that they'll need to be reachable to decom
[15:39:54] making cr3-knams the gateway for them on Sunday might not be so bad
[15:43:46] topranks: XioNoX: rob is working today so we were able to get the email to markmonitor out already
[15:44:02] awesome!
[16:11:02] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (ayounsi) Opened high-priority case 2023-0809-747283 asking for an RMA.
[16:18:30] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:22:33] sorry, I'm back, so what's the problem?
[16:22:38] reading backlog
[16:23:01] volans: tl;dr: old-esams might be fully down starting Sunday
[16:23:14] this Sunday?
[16:23:18] volans: ya
[16:23:23] because of a core router issue?
[16:23:29] volans: yep
[16:23:35] great timing :D
[16:23:57] so, if the host is not reachable the partition table is not wiped
[16:24:10] aside from that, the decom works even if the host is on fire
[16:24:16] cool
[16:24:22] volans: same for vms?
[16:24:42] that's untested, because in that case it means the ganeti master is unreachable, right?
[16:26:25] let me check the code
[16:27:18] yep
[16:29:06] so yeah, how can you delete a ganeti VM if the ganeti cluster is unreachable? :D
[16:29:36] XioNoX: and wouldn't they become reachable via knams?
[16:30:41] volans: nah
[16:30:47] last time we see them maybe
[16:31:01] say goodbye to them from me too
[16:31:01] :D
[16:31:46] topranks: one thing to do if they go down: remove install3xxx from puppetdb so that the reimage will fall back to drmrs
[16:31:52] or it will try to use it and fail
[16:33:05] volans: ok noted, thanks
[16:33:23] I will pass on your warm regards :)
[16:33:52] :D
[19:33:27] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (cmooney) RMA in progress, Juniper happy with address for replacement and staff at destination are aware of delivery. I will decom the existing faulty card on Sunday when on site and prep...
[20:18:30] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
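(The decom behaviour volans describes at 16:23:57 — the partition-table wipe is skipped when the host is unreachable, while the external-systems cleanup proceeds regardless — can be sketched as plain logic. Function and step names below are made up for illustration, not the real Spicerack cookbook API.)

```python
def decom_steps(host_reachable: bool) -> list[str]:
    """Illustrative ordering of decommission actions (hypothetical names,
    not the actual sre.hosts.decommission implementation)."""
    steps = []
    if host_reachable:
        # wiping the partition table requires talking to the host itself,
        # so it is only attempted when the host answers
        steps.append("wipe partition table")
    # the remaining steps only touch external systems, so they work even
    # if the host is "on fire"
    steps += [
        "downtime and remove from monitoring",
        "remove from puppetdb",
        "update netbox state",
        "remove DNS records",
    ]
    return steps
```

As the chat notes, the untested corner is Ganeti VMs: deleting a VM requires a reachable Ganeti master, which this host-centric sketch does not cover.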