[06:27:56] Mail, Infrastructure-Foundations, SRE, Wikimedia-Mailing-lists, Wikimedia-Incident: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (hashar) Open→Resolved That one has been solved after I have found...
[07:16:53] SRE-tools, Infrastructure-Foundations, Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (ayounsi) p:Triage→Low
[07:39:55] netops, Infrastructure-Foundations, SRE, Traffic: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (ayounsi) Thanks for the feedback! > Weighing this against the costs of maintaining them properly, that's the big question here. Indeed :) I opened...
[07:53:58] netops, Infrastructure-Foundations, SRE: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (cmooney) Resolved→Open Re-opening as there are some EVPN elements outside the 'protocols bgp' context that also need to be added. Will submit patch.
[08:08:22] (SystemdUnitFailed) firing: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:25] netops, Infrastructure-Foundations, SRE, Traffic: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (cmooney) That codfw error is interesting actually, it makes me wonder why we have the "no-resolve" command on those routes? Without that the error wo...
[08:13:22] (SystemdUnitFailed) resolved: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:24:27] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (Volans) I don't manage that spreadsheet, so I have no idea :) If that doesn't work we can easily switch to do the match on the Serial number column, that seems hardcoded for...
[09:31:59] SRE-tools, Infrastructure-Foundations, SRE, Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (Volans) FYI there is already a [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/loadbalancer/restart-pybal....
[12:38:07] hi all :) I want to reimage a host that had a failed reimage attempt before (stuck in debian installer). I get the following error:
[12:38:08] RuntimeError: Host gitlab2003.wikimedia.org was not found in PuppetDB but --new was not set.
[12:38:08] Is it fine to run the cookbook with the --new flag then?
[12:38:30] hey jelto, yes that's ok
[12:38:49] it means that the host has been in a failed state for more than 2 weeks
[12:39:02] sorry, ignore my last line
[12:39:25] at the start of a reimage the host is removed from puppet/puppetdb
[12:39:34] so if it fails the host is no longer in puppet/puppetdb
[12:40:05] thanks! I'll try the reimage with --new then.
[12:53:31] SRE-tools, Infrastructure-Foundations, SRE, Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (Clement_Goubert) FWIW, the cookbook can be used, but it needs to be given the actual lvs servers to run on. Assuming `lvs1020` and `lvs2010` are secondaries, `lvs1...
[13:04:22] SRE-tools, Infrastructure-Foundations, SRE, Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (Volans) Thanks for the clarification @Clement_Goubert
[14:57:35] netbox, Infrastructure-Foundations: Should we have two versions of the Juniper QFX5120-48Y in Netbox? - https://phabricator.wikimedia.org/T331519 (cmooney) Just to note, the new Juniper QFX5120 devices in Eqiad (lsw1-e[5-7]-eqiad / lsw1-f[5-6]-eqiad) had been marked down as `QFX5100-48S-6Q`. Matching t...
[15:05:48] netops, Infrastructure-Foundations: Bring Juniper switches in eqiad racks E5-7 and F5-7 online and ready for servers - https://phabricator.wikimedia.org/T334230 (cmooney) p:Triage→Medium
[15:08:37] netops, Infrastructure-Foundations: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (cmooney)
[15:09:05] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (cmooney) >>! In T292095#8715082, @Jclark-ctr wrote: > @cmooney Racks e5-7 f5-7 have been cabled and racked do you want to use same ticket f...
[15:09:19] netops, Infrastructure-Foundations: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (cmooney) p:Triage→Low
[15:41:10] netops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (cmooney)
[15:51:28] netbox, Infrastructure-Foundations: Should we have two versions of the Juniper QFX5120-48Y in Netbox? - https://phabricator.wikimedia.org/T331519 (cmooney) Another thing we'll need to tidy up here is the LibreNMS report that checks what it sees from SNMP with the info from Netbox. {F36941647} Given the...
[17:15:00] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (wiki_willy) Would it be possible to use both serial number and/or asset tag for the match? I'll follow up with Julianne (she's currently out) regarding the formula being us...
[17:16:30] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (Volans) But if the asset tag comes from netbox it will not match anything for future hosts... as the host will not be anymore in Netbox :)
[17:19:48] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (Volans) @wiki_willy I've sent the above patch to match on serial instead of asset tag. LMK what do you want to do.
[17:32:32] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e7d20917-1f70-4c85-bea4-4fae89694441) set by cmooney@cumin1001 f...
[17:33:01] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=09fdc8d3-92d3-4c3b-8e46-8c1befa6a846) set by cmooney@cumin1001 f...
[17:48:22] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:55:54] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (wiki_willy) Open→Resolved Thanks @Volans. It looks like we're all set now. https://netbox.wikimedia.org/extras/reports/results/4443574/
[18:03:22] (SystemdUnitFailed) firing: (2) kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:18:22] (SystemdUnitFailed) firing: (3) kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:19:41] (SystemdUnitFailed) resolved: (3) kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:25:05] hi folks
[18:25:11] I know Netbox is bad with restoring stuff
[18:25:29] but I wanted to delete an older interface and ended up removing lvs3007 instead somehow (genuinely not sure how)
[18:25:44] anyway, it's a secondary so not a big deal in that sense, as it won't break prod or anything
[18:25:52] but is there any way to restore it?
[18:27:12] sukhe: I believe the only way is restoring a backup
[18:28:41] there seems to be a changelog option as well, https://wikitech.wikimedia.org/wiki/Netbox#Restore
[18:29:20] I managed to recreate the host and import from PuppetDB
[18:29:23] mostly looks good
[18:29:28] though I am missing a few things
[18:29:31] jhathaway: thanks for checking
[18:30:06] np
[18:40:46] ok I created it manually for now
[18:40:58] and reimported most of the stuff from PuppetDB
[18:41:12] I had a screenshot of the host as I always take them and I guess they came in handy
[18:42:27] just need to find a way to restore the cable and connection data for the mgmt interface
[18:48:37] essentially looking to restore from 113125
[18:48:42] 61f8a638-5ca2-4506-bfb6-66808e5a1ef0
[18:48:42] onwards
[18:48:52] nod
[18:49:54] our docs say you can do that via the api by applying in reverse order, but I have never tried that, nor do the docs provide much guidance
[18:50:15] yeah I think the restore of the db dump is what I have seen in the past
[18:51:30] sukhe: if you think that is necessary I'm happy to try and raise some folks, but I would be wary of trying it solo
[18:52:04] jhathaway: yeah me too
[18:52:10] I think this is unlikely to break anything
[18:52:15] and it is not, but I just want to clean it up
[18:52:29] I did ping vol.ans, let's see if someone else is around
[18:52:30] appreciate the help
[18:54:58] is it just the mgmt interface that needs restored?
[18:55:07] (or the cables for such, rather)
[18:55:09] just management yep
[18:55:17] everything else I did, well, minus the procurement ticket and such
[18:55:29] I had a screenshot of the other stuff because I knew this would happen one day :P
[18:56:00] I don't think that should affect anything, aside from maybe some automated (non-paging) alerts, so feel free to just file a ticket with details and let one of the Netbox experts sort it out in business hours
[18:56:19] ok thanks
[18:56:22] yeah I guess that makes sense
[18:56:39] sigh! genuinely don't know how I went from deleting a redundant interface to deleting the host
[18:56:45] I definitely misinterpreted the message
[18:56:47] or muscle memory
[18:56:50] not sure but well
[18:56:51] it happens :)
[18:57:32] seems like netbox should have very confirmation screens for deletion
[18:57:40] *scary confirmation
[18:57:44] in this case, it did have a confirmation
[18:57:49] and I interpreted it wrongly
[18:57:53] I thought it was deleting the interface
[18:58:24] ah, so perhaps *better confirmation screens* then!
[19:03:33] netbox, Infrastructure-Foundations, Traffic-Icebox: Restoring Netbox data for lvs3007 - https://phabricator.wikimedia.org/T334253 (ssingh)
[19:10:48] ok so I fixed the only alert left I think
[19:10:54] by going over the changelog
[19:11:01] rack: 31
[19:11:14] if you see the HTTP request it makes, this is OE16
[19:12:08] { "2": { "id": 31, "url": "https://netbox.wikimedia.org/api/dcim/racks/31/", "display": "OE16", "name": "OE16", "device_count": 17 } }
[19:12:11] ok great
[19:12:20] cool, it recovered
[19:12:30] sorry again folks but hopefully no more alerts for now, even though they were non-paging
[20:03:44] netbox, Infrastructure-Foundations, Traffic-Icebox: Restoring Netbox data for lvs3007 - https://phabricator.wikimedia.org/T334253 (ssingh) I have also made sure that there are no pending DNS changes by manually adding the DNS entries.
That just leaves us with the cabling.
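
Editor's note on the "applying in reverse order" idea mentioned at 18:49:54: nobody in the channel had tried it, so what follows is only a rough sketch of how such a restore plan might be ordered, not the documented procedure. It assumes changelog entries are available as plain dicts with `id`, `request_id`, `action`, `changed_object_type` and `prechange_data` keys (modelled loosely on NetBox change-log records; the exact field shapes are an assumption). The helper name `build_restore_plan` and all sample values are hypothetical, and the code makes no API calls.

```python
from typing import Any


def build_restore_plan(changes: list[dict[str, Any]], request_id: str) -> list[dict[str, Any]]:
    """List the deletions logged under one request, newest changelog entry first.

    Assumption: replaying deletions newest-first re-creates parent objects
    before the objects that depend on them.
    """
    deletions = [
        c for c in changes
        if c["request_id"] == request_id and c["action"] == "delete"
    ]
    # Highest changelog id first, i.e. reverse order of the original deletion.
    deletions.sort(key=lambda c: c["id"], reverse=True)
    return [
        {"object_type": c["changed_object_type"], "data": c["prechange_data"]}
        for c in deletions
    ]


# Hypothetical sample entries; ids, request ids and payloads are made up.
sample = [
    {"id": 1, "request_id": "req-a", "action": "delete",
     "changed_object_type": "dcim.cable", "prechange_data": {"label": "c1"}},
    {"id": 2, "request_id": "req-a", "action": "delete",
     "changed_object_type": "dcim.interface", "prechange_data": {"name": "mgmt"}},
    {"id": 3, "request_id": "req-a", "action": "delete",
     "changed_object_type": "dcim.device", "prechange_data": {"name": "host1"}},
    {"id": 4, "request_id": "req-b", "action": "update",
     "changed_object_type": "dcim.device", "prechange_data": {"name": "other"}},
]

plan = build_restore_plan(sample, "req-a")
for step in plan:
    print(step["object_type"], step["data"])
```

Each step would still have to be reviewed by hand before being re-created, since saved pre-change data can reference object ids that no longer exist.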