[00:17:32] netops, Infrastructure-Foundations, SRE, ops-eqiad: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (cmooney) p:Triage→High
[00:21:38] netops, Infrastructure-Foundations, SRE: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (cmooney) p:Triage→Medium
[00:56:36] HTTPS, Traffic, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (ori) We got alerts about the Beta Cluster cert being close to expir...
[01:35:45] netops, Infrastructure-Foundations, SRE, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (cmooney) a:cmooney Thanks @ayounsi > One surprising point though is that the path through the...
[01:35:58] netops, Infrastructure-Foundations, SRE, Discovery-Search (Current work): Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (cmooney)
[02:36:50] netops, Infrastructure-Foundations, SRE, Discovery-Search (Current work): Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (cmooney) I ended up issuing this command: ` request app-engine service restart packet-forwarding-engin...
[04:11:38] (LVSHighCPU) firing: (2) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[04:16:38] (LVSHighCPU) resolved: (7) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[06:00:00] netops, Infrastructure-Foundations, SRE, Discovery-Search (Current work): Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038 (RKemper) (Following is just related to bringing these hosts back into service) Pooled the hosts: ` r...
[08:39:10] topranks: Thanks for the fix to T315038 !
[08:39:11] T315038: Overlay VRF / VXLAN traffic failure between lsw1-f2-eqiad and lsw1-f3-eqiad - https://phabricator.wikimedia.org/T315038
[09:17:51] gehel: no problem, there was a bit of luck in it but it was a serious issue.
[09:18:19] yeah, not a big deal on our side, but I could understand how this would be serious!
[09:18:23] Probably a bug triggered at some time during the installation and testing in those racks. We never specifically tested traffic between them.
[09:18:55] I’ve seen occasional bugs like this in the past - where switch hardware gets out of sync with the control plane
[09:19:17] I’d be confident enough it’s a very rare occurrence and wouldn’t be worried about a repeat
[09:19:37] But I’ll chase with Juniper, I have some logs etc from the time. They may have some recommendations
[09:33:57] Good luck!
[09:34:13] Chasing down vendors has never been fun in my experience!
[09:49:57] Traffic, SRE, Upstream: metric discrepancies between ATS 9.x and ATS 8.x - https://phabricator.wikimedia.org/T315064 (Vgutierrez)
[10:09:44] netops, Infrastructure-Foundations, SRE, netbox, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) > Netbox drives the infrastructure, and not the other way around. Fully agree that's best. But unfortunate...
[10:12:29] netops, Infrastructure-Foundations, SRE: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (cmooney) Having thought about it in more detail I think it's best to keep the multihop for the iBGP EVPN sessions. Reason being that even if a Leaf loses a Spine lin...
[10:13:27] netops, Infrastructure-Foundations, SRE: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (cmooney) @ayounsi be interested if you've any thoughts on that.
[12:18:03] netops, Infrastructure-Foundations, SRE: Enable OSPF Icinga check for EVPN based switches - https://phabricator.wikimedia.org/T315053 (ayounsi) yeah I agree +1 on having a stable iBGP capable of handling link failure. The OSPF adjacency check should be used but IIRC it assumes there are as many v4 s...
[12:30:25] netops, Infrastructure-Foundations, SRE, netbox, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (ayounsi) > Would a cookbook be an idea possibly? That we could run ourselves to update a specific network port to mat...
[14:41:14] Traffic, SRE, Patch-For-Review: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (ssingh)
[15:03:58] https://gerrit.wikimedia.org/r/c/operations/dns/+/816053 (geodns settings for asia) has been approved by sukhe but I don't know if any other team members should take a look before merging in, particularly since it alters some large countries
[15:07:32] eh, another lookthrough makes me pretty confident in these changes. The larger country changes are not really all that big a deal since they're in the back of the queue
[15:13:31] bblack: I'm pretty confident in deploying these changes. Are you around on the unlikely chance that something were to happen?
[15:13:33] I can take a peek!
[15:15:35] yeah looks good, and yes I'm here
[15:15:43] brett: ^
[15:15:45] excellent. Thanks so much
[15:16:28] it's interesting how some of the diffs highlight future work we could do, but that's not important for the deploy of this change, which is a net win :)
[15:17:24] basically, there's some evidence in there that we really should do a global run (all DCs against all targets), and then also look at the "core DCs" sublist and how it's placed too
[15:17:31] an example:
[15:17:34] BD => [eqsin, ulsfo, codfw, eqiad, drmrs, esams], # Bangladesh
[15:17:58] ^ the change here flipped drmrs and esams at the end based on the new latency results
[15:18:15] but basically, there's no way that ulsfo or the core DCs are beating drmrs/esams on latency here
[15:18:59] it should probably really be [eqsin, drmrs, esams, ulsfo, eqiad, codfw], but we don't really have data to back that up without a full run.
[15:19:37] absolutely, and it was harder to do this without the other DCs!
[15:20:27] (and for the very common cases where the core DCs are not at the end of the list, it might make sense to just terminate the list at the core DCs, modulo questioning whether that's ok operational practice... that we'd never depool the edge at both cores simultaneously, especially simultaneous to another edge in front of them also being out of the active set)
[15:21:43] seems reasonable
[15:22:34] yeah we have vague notions of policy and design constraints, like "Surely we'd never depool >2/N edges simultaneously", but we could stand to try to write that down as policy that we can design around.
[15:23:04] shorter DC lists would mean more countries have matching sets (by removing the trailing differences that probably never matter)
[15:23:34] more matching sets means the network-mapping optimizer can find more adjacent subnets with identical results to merge into larger supernets
[15:23:57] which in turn increases cache hitrate at resolvers which implement edns-client-subnet (like Google Public DNS, etc)
[15:26:45] the amount of such merging is reported in startup and config reload log outputs, e.g.
[15:26:55] authdns1001 gdnsd[18312]: plugin_geoip: map 'generic-map' runtime db updated. nets: 1169959 dclists: 14
[15:28:09] ^ means there's 14 unique datacenter lists in our current map, and that the merging/optimizing process reduced the global IP space down to ~1.2M subnets where the result changed at some boundary between neighbors.
[15:28:58] I smell some set theory
[15:30:17] Fortunately, I know a good mathematician should we need one :^)
[15:30:29] :P
[15:30:29] In any case, authdns-update has been run
[15:30:53] should take ~10 minutes for any graph impact to roll in, as the TTLs time out in various caches
[15:33:37] the same basic code is also used for vmod_netmapper that does our cloud subnet lists and such in varnish (it was copied over from a much older version of gdnsd)
[15:49:42] https://w.wiki/5ZhW <- can see the drmrs bump from that change here
[15:54:55] ah dang
[15:54:58] nice!
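The effect discussed above between 15:20 and 15:28 (shorter DC lists produce more identical results, which lets adjacent subnets merge into larger supernets) can be illustrated with a small sketch. This is not gdnsd's actual optimizer: the function names, the example subnets (documentation ranges), the third DC list and the rule "cut the list after the last core DC" are all assumptions made for illustration. Only the DC names and the two Bangladesh-style orderings come from the conversation.

```python
import ipaddress
from collections import defaultdict

# Assumption: eqiad and codfw are the "core DCs" referred to in the chat.
CORE_DCS = {"eqiad", "codfw"}


def truncate_at_core(dclist):
    """Keep entries up to and including the last core DC (one reading of
    'terminate the list at the core DCs'); lists without a core DC are kept whole."""
    last_core = max(
        (i for i, dc in enumerate(dclist) if dc in CORE_DCS),
        default=len(dclist) - 1,
    )
    return tuple(dclist[: last_core + 1])


def merge_map(net_to_dclist):
    """Merge adjacent subnets that resolve to the same DC list.

    Groups networks by DC list, then lets ipaddress.collapse_addresses
    combine neighbours within each group into supernets.
    """
    by_list = defaultdict(list)
    for net, dclist in net_to_dclist.items():
        by_list[tuple(dclist)].append(ipaddress.ip_network(net))
    merged = {}
    for dclist, nets in by_list.items():
        for supernet in ipaddress.collapse_addresses(nets):
            merged[str(supernet)] = dclist
    return merged


def summarize(net_to_dclist):
    """Return (nets, dclists), loosely mirroring the two counters in the log line."""
    merged = merge_map(net_to_dclist)
    return len(merged), len(set(merged.values()))


# Hypothetical map: two neighbouring /25s differ only in the trailing
# drmrs/esams order, i.e. after both core DCs.
EXAMPLE = {
    "192.0.2.0/25": ["eqsin", "ulsfo", "codfw", "eqiad", "drmrs", "esams"],
    "192.0.2.128/25": ["eqsin", "ulsfo", "codfw", "eqiad", "esams", "drmrs"],
    "198.51.100.0/24": ["esams", "drmrs", "eqiad", "codfw", "ulsfo", "eqsin"],
}

full = summarize(EXAMPLE)
short = summarize({net: truncate_at_core(dcs) for net, dcs in EXAMPLE.items()})
print("full lists:     nets=%d dclists=%d" % full)
print("truncated lists: nets=%d dclists=%d" % short)
```

With the full lists nothing merges, because every subnet has a distinct list. After truncation the two /25s share one list and collapse into a single /24, so both counters drop. Fewer, larger subnets are what would help edns-client-subnet caches, as noted in the chat.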
[16:04:53] Traffic, netops, Infrastructure-Foundations, SRE, Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (BCornwall)
[16:05:26] Traffic, SRE, Patch-For-Review: DRMRS: Geodns Configuration -- Phase 2 - https://phabricator.wikimedia.org/T311472 (BCornwall) In progress→Resolved Changes have been deployed for all three continents!
[17:54:31] those phrases that start with "we would never..." are a bit scary, since they may end up happening :D
[18:42:57] Acme-chief, SRE, Traffic-Icebox: Use acme-chief provided OCSP stapling responses - https://phabricator.wikimedia.org/T232988 (BCornwall) a:Vgutierrez @Vgutierrez since this was merged, can this ticket be closed?