[08:44:28] topranks: do you have some time to discuss T379164?
[08:44:28] T379164: BGP settings for liberica - https://phabricator.wikimedia.org/T379164
[08:45:06] we want to replace pybal with liberica in ulsfo and for that I need the equivalent BGP communities
[08:53:26] moritzm: while working on https://gerrit.wikimedia.org/r/c/operations/alerts/+/1110843 we've detected that ldap-replica[1003,1004] often struggle to accept new connections: https://grafana.wikimedia.org/goto/mWl84MONR?orgId=1
[08:54:47] this also matches pybal healthcheck errors
[08:54:53] https://www.irccloud.com/pastebin/X0zyepM3/
[08:59:39] thanks, will have a look
[09:02:32] FWIW MSS is measured from ldap-replica instances.. so it's having issues even locally
[09:02:43] we also noticed that codfw instances aren't impacted
[09:04:59] yeah, the usage in eqiad is always much higher since we only have prod cloud there
[09:28:01] vgutierrez: hey yep happy to discuss the Liberica BGP stuff anytime
[09:28:31] topranks: the schema proposed in T354839 would work
[09:28:31] T354839: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839
[09:29:39] ok, so in this patch we added a new community - SELECTED_PATH
[09:29:40] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1084760
[09:29:48] Which gives us 3 levels:
[09:29:51] - higher than normal
[09:29:52] - normal
[09:29:56] - lower than normal
[09:30:11] T354839 is slightly different, but ultimately probably allows for the same kind of control
[09:30:16] it gives us:
[09:30:18] - normal
[09:30:22] - lower than normal
[09:30:26] - even lower / don't use
[09:30:45] so normal is no community string, right?
[09:31:05] yes
[09:31:12] ack
[09:31:19] making local-pref=100
[09:31:26] the others either skew that up or down
[09:31:42] liberica BGP peers in ulsfo should be ToR or the core routers?
[09:31:58] in ulsfo they should be the core routers, switches there are L2 only
[09:32:06] cool
[09:34:02] this is the current policy, you can see the preference is set in lines 5/6 and 8/9:
[09:34:03] https://phabricator.wikimedia.org/P72152
[09:35:18] a route that doesn't match either community gets default local-pref of 100
[09:35:48] numbers are confusing
[09:35:54] MED 0 -> local-pref 100
[09:36:08] MED 100 -> local-pref 50
[09:36:25] so the lower the MED the higher local-pref it gets
[09:36:40] in terms of local pref, higher == more priority, right?
[09:37:05] we love to make things confusing in the networking world :P
[09:37:25] local preference should be thought of as the "route priority" - the one with highest number wins
[09:37:38] MED should be thought of as "cost to reach destination" - the lowest one wins
[09:42:21] so yeah you are correct
[09:42:53] where is AVOIDED_PATH community string defined?
[09:43:09] templates/cr/policy-options.conf:community AVOIDED_PATH members {{ asn }}:0;
[09:43:13] is that the definition?
[09:43:26] it's in a different section of the config
[09:43:29] set policy-options community AVOIDED_PATH members 14907:0
[09:43:29] set policy-options community SELECTED_PATH members 14907:11
[09:43:55] yeah that's it in our homer template - "asn" is basically our public asn always in these cases
[09:44:12] right.. I always think of BGP communities as opaque strings
[09:44:29] it'll likely move from that template soon as we need to make it vendor neutral to support nokia
[09:44:33] the asn pattern is standard?
[09:44:39] yeah always
[09:44:45] basically they are a 32-bit number
[09:44:58] and the convention is the upper 16 bits are your ASN, then a colon, then the number
[09:45:12] given they are used on the internet and can traverse different ASNs that's the convention
[09:45:47] interesting
[09:45:56] how does that work with 32-bit ASNs?
[09:46:21] it doesn’t :D
[09:46:34] 🤯
[09:46:55] they have this now though so it does
[09:46:59] https://datatracker.ietf.org/doc/html/rfc8092
[09:47:26] I’m not 100% sure how widespread support has gotten. Probably fairly ok these days.
[09:47:36] it was a problem for some years for those networks
[09:49:07] so no community set for the primaries in ulsfo and 14907:0 for the secondary
[09:51:11] yes and 14907:11 to get higher preference than normal
[09:51:44] but let us know what set of controls is most useful to you, we can adjust the policy as needed
[09:52:25] topranks:
[09:52:28] so far it should be enough AFAIK
[09:53:19] actually for reimaging purposes it would make sense to have a local-pref 25 or 10 community
[09:53:46] cause while I'm reimaging a primary LVS I don't want the traffic flipping from the secondary to that lvs
[09:54:18] so temporarily setting the priority to something lower than the secondary LVS (14907:0 / local-pref 50) makes sense IMHO
[09:56:37] WDYT?
[09:57:49] yeah I'm cool with it - that was the idea in T354839
[09:57:49] T354839: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839
[09:58:00] and I think the use-case me and Sukhbir were discussing that led to that
[09:58:38] cool, but since then 14907:0 was defined there as 14907:1 :)
[09:58:56] yeah give me a few mins to work out the right naming and numbering for the communities
[09:59:04] awesome, thx <3
[09:59:13] I think AVOIDED_PATH / 14907:0 should stay the lowest
[09:59:21] we will introduce a BACKUP_PATH as per the task
[09:59:23] topranks: start looking for a nice place to get beers in ATL ;P
[09:59:27] for the regular backup node
[09:59:36] and avoided path can be used for the type of thing you mention above
[09:59:41] perfect
[09:59:43] haha sure thing :P
[10:49:09] Traffic: Get a WMF/SRE/Traffic GCP account - https://phabricator.wikimedia.org/T376477#10475674 (Vgutierrez) Open→Resolved a:Vgutierrez
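Editorial sketch of the two ideas discussed above: how a standard community such as 14907:0 packs into a 32-bit number, and how a route's local-pref decides which path wins. This is a rough illustration, not the actual router policy. The function and variable names are hypothetical, the local-pref of 150 for SELECTED_PATH is an assumed placeholder (the chat only says "higher than normal"), and real BGP best-path selection has further tie-breakers that are ignored here.

```python
# Hypothetical helper names; only the community values 14907:0 / 14907:11 and
# the local-pref values 100 (default) and 50 (14907:0) come from the chat.

def encode_community(text: str) -> int:
    """Pack 'ASN:value' into the 32-bit form: upper 16 bits ASN, lower 16 bits value."""
    asn, value = (int(part) for part in text.split(":"))
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        # A 32-bit ASN cannot fit here, which is the gap RFC 8092 addresses.
        raise ValueError(f"{text!r} does not fit a standard 32-bit community")
    return (asn << 16) | value


def decode_community(raw: int) -> str:
    """Inverse of encode_community."""
    return f"{raw >> 16}:{raw & 0xFFFF}"


DEFAULT_LOCAL_PREF = 100  # route matching no community
LOCAL_PREF_BY_COMMUNITY = {
    "14907:11": 150,  # SELECTED_PATH: assumed value, chat only says "higher than normal"
    "14907:0": 50,    # AVOIDED_PATH: "14907:0 / local-pref 50" per the chat
}


def local_pref(communities: list[str]) -> int:
    """Return the local-pref of the first known community, else the default."""
    for community in communities:
        if community in LOCAL_PREF_BY_COMMUNITY:
            return LOCAL_PREF_BY_COMMUNITY[community]
    return DEFAULT_LOCAL_PREF


def best_route(routes: dict[str, list[str]]) -> str:
    """Pick the announcing host whose route has the highest local-pref."""
    return max(routes, key=lambda host: local_pref(routes[host]))


if __name__ == "__main__":
    routes = {"primary": [], "secondary": ["14907:0"]}
    print(decode_community(encode_community("14907:0")))
    print(best_route(routes))
```

With no community on the primary and 14907:0 on the secondary, the primary's default local-pref of 100 beats the secondary's 50, which matches the ulsfo setup described above.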
[10:54:04] Traffic: issue unified cert using pki.goog - https://phabricator.wikimedia.org/T384195 (Vgutierrez) NEW
[12:27:01] vgutierrez: hey I put you down for a review on the patch to implement that
[12:27:02] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1112734
[12:27:32] with Arzhel out there isn't really anyone else to green-light it fully in terms of JunOS so I guess you'll need to trust me :)
[12:34:00] netops, Traffic, Infrastructure-Foundations, Patch-For-Review: BGP settings for liberica - https://phabricator.wikimedia.org/T379164#10475926 (cmooney) Ok as per the above patch the following communities can be set by Liberica, or will be set based on MED if route coming from PyBal |Community|Na...
[12:34:34] vgutierrez: This would give us 4 potential priority levels, see here:
[12:34:35] https://phabricator.wikimedia.org/T379164#10475926
[13:00:32] netops, Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839#10475985 (cmooney) >>! In T354839#10271470, @Vgutierrez wrote: > Gven the limitations to run pybal and liberica on t...
[15:13:55] Traffic: Allow acmecerts to deploy certificates in tmpfs storage - https://phabricator.wikimedia.org/T384227 (Fabfur) NEW
[18:21:03] netops, Infrastructure-Foundations, SRE: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253 (cmooney) NEW p:Triage→Medium
[18:21:49] netops, Infrastructure-Foundations, SRE: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253#10477517 (cmooney)
[18:21:50] netops, Infrastructure-Foundations, Sustainability (Incident Followup): Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355#10477518 (cmooney)
[18:23:09] netops, Infrastructure-Foundations, SRE: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253#10477523 (cmooney)
[19:50:47] netops, Infrastructure-Foundations, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258 (cmooney) NEW p:Triage→Medium
[19:54:15] netops, Infrastructure-Foundations, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10477780 (cmooney)
[19:54:32] netops, Infrastructure-Foundations, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10477781 (ssingh) Thanks for filing this task and looking into it! Just one more data point: this seems to have started Friday Jan 17 a...
[19:57:36] netops, Infrastructure-Foundations, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10477783 (ssingh) Might be a red herring: The only thing I see that might be close is https://sal.toolforge.org/log/h5lbdZQBKFqumxvtiNp...
[19:57:42] netops, Infrastructure-Foundations, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10477784 (cmooney) >>! In T384258#10477781, @ssingh wrote: > Thanks for filing this task and looking into it! Just one more data point:...
[20:08:33] netops, Infrastructure-Foundations, observability, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10477785 (cmooney) >>! In T384258#10477783, @ssingh wrote: > Might be a red herring: The only thing I see that might...
[20:23:58] netops, Traffic, Infrastructure-Foundations, SRE: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839#10477814 (cmooney) Open→Resolved Config is applied across the network now. Backup PyBal routes (where MED=100) are now gettin...
[20:26:50] netops, Infrastructure-Foundations, observability, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10477829 (cmooney)
[20:31:05] netops, Infrastructure-Foundations, SRE: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253#10477835 (cmooney)