[01:37:56] (HAProxyEdgeTrafficDrop) firing: (2) 61% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[01:42:56] (HAProxyEdgeTrafficDrop) resolved: (4) 63% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:46:56] (HAProxyEdgeTrafficDrop) firing: (5) 46% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:51:56] (HAProxyEdgeTrafficDrop) firing: (6) 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:56:56] (HAProxyEdgeTrafficDrop) resolved: (6) 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:54:56] (HAProxyEdgeTrafficDrop) firing: 61% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:59:56] (HAProxyEdgeTrafficDrop) resolved: 64% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:12:26] (HAProxyEdgeTrafficDrop) firing: 59% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:14:39] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey) We discussed this issue in the #serviceops channel yesterday, and the idea is to indeed use labels. The ML clusters...
[07:17:26] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:31:17] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10Joe) >>! In T306649#7883435, @elukey wrote: > This will change the topology of the BGP mesh though: some nodes we'll have multi...
[08:10:20] So I am still on my quest to set up the ML staging k8s and am now looking at the BGP bits.
Since this is going to be a private AS, I wondered if "allocation" for those is done by just adding them to https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations and then crafting a patch similar to this: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/661055 ?
[08:10:57] I.e. in this case 64608
[09:18:06] vgutierrez: quick question, I am looking to alloc a v6 subnet for ml staging k8s services in codfw. Looking at https://netbox.wikimedia.org/ipam/prefixes/229/prefixes/, I'd use 2620:0:860:302::/64. Sound good?
[09:18:26] topranks, XioNoX ^^
[09:19:56] I never know who to poke :D
[09:30:26] klausman: netops :D
[09:31:35] Well, how does one know who's in netops? The wmoffice team names aren't all that useful in that regard.
[09:31:51] 😅 topranks and XioNoX
[09:31:59] Mhm.
[09:34:35] klausman: hey, that is indeed us :)
[09:34:54] That block looks like the right one to use alright - it's already assigned in Netbox for that purpose?
[09:35:01] https://netbox.wikimedia.org/ipam/prefixes/387/prefixes/
[09:38:41] Are there equivalent ml staging k8s services in eqiad?
[09:39:19] on the network side we've it grouped into "k8s-ml" and "k8s-stage", so I'm unsure which is equivalent
[09:42:11] There is currently no equivalent cluster in eqiad, and we don't have plans for one
[09:42:45] 300/64 and 301/64 are "prod" (as opposed to staging)
[09:43:40] We don't have a pod v6 allocation, but I could make that 303 (or mimic the existing stuff and make pods 302, and services 303)
[11:34:09] 10netops, 10Infrastructure-Foundations, 10SRE, 10netbox: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10jbond) I think the best option is to use OIDC; however, that comes with a couple of caveats. 1) We don't currently have OIDC support enabled in CAS so there could be some teething i...
[12:07:34] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) Thanks for the updates. Sounds like a good plan! In terms of the configuration for the CR "nodeSelector" filter I thi...
[12:17:29] klausman: sorry for the delayed response.
[12:18:38] np :)
[12:18:45] the networks you assigned are fine anyway, makes sense
[12:19:06] In terms of the AS number, yes, 64608 seems reasonable
[12:19:20] We track those globally here: https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations
[12:19:57] Already added :)
[12:20:06] for now anyway. We also have homer-specific configuration for each here:
[12:20:07] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/templates/includes/customers/
[12:20:09] I presume Netbox doesn't know anything about ASes?
[12:20:38] No, we could add it as "config context" metadata, but we've not come to a conclusion on what the best way is.
[12:21:25] In terms of the homer config, we might need to assess if there are any specific requirements for this new cluster
[12:21:36] I'm guessing there probably aren't?
[12:22:38] Nah, it's just another k8s cluster in codfw. It's only "special" in what we as a team would run there/what names we point at it
[12:23:08] ok cool. what you have there is fine I think.
[12:23:13] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/661055 <- I'd basically mimic this change with different details
[12:23:18] Arzhel is back Friday; I will run it by him.
[12:23:40] Personally I'm not sure we need all the separate AS numbers / config snippets when a lot of it is the same.
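A minimal sketch, as an aside, of how the v6 plan discussed above (300/301 already "prod", 302 for pods and 303 for services as the staging candidates) could be sanity-checked with Python's ipaddress module. The prefix values come from the conversation; the dictionary keys are hypothetical labels for illustration, not real Netbox prefix names:

```python
# Sanity-check the proposed ml-staging /64s against the existing "prod"
# allocations mentioned in the log. Prefixes are from the conversation;
# the names are hypothetical labels, not Netbox data.
import ipaddress

prod = {
    "mlserve-prod-a": ipaddress.ip_network("2620:0:860:300::/64"),  # per the log: prod
    "mlserve-prod-b": ipaddress.ip_network("2620:0:860:301::/64"),  # per the log: prod
}
staging = {
    "mlstage-pods": ipaddress.ip_network("2620:0:860:302::/64"),  # candidate: pods
    "mlstage-svc": ipaddress.ip_network("2620:0:860:303::/64"),   # candidate: services
}

for name, net in staging.items():
    clashes = [p for p, pnet in prod.items() if net.overlaps(pnet)]
    assert not clashes, f"{name} ({net}) overlaps {clashes}"
    print(f"{name}: {net} is clear of the listed prod /64s")
```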
[12:23:50] ack.
[12:23:58] But it's already done that way, so let's stick with the pattern, and probably Arzhel knows the good reason it's done that way
[12:24:04] templating/DRYing out bits might be a good idea.
[12:25:19] that patch you linked is indeed the basic approach to take.
[12:25:45] The one exception is the templates/cr/bgp.conf file
[12:26:00] We refactored that slightly during the week
[12:27:10] The new shape of it is to have a file for the BGP peering like this one:
[12:27:12] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/templates/includes/bgp/k8s_mlserve.conf
[12:27:26] Which then gets 'included' in cr.conf:
[12:27:27] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/templates/cr/bgp.conf#109
[12:27:39] ah, ack. Sounds simple enough.
[12:28:02] And also templates/asw/bgp_overlay.conf here:
[12:28:02] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/templates/asw/bgp_overlay.conf#23
[12:28:16] I'm happy to have a look at that however
[12:29:43] Yeah, I have another patch (deployments/helm) that I need to get completed first and merged, and then I'll make one for the BGP stuff and have you and/or Arzhel review it.
[12:30:52] ok cool, any questions or problems just ping me
[12:33:17] My thinking on the naming would be to use "kubemlstage" in place of "kubemlserve", and otherwise follow that pattern
[12:41:06] yarp :)
[13:11:11] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) On reflection the above won't work if we're going to add the 'node-location' for all existing hosts, which I assume is...
[13:59:30] 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10dom_walden) Happening again: ` dwalden@deployment-mediawiki12:~$ sudo tail /var/log/apache2.log Apr 27 13:58:00 d...
[17:32:59] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10RobH)
[17:33:12] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10RobH)
[17:34:18] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10RobH) @ayounsi & @cmooney: Would one of you take care of disabling this atlas anchor on our RIPE account and if needed, rotating any private keys or creds that may...
[17:35:28] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10RobH)
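As a closing aside, a toy sketch of the homer include pattern described in the 12:25-12:28 exchange above: a per-cluster BGP snippet under templates/includes/bgp/ that the main router template pulls in with a single include line. Homer templates are Jinja2; the template bodies below are invented placeholders, "k8s_mlstage.conf" is the hypothetical staging counterpart of k8s_mlserve.conf, and only AS 64608 comes from the log:

```python
# Toy illustration of the homer include layout: the per-cluster file
# templates/includes/bgp/<cluster>.conf gets pulled into templates/cr/bgp.conf
# via a Jinja2 include. Bodies here are invented placeholders; k8s_mlstage.conf
# is hypothetical, modeled on the real k8s_mlserve.conf mentioned in the log.
from jinja2 import DictLoader, Environment

templates = {
    # hypothetical staging counterpart of includes/bgp/k8s_mlserve.conf
    "includes/bgp/k8s_mlstage.conf": (
        "group Kubemlstage {\n"
        "    /* neighbor/policy statements would go here */\n"
        "    peer-as {{ asn }};\n"
        "}\n"
    ),
    # stand-in for the one-line include added to templates/cr/bgp.conf
    "cr/bgp.conf": "{% include 'includes/bgp/k8s_mlstage.conf' %}\n",
}

env = Environment(loader=DictLoader(templates))
# 64608 is the private ASN allocated for ml-staging earlier in the log.
print(env.get_template("cr/bgp.conf").render(asn=64608))
```

Rendering cr/bgp.conf expands the cluster snippet in place, which is why adding a new cluster would presumably only need the new include file plus one include line each in cr/bgp.conf and asw/bgp_overlay.conf, matching the patch klausman plans to write.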