[01:37:56] (HAProxyEdgeTrafficDrop) firing: (2) 61% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[01:42:56] (HAProxyEdgeTrafficDrop) resolved: (4) 63% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:46:56] (HAProxyEdgeTrafficDrop) firing: (5) 46% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:51:56] (HAProxyEdgeTrafficDrop) firing: (6) 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:56:56] (HAProxyEdgeTrafficDrop) resolved: (6) 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:54:56] (HAProxyEdgeTrafficDrop) firing: 61% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:59:56] (HAProxyEdgeTrafficDrop) resolved: 64% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:12:26] (HAProxyEdgeTrafficDrop) firing: 59% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:14:39] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey) We discussed this issue in the #serviceops channel yesterday, and the idea is to indeed use labels. The ML clusters...
[07:17:26] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:31:17] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10Joe) >>! In T306649#7883435, @elukey wrote: > This will change the topology of the BGP mesh though: some nodes we'll have multi...
[08:10:20] So I am still on my quest to set up the ML staging k8s and am now looking at the BGP bits.
Since this is going to be a private AS, I wondered if "allocation" for those is done by just adding them to https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations and then crafting a patch similar to this: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/661055 ?
[08:10:57] I.e. in this case 64608
[09:18:06] vgutierrez: quick question, I am looking to alloc a v6 subnet for ml staging k8s services in codfw. Looking at https://netbox.wikimedia.org/ipam/prefixes/229/prefixes/, I'd use 2620:0:860:302::/64. Sound good?
[09:18:26] topranks, XioNoX ^^
[09:19:56] I never know who to poke :D
[09:30:26] klausman: netops :D
[09:31:35] Well, how does one know who's in netops? The wmoffice team names aren't all that useful in that regard.
[09:31:51] 😅 topranks and XioNoX
[09:31:59] Mhm.
[09:34:35] klausman: hey, that is indeed us :)
[09:34:54] That block looks like the right one to use alright - it's already assigned in Netbox for that purpose?
[09:35:01] https://netbox.wikimedia.org/ipam/prefixes/387/prefixes/
[09:38:41] Are there equivalent ml staging k8s services in eqiad?
[09:39:19] on the network side we've it grouped into "k8s-ml" and "k8s-stage", so I'm unsure which is equivalent
[09:42:11] There is currently no equivalent cluster in eqiad, and we don't have plans for one
[09:42:45] 300/64 and 301/64 are "prod" (as opposed to staging)
[09:43:40] We don't have a pod v6 allocation, but I could make that 303 (or mimic the existing stuff and make pods 302, and services 303)
[11:34:09] 10netops, 10Infrastructure-Foundations, 10SRE, 10netbox: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10jbond) I think the best option is to use OIDC; however, that comes with a couple of caveats. 1) We don't currently have OIDC support enabled in CAS so there could be some teething i...
[12:07:34] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) Thanks for the updates. Sounds like a good plan! In terms of the configuration for the CR "nodeSelector" filter I thi...
[12:17:29] klausman: sorry for the delayed response.
[12:18:38] np :)
[12:18:45] the networks you assigned are fine anyway, makes sense
[12:19:06] In terms of the AS number, yes, 64608 seems reasonable
[12:19:20] We track those globally here: https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations
[12:19:57] Already added :)
[12:20:06] for now anyway. We also have homer-specific configuration for each here:
[12:20:07] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/templates/includes/customers/
[12:20:09] I presume Netbox doesn't know anything about ASes?
[12:20:38] No, we could add it as "config context" metadata, but we've not come to a conclusion on what the best way is.
[12:21:25] In terms of the homer config, we might need to assess if there are any specific requirements for this new cluster
[12:21:36] I'm guessing there probably aren't?
[12:22:38] Nah, it's just another k8s cluster in codfw. It's only "special" in what we as a team would run there/what names we point at it
[12:23:08] ok cool. what you have there is fine I think.
[12:23:13] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/661055 <- I'd basically mimic this change with different details
[12:23:18] Arzhel is back Friday; I will run it by him.
[12:23:40] Personally I'm not sure we need all the separate AS numbers / config snippets when a lot of it is the same.
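A minimal sketch, as an aside, of how the v6 plan discussed above (300/301 already "prod", 302 for pods and 303 for services as the staging candidates) could be sanity-checked with Python's ipaddress module. The prefix values come from the conversation; the dictionary keys are hypothetical labels for illustration, not real Netbox prefix names:

```python
# Sanity-check the proposed ml-staging /64s against the existing "prod"
# allocations mentioned in the log. Prefixes are from the conversation;
# the names are hypothetical labels, not Netbox data.
import ipaddress

prod = {
    "mlserve-prod-a": ipaddress.ip_network("2620:0:860:300::/64"),  # per the log: prod
    "mlserve-prod-b": ipaddress.ip_network("2620:0:860:301::/64"),  # per the log: prod
}
staging = {
    "mlstage-pods": ipaddress.ip_network("2620:0:860:302::/64"),  # candidate: pods
    "mlstage-svc": ipaddress.ip_network("2620:0:860:303::/64"),   # candidate: services
}

for name, net in staging.items():
    clashes = [p for p, pnet in prod.items() if net.overlaps(pnet)]
    assert not clashes, f"{name} ({net}) overlaps {clashes}"
    print(f"{name}: {net} is clear of the listed prod /64s")
```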
[12:23:50] ack.
[12:23:58] But it's already done that way, so let's stick with the pattern, and probably Arzhel knows the good reason it's done that way
[12:24:04] templating/DRYing out bits might be a good idea.
[12:25:19] that patch you linked is indeed the basic approach to take.
[12:25:45] The one exception is the templates/cr/bgp.conf file
[12:26:00] We refactored that slightly during the week
[12:27:10] The new shape of it is to have a file for the BGP peering like this one:
[12:27:12] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/templates/includes/bgp/k8s_mlserve.conf
[12:27:26] Which then gets 'included' in cr.conf:
[12:27:27] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/templates/cr/bgp.conf#109
[12:27:39] ah, ack. Sounds simple enough.
[12:28:02] And also templates/asw/bgp_overlay.conf here:
[12:28:02] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/templates/asw/bgp_overlay.conf#23
[12:28:16] I'm happy to have a look at that however
[12:29:43] Yeah, I have another patch (deployments/helm) that I need to get completed first and merged, and then I'll make one for the BGP stuff and have you and/or Arzhel review it.
[12:30:52] ok cool, any questions or problems just ping me
[12:33:17] My thinking on the naming would be to use "kubemlstage" in place of "kubemlserve", and otherwise follow that pattern
[12:41:06] yarp :)
[13:11:11] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) On reflection the above won't work if we're going to add the 'node-location' for all existing hosts, which I assume is...
[13:59:30] 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10dom_walden) Happening again: ` dwalden@deployment-mediawiki12:~$ sudo tail /var/log/apache2.log Apr 27 13:58:00 d...
[17:32:59] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10RobH)
[17:33:12] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10RobH)
[17:34:18] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10RobH) @ayounsi & @cmooney: Would one of you take care of disabling this atlas anchor on our RIPE account and if needed, rotating any private keys or creds that may...
[17:35:28] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10RobH)
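As a closing aside, a toy sketch of the homer include pattern described in the 12:25-12:28 exchange above: a per-cluster BGP snippet under templates/includes/bgp/ that the main router template pulls in with a single include line. Homer templates are Jinja2; the template bodies below are invented placeholders, "k8s_mlstage.conf" is the hypothetical staging counterpart of k8s_mlserve.conf, and only AS 64608 comes from the log:

```python
# Toy illustration of the homer include layout: the per-cluster file
# templates/includes/bgp/<cluster>.conf gets pulled into templates/cr/bgp.conf
# via a Jinja2 include. Bodies here are invented placeholders; k8s_mlstage.conf
# is hypothetical, modeled on the real k8s_mlserve.conf mentioned in the log.
from jinja2 import DictLoader, Environment

templates = {
    # hypothetical staging counterpart of includes/bgp/k8s_mlserve.conf
    "includes/bgp/k8s_mlstage.conf": (
        "group Kubemlstage {\n"
        "    /* neighbor/policy statements would go here */\n"
        "    peer-as {{ asn }};\n"
        "}\n"
    ),
    # stand-in for the one-line include added to templates/cr/bgp.conf
    "cr/bgp.conf": "{% include 'includes/bgp/k8s_mlstage.conf' %}\n",
}

env = Environment(loader=DictLoader(templates))
# 64608 is the private ASN allocated for ml-staging earlier in the log.
print(env.get_template("cr/bgp.conf").render(asn=64608))
```

Rendering cr/bgp.conf expands the cluster snippet in place, which is why adding a new cluster would presumably only need the new include file plus one include line each in cr/bgp.conf and asw/bgp_overlay.conf, matching the patch klausman plans to write.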