[05:48:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:58:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:12:35] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (JMeybohm) >>! In T306649#7931058, @akosiaris wrote: >> Regarding the "fake nodes": I think that could be done with adding the le...
[06:29:06] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (ayounsi) > Plus, they are VMs and we have the same problem we have with the kask dedicated nodes (also VMs). Netbox doesn't have...
[08:18:51] Traffic, SRE, Patch-For-Review: Implement SLI measurement for HAProxy - https://phabricator.wikimedia.org/T307898 (Marostegui) p:Triage→Medium
[08:19:31] Traffic, RESTBase-API, SRE, Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Marostegui) p:Triage→Medium
[08:19:47] netops, Infrastructure-Foundations, observability: Grafana posting to http://wpt-graphite.wmftest.org:8080/ - https://phabricator.wikimedia.org/T307445 (Marostegui)
[08:20:32] netops, Cloud-VPS, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (Marostegui) p:Triage→Medium
[08:20:44] Traffic, Beta-Cluster-Infrastructure, SRE: Betacommons: 504, Connection Timed Out at 2022-05-02 13:35:16 GMT - https://phabricator.wikimedia.org/T307354 (Marostegui) p:Triage→Medium
[09:01:42] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (akosiaris) >>! In T306649#7933266, @JMeybohm wrote: >>>! In T306649#7931058, @akosiaris wrote: >>> Regarding the "fake nodes": I...
[13:56:51] Traffic, DC-Ops, SRE, ops-ulsfo: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (RobH) a:RobH→MoritzMuehlenhoff @MoritzMuehlenhoff, Can we plan to have ganeti4002 drained of activity for me on Thursday, May 19th, so I can swap out the defective memory stick?
[13:59:25] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (elukey) I merged two changes for the ml-serve-eqiad cluster, and now the concerns expressed in T306649#7881940 should be gone:...
[14:30:22] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (akosiaris) >>! In T306649#7934722, @elukey wrote: > I merged two changes for the ml-serve-eqiad cluster, and now the concerns ex...
[14:42:09] Traffic, SRE, Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (ssingh)
[15:44:47] Traffic, Beta-Cluster-Infrastructure, SRE: Betacommons: 504, Connection Timed Out at 2022-05-02 13:35:16 GMT - https://phabricator.wikimedia.org/T307354 (AlexisJazz) Right now it works, as usual with these it was a transient error.
[16:59:14] I am looking for reviews for these: https://gerrit.wikimedia.org/r/c/operations/puppet/+/791673 https://gerrit.wikimedia.org/r/c/operations/puppet/+/791678 it's about deleting expired globalsign and digicert certs. But maybe I asked before and there was a reason to keep them.
[17:00:00] it's just popping back up in the context of "are there certs that need monitoring and don't have it" in the context of an expired etcd cert. and from there I got to "delete all the expired certs"
[21:54:21] Traffic, DNS, Infrastructure-Foundations, SRE-tools: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (Volans) I've a local patch that I'm testing to perform the validation of the whole dataset (manual + netbox). The preliminary results are b...