[09:25:05] Traffic, SRE: haproxy tls terminator autobanning - https://phabricator.wikimedia.org/T306580 (Volans) p:Triage→Medium
[09:30:53] Hello, when you have a moment could you please set the priority of the following tasks: T305863, T305824, T305589, T304835. Thanks
[09:30:53] T305824: Provide a meaningful Retry-After value - https://phabricator.wikimedia.org/T305824
[09:30:54] T305589: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589
[09:49:56] (HAProxyEdgeTrafficDrop) firing: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:54:56] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:57:22] Traffic, Analytics, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) Thanks @Vgutierrez for bringing this to our attention. I agree that we should try to find the cause of these errors and eradicate it if at all...
[09:58:01] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) a:BTullis
[11:05:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2036:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[11:20:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2036:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[11:59:54] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) My suspicion is that these workers need more CPU and/or memory. We recently doubled the number of replica...
[12:06:02] Traffic, SRE, Patch-For-Review: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (ssingh) p:Triage→Medium
[13:26:24] netops, Infrastructure-Foundations, SRE, SRE-tools, Spicerack: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (Volans) Thanks for opening the task to discuss details. As the first feedback I've a primary question that is how you envision this new third way...
[13:27:32] volans: I updated the task, thanks. mmandere and I will be working on the reimaging next week and we can do the durum hosts later, maybe with the cookbooks
[13:27:42] so not high priority or urgent in that sense
[13:28:08] sukhe: thanks, as part of clinic duty I was asking around to set prio to tasks without it that have the SRE tag
[13:28:48] aaah! haha
[13:29:05] just that, nothing personal :)
[13:29:08] sorry, I misunderstood, I thought it was from the PoV of the specific task itself :)
[13:29:21] given that we were discussing the cookbook :)
[13:29:24] sorry for not specifying that
[13:40:10] netops, Infrastructure-Foundations, SRE, SRE-tools, Spicerack: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (ayounsi) Yeah, I'm expecting Netbox to always be the source of truth so a homer run after a spicerack run would be a NOOP. `junos-eznc` is what I...
[14:52:14] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) At the moment we are getting between ~30 and ~60 requests receiving 503 responses pe...
[15:42:39] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I have increased the amount of RAM available to the eventgate-analytics-external dep...
[16:04:57] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (Cmjohnson) @arzhel fixed the reboot issue, the external disk attached to the router was causing the reboots. I updated JUNOS to junos-srxsme-20.2R3-S2....
[16:29:53] Traffic, Data-Engineering, Event-Platform, SRE, and 2 others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (Krinkle)
[16:30:08] Traffic, Data-Engineering, Event-Platform, SRE, and 2 others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (Krinkle)
[16:46:17] netops, Infrastructure-Foundations: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (cmooney)
[16:50:15] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (Cmjohnson) a:Jclark-ctr→Cmjohnson
[16:50:51] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (Cmjohnson)
[16:57:25] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (ayounsi) Swap has been done successfully! Left to do: wipe the old one, rename the console server port of the new one.
[16:58:18] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (Cmjohnson) loaded, configuration file verified working moved cables to new mr1-eqiad left scs connection to old mr1 to wipe, still requires scs connecti...
[18:15:37] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) The RAM upgrade has not resulted in any improvement. {F35061891,width=60%}
[18:35:44] netops, Infrastructure-Foundations, SRE, Sustainability (Incident Followup): Add linecard diversity to the router-to-router interconnect in codfw - https://phabricator.wikimedia.org/T248506 (Krinkle)