[04:05:44] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (AlexisJazz)
[04:10:16] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (AlexisJazz)
[04:25:18] Traffic, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (AlexisJazz) Hmm, it works again.
[09:50:24] Traffic, Beta-Cluster-Infrastructure, SRE, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (dom_walden) I am also experiencing unreliability. Particularly when trying to save edits. In logstash I am seein...
[10:28:34] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1088.eqiad.wmnet with OS buster
[10:33:35] Traffic, SRE: Remove component/varnish6 repo reference in Varnish Test Dockerfile - https://phabricator.wikimedia.org/T302579 (MMandere) Open→Resolved a:MMandere Varnish Containerized test correctly pulls packages from `main` component and has dropped `component/varnish6` from the repolist.
[10:34:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1088:9331 is unreachable - https://alerts.wikimedia.org
[10:37:26] ^^ expected
[10:38:09] Traffic, Beta-Cluster-Infrastructure, SRE, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (Vgutierrez) ` root@deployment-cache-text06:/var/log/trafficserver# for i in {1..5}; do nc -zv -w 5 deployment-med...
[10:49:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1088:9331 is unreachable - https://alerts.wikimedia.org
[10:50:26] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1088:9331 is unreachable - https://alerts.wikimedia.org
[10:55:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1088:9331 is unreachable - https://alerts.wikimedia.org
[10:56:26] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1088:9331 is unreachable - https://alerts.wikimedia.org
[10:57:28] Traffic, Beta-Cluster-Infrastructure, SRE, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (Vgutierrez) apache2 is currently screaming on deployment-mediawiki11: ` Feb 28 10:54:28 deployment-mediawiki11 a...
[11:06:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1088:9331 is unreachable - https://alerts.wikimedia.org
[11:12:53] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1088.eqiad.wmnet with OS buster c...
[11:14:05] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez)
[11:29:27] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5011.eqsin.wmnet with OS buster
[11:36:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5011:9331 is unreachable - https://alerts.wikimedia.org
[11:41:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp5011:9331 is unreachable - https://alerts.wikimedia.org
[12:13:38] godog: I'm trying to measure TTFB data from pybal<-->varnish healthchecks on the TLS termination layer. that implies that somehow I need to filter by host header. But I'm assuming that using the host header would increase cardinality to a painful level for that metric
[12:14:15] so IMHO adding a new metric that only delivers TTFB for varnishcheck.wikimedia.org would be a better solution
[12:14:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5011:9331 is unreachable - https://alerts.wikimedia.org
[12:15:00] hmmm healthcheck.wikimedia.org actually :)
[12:19:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp5011:9331 is unreachable - https://alerts.wikimedia.org
[12:25:15] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5011.eqsin.wmnet with OS buster c...
[12:39:35] vgutierrez: +1 on a single metric yeah, since host: is user-supplied putting it into a metric is a no-no as you guessed
[15:33:04] Traffic, Observability-Metrics, SRE: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (MMandere) Some dashboards, e.g. https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1 have their datasource set to `[eqiad codfw] prometheus/global` and contains user defi...
[15:43:14] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/766771/ makes sense to you?
[15:47:12] vgutierrez: yep
[20:32:24] Traffic, SRE, SRE Observability, User-ema: Investigate cp5006 crash - https://phabricator.wikimedia.org/T292506 (herron)
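
Note on the 12:13:38 to 12:39:35 exchange: the concern there is that a client-controlled Host header used as a metric label creates one time series per distinct value, while a dedicated metric for healthcheck.wikimedia.org stays at a single series. A minimal Python sketch of that difference follows. It is an illustration only, not the content of the Gerrit change linked above; the function names and the sample data are hypothetical.

```python
# Hypothetical sketch: names and data are illustrative, not from the puppet change.
HEALTHCHECK_HOST = "healthcheck.wikimedia.org"  # host named in the 12:15:00 message


def host_label_series(requests):
    """Count the distinct series that a per-Host label would create."""
    return len({host.lower() for host, _ttfb in requests})


def healthcheck_ttfb_samples(requests, healthcheck_host=HEALTHCHECK_HOST):
    """Keep TTFB samples for the healthcheck host only (one fixed series)."""
    return [ttfb for host, ttfb in requests if host.lower() == healthcheck_host]


if __name__ == "__main__":
    # Made-up (host, ttfb_seconds) pairs. Host is supplied by the client,
    # so the number of distinct values is unbounded in practice.
    sample = [
        ("healthcheck.wikimedia.org", 0.004),
        ("en.wikipedia.org", 0.120),
        ("made-up-host-1.example", 0.300),
        ("made-up-host-2.example", 0.250),
        ("Healthcheck.Wikimedia.org", 0.006),
    ]
    print("series with a Host label:", host_label_series(sample))
    print("healthcheck TTFB samples:", healthcheck_ttfb_samples(sample))
```

The first count grows with every new Host value a client sends; the second list always feeds exactly one metric, which is the property the chat settles on.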