[06:49:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:54:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:23:56] (EdgeTrafficDrop) firing: 56% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:28:56] (EdgeTrafficDrop) resolved: (2) 66% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org
[10:10:56] (EdgeTrafficDrop) firing: 55% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[10:15:56] (EdgeTrafficDrop) resolved: 55% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[10:28:43] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 from inference.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T302265 (fgiunchedi)
[10:33:51] <_joe_> uh
[10:33:58] <_joe_> what's going on in codfw/eqiad?
[10:36:39] err
[10:36:50] I don't know if it's real or a dashboards artifact
[10:36:59] consider https://grafana.wikimedia.org/d/000000093/varnish-traffic?orgId=1&viewPanel=8&from=now-6h&to=now
[10:37:22] every time I hit reload the spike at 10:10 comes and goes randomly
[10:37:50] godog: ^^
[10:38:19] I was noting the same
[10:38:49] at least I'm as sane as volans \o/
[10:41:24] vgutierrez: I think it is an artifact due to prometheus1006 reboot
[10:41:35] I'd recommend switching the dashboard to use thanos as the data source
[10:41:43] Traffic, DNS, SRE: Need Assistance adding DNS records to claim domain - https://phabricator.wikimedia.org/T300076 (Sebastian_Berlin-WMSE)
[10:47:46] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi)
[10:50:56] godog: ack
[10:51:31] do we have documentation on that regard?
[10:51:55] * vgutierrez checking https://wikitech.wikimedia.org/wiki/Thanos
[10:52:17] yes that's the correct page
[10:54:04] so only dashboards that require a global view should be ported to Thanos, right?
[10:54:24] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi) p:Triage→Medium
[10:55:17] no any dashboard will benefit, perhaps the docs aren't clear about that
[10:56:54] so every dashboard should fetch data from thanos instead of the specific prometheus/ops site?
[10:58:07] yes that's correct, there's a bunch of benefits including caching and merging of results from redundant prometheus
[10:58:22] awesome
[11:07:59] Traffic, Observability-Metrics, SRE: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (Vgutierrez)
[11:08:27] kwakuofori: ^^ let's discuss this on the Traffic meeting later today please
[11:09:56] Traffic, Observability-Metrics, SRE: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (Vgutierrez) p:Triage→Medium
[11:15:38] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi)
[11:17:02] godog: I've created T302266 to track it internally
[11:17:03] T302266: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266
[11:22:53] vgutierrez: very nice, thank you sir
[12:04:39] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) @fgiunchedi Hey, also struggling somewhat with this. That IP is currently...
[12:12:56] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) Capture also containing requests from Prometheus1005 (10.64.0.82) which do...
[12:30:20] vgutierrez: Sure
[12:56:48] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) Capture of requests directly on primary interface, to get full Ethernet he...
[13:07:16] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi)
[15:18:09] Traffic: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 (MMandere)
[15:20:09] Traffic: Move Varnish6 from component to main - https://phabricator.wikimedia.org/T302301 (MMandere) p:Triage→Medium
[16:05:52] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi) I've tried reducing the workload on 1006 to test the theory that someho...
[16:27:18] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi) Following the "sth to do with icmp rate limit" lead I have: * temporar...
[16:30:25] netops, Infrastructure-Foundations: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (ayounsi) p:Triage→High
[16:30:52] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (BBlack) I don't have time to dive too deep but: consider there's also a ping-offloa...
[17:44:33] netops, Infrastructure-Foundations, SRE: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (cmooney) > 2/ Do AS path prepending to anycast prefixes learned directly from the core routers to match the AS path length on the new design infra. >So 10.3.0.1 on cr1-e...
[17:56:05] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) @bblack thanks for the input. We've validated our ping-offload is not inv...
[17:59:10] netops, Infrastructure-Foundations, SRE: Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355 (cmooney) Given it came up as part of an incident report I'll explicitly mention we need to consider our "network only" POPs, like eqord, as part of this. The key balance we ne...
[18:01:57] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (BBlack) What I mean is looking at a different layer of the ping-offload part: the c...
[18:06:37] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) @bblack ok I understand where you're coming from. We didn't see any of...
[18:17:27] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (BBlack) The relevant settings on the LVSes are in `modules/lvs/manifests/kernel_con...
[18:21:12] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (BBlack) Stepping back out to the broader question again though: I get why we normal...
[18:44:58] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) > The ratelimit sounds similar, but the difference is that it's per-target...
[20:16:57] (EdgeTrafficDrop) firing: 52% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org
[20:21:57] (EdgeTrafficDrop) resolved: 61% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org