[10:28:39] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 from inference.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T302265 (fgiunchedi)
[10:47:42] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi)
[10:54:20] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi) p:Triage→Medium
[11:15:19] godog: so the patch to spicerack got merged, you should be able to get CI working locally, just nuke .tox and rebase from master
[11:15:28] lmk if you have any issue, sorry for the trouble
[11:15:33] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi)
[11:16:36] volans: ack, thanks! I'm trying now, no worries though
[11:16:51] I've been busy with T302265 heh
[11:16:51] T302265: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265
[11:17:00] a riddle for me so far
[11:17:45] volans: I can confirm tox works as expected now, thanks again
[11:19:09] thank you, as for the icmp stuff, does one prometheus live in the same row as the lvs and the other not?
[11:19:22] you might find here fine people to help you on that side of things :-P
[11:20:58] mhh I haven't checked row distribution no, going to lunch shortly but I'll resume afterwards
[11:27:51] but no, lvs1019 is row C, prometheus1005 row A and prometheus1006 row B
[11:27:58] ok lunch now
[11:28:13] what's the issue?
[11:29:14] XioNoX: I've put a summary in T302265 but tl;dr can't get icmp echo replies from blackbox-exporter on prometheus1006 for some ips (but ping works)
[11:29:14] T302265: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265
[11:29:23] got lunch for reals now! ttyl
[11:33:35] you got my curiosity at "but ping works", I guess it's time to make some tea and dig
[11:38:55] :D
[11:49:36] how can I figure out what server answers those pings? where is that VIP routed to?
[11:50:10] lvs
[11:52:45] volans: I mean which real-server
[11:54:07] I would expect ICMP to be answered by the lvs itself and not being routed at all
[11:54:30] icmp is redirected to a special service for it
[11:54:52] ah right, I forgot about our ping sinkhole
[11:55:21] https://wikitech.wikimedia.org/wiki/Ping_offload
[11:56:06] volans: nah that's only for text-lb
[11:56:08] question_mark: but does that do offload also of internal pings?
[11:56:23] to svc IPs that is
[11:56:51] arzhel better answer that, he set it up I think :)
[11:57:14] yeah, but that's only for text-lb from external
[12:04:35] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) @fgiunchedi Hey, also struggling somewhat with this. That IP is currently...
[12:04:48] godog, XioNoX: I updated the task there ^^
[12:05:09] Unfortunately I am lost as to what is happening.
[12:05:56] TL;DR echo req's always get to lvs1019 over primary interface, but it is not responding to the ones generated by the Prometheus exporter
[12:08:21] topranks: did you check the ICMP IDs on prometheus1005? (as it works)
[12:08:41] I didn't no
[12:12:52] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) Capture also containing requests from Prometheus1005 (10.64.0.82) which do...
[12:12:55] Uploaded another PCAP there.
[12:13:18] IDs are different in requests from prometheus1005 (as you'd expect), but structure is basically the same.
[12:14:04] also btw - for anyone looking at the pcap - it looks like lvs1019 responds twice, it does not.
[12:14:27] Reason is, due to asymmetric reply, I need to capture with "-i all" as interface for tcpdump.
[12:14:45] Which means it captures the packet once on the vlan sub-interface (no 802.1q header)
[12:14:55] And again on the parent physical (with Vlan tag)
[12:21:51] does the issue happen when restarting blackbox exporter? (it sets the ICMP ID based on the PID)
[12:21:54] godog: ^
[12:24:00] https://github.com/prometheus/blackbox_exporter/blob/master/prober/icmp.go#L50
[12:28:11] XioNoX: yeah I was wondering that, and half expecting it might work if we restarted process.
[12:32:43] Looking at these vars based on a discussion me and volan.s were having, I'm wondering if this rate limit stuff is somehow part of the problem:
[12:32:44] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/lvs/manifests/kernel_config.pp#49
[12:34:10] seems to be a global setting - so unlikely to be the issue (given we are seeing problem only from one host)
[12:34:31] or it's a kernel bug :)
[12:41:54] Maybe rate limiting is done on the ICMP ID?
[12:42:08] the ping command uses a different PID as it's a new command each time
[12:43:56] thanks for taking a look XioNoX topranks !
[12:44:10] I just bounced prometheus-blackbox-exporter on prometheus1006 to test
[12:44:17] IME it doesn't change
[12:44:41] it == the result
[12:46:00] yeah so bizarre
[12:46:16] I'm looking at the errors on prometheus1006 with this journalctl -u prometheus-blackbox-exporter.service -f | grep -i 'icmp_.*probe failed' | cut -d' ' -f6-
[12:53:54] I think the real thing we need to understand is why lvs1019 doesn't respond.
[12:54:05] The ICMP IDs did change after the restart.
[12:54:51] XioNoX's theory about rate limiting might still be right though - if perhaps it's the source IP that it's looking at?
[12:55:07] But I guess that doesn't make sense cos the manual ping works
[12:55:16] yeah and prometheus1005
[12:56:44] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) Capture of requests directly on primary interface, to get full Ethernet he...
[12:57:00] you're welcome btw! /s
[12:57:32] in case we were short on riddles
[12:57:54] haha thanks godog :)
[13:01:07] also still unclear to me why 10.2.2.63 or 10.2.2.60 or 10.2.2.12 only seem to be affected
[13:01:16] or at least persistently affected
[13:04:29] ah, it's not all the .svc. endpoints?
[13:05:16] no :( probes go out for all services in service::catalog
[13:05:26] let me clarify that in the task
[13:06:23] godog: same amount of probes from prometheus1005?
[13:07:12] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi)
[13:07:16] yeah workload is the same XioNoX
[15:29:14] for the curious, I think it has to do with icmp rate limit; I've changed a little bit the list of targets on prometheus1006 and I don't see the probe failures anymore
[15:29:40] have a meeting now, will keep looking later
[16:02:11] yeah getting failures for 10.2.2.60 only now
[16:05:49] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi) I've tried reducing the workload on 1006 to test the theory that someho...
[16:27:14] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi) Following the "sth to do with icmp rate limit" lead I have: * temporar...
[16:30:25] netops, Infrastructure-Foundations: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (ayounsi) p:Triage→High
[16:30:48] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (BBlack) I don't have time to dive too deep but: consider there's also a ping-offloa...
[17:44:33] netops, Infrastructure-Foundations, SRE: Suboptimal anycast routing from leaf switches - https://phabricator.wikimedia.org/T302315 (cmooney) > 2/ Do AS path prepending to anycast prefixes learned directly from the core routers to match the AS path length on the new design infra. >So 10.3.0.1 on cr1-e...
[17:56:01] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) @bblack thanks for the input. We've validated our ping-offload is not inv...
[17:59:10] netops, Infrastructure-Foundations, SRE: Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355 (cmooney) Given it came up as part of an incident report I'll explicitly mention we need to consider our "network only" POPs, like eqord, as part of this. The key balance we ne...
[18:01:53] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (BBlack) What I mean is looking at a different layer of the ping-offload part: the c...
[18:06:33] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) @bblack ok I understand where you're coming from. We didn't see any of...
[18:17:23] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (BBlack) The relevant settings on the LVSes are in `modules/lvs/manifests/kernel_con...
[18:21:08] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (BBlack) Stepping back out to the broader question again though: I get why we normal...
[18:44:54] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (cmooney) > The ratelimit sounds similar, but the difference is that it's per-target...
[20:06:45] SRE-tools, Discovery, Discovery-Search, Infrastructure-Foundations, IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (Gehel)
[20:07:24] SRE-tools, Discovery, Discovery-Search, Infrastructure-Foundations, IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (Gehel) p:Triage→High
[22:50:14] Hello all! I'm trying to understand this function:
[22:50:16] https://www.irccloud.com/pastebin/n36qrG8S/
[22:50:50] It raises an exception if the host has been up too long, correct?
[22:51:05] (where in theory 'too long' means 'since before we tried to reboot it' I guess?)
[23:16:03] ok, now I care less -- that function raised an exception during a debian install but retrying it seems fine *shrug*