[15:00:11] FYI: Warning: Duplicate definition found for service 'check_ipsec' on host 'fran2001' (config file '/etc/icinga/objects/nsca_frack.cfg', starting on line 833)
[15:00:15] on icinga
[15:53:26] there are problems with icinga not properly handling downtime, see backscroll in #wikimedia-sre
[16:22:26] Hey! I'm trying to make the CategoriesQueryServiceUpdateLagTooHigh alert notify less often (see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-wikidata-platform/blazegraph.yaml#62 / T420235). It seems that notification intervals are not something that can be configured per alert. Am I missing something?
[16:22:27] T420235: CategoriesQueryServiceUpdateLagTooHigh is generating too much noise - https://phabricator.wikimedia.org/T420235
[17:14:59] hi folks.. I'm considering deploying an IPIP check for all realservers configured to handle IPIP inbound traffic. I'd need to deploy it on some sort of central host per site, as it doesn't work on the loopback interface, so 127.0.0.1 is out of the question. Do you see any potential issues that should prevent doing this? Do we already have some checks puppetized like this?
[17:15:24] the prometheus hosts on each DC are the obvious target I'm thinking of
[17:18:00] or should I use a good old icinga check for this?
[17:21:18] gehel: if you search through modules/alertmanager/templates/alertmanager.yml.erb for 'repeat_interval' there are some examples to borrow from
[17:22:31] vgutierrez: what would the nature of the check be, a port probe?
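For context on the repeat_interval pointer above: in Alertmanager, how often a still-firing alert is re-notified is a property of a *route* in alertmanager.yml, not of the alert rule itself, which is why it can't be set per alert in the rules file. A minimal sketch of such a route (the receiver name and interval here are hypothetical, not taken from the actual alertmanager.yml.erb):

```yaml
route:
  routes:
    # Route matching the noisy alert; repeat_interval controls how often a
    # still-firing alert is re-sent (otherwise inherited from the parent route).
    - matchers:
        - alertname = "CategoriesQueryServiceUpdateLagTooHigh"
      receiver: wikidata-platform-team   # hypothetical receiver name
      repeat_interval: 24h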
[17:23:53] sending an IPIP/IPIP6 encapsulated SYN packet
[17:24:07] (and observing the SYN+ACK response)
[17:24:38] the check code in python is basically this
[17:24:46] https://www.irccloud.com/pastebin/TX6dvD9w/
[17:27:40] in some PoC tests it looks like the kernel doesn't like to receive encapsulated traffic on 127.0.0.1
[17:27:47] and I don't blame it TBH
[17:31:02] herron: could it make sense to hack the existing port probe and implement IPIP/IP6IP6 encapsulation support in there?
[17:48:43] or add an additional type of probe, if that makes things easier, of course
[17:48:51] what are we using at the moment for the tcp probes?
[17:49:14] vgutierrez: hmm, I'm leaning towards either the node/textfile exporter or a custom exporter, depending on whether hosts can self-report or it needs to be probed from a central system. Forking the blackbox exporter could work too, but I'm thinking it's better to keep it decoupled so they can be upgraded independently. I looked around a bit for an exporter that already covers this, but no luck so far
[17:49:21] blackbox exporter
[17:49:52] the approach I'm finding is monitoring over the tunnel and essentially assuming it works, which seems not great
[18:08:37] herron: as mentioned above, hosts can't self-report
[18:13:36] ah ok, didn't catch that. I saw loopback and 127.0.0.1 are out, but testing locally via eno and those configured IPs is out as well?
[18:15:40] yes
[18:16:10] https://www.irccloud.com/pastebin/7TTuhZgW/
[18:16:28] quick test.. ncredir6001 targeting itself fails; it works as expected when targeting ncredir6002
[18:18:42] ok, so there are a couple of options for getting these into metrics. We have the blackbox exporter, where a short-lived script can push metrics, or this type of probe could be made into its own prom exporter and scraped by prometheus, and there's the fork-blackbox option. Roughly in order of level of effort
[18:19:17] sorry, not blackbox exporter, push gateway
[18:19:39] so what would be a good target to deploy the script?
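The pastebins above aren't preserved, so purely as an illustration of the technique being discussed, here is a sketch of what constructing an IPIP-encapsulated SYN might look like: an outer IPv4 header with protocol 4 (IP-in-IP) wrapping an inner IPv4 header plus a TCP SYN. All addresses and ports are placeholders, checksums are left to the kernel, and actually sending the packet would need a raw socket with CAP_NET_RAW:

```python
import socket
import struct

IPPROTO_IPIP = 4  # protocol number for IP-in-IP encapsulation (RFC 2003)

def ipv4_header(src: str, dst: str, proto: int, payload_len: int) -> bytes:
    """Minimal IPv4 header. Checksum is left 0; on Linux the kernel fills
    it in when sending over a raw socket with IP_HDRINCL."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45,                 # version 4, IHL 5 (20-byte header)
        0,                    # DSCP/ECN
        20 + payload_len,     # total length
        0, 0,                 # identification, flags/fragment offset
        64,                   # TTL
        proto,                # protocol of the payload
        0,                    # header checksum (0 = kernel computes)
        socket.inet_aton(src),
        socket.inet_aton(dst),
    )

def tcp_syn(sport: int, dport: int) -> bytes:
    """Minimal TCP SYN header (checksum omitted for illustration)."""
    return struct.pack(
        "!HHIIBBHHH",
        sport, dport,
        0, 0,                 # sequence and ack numbers
        5 << 4,               # data offset: 5 words, no options
        0x02,                 # flags: SYN
        65535,                # window
        0, 0,                 # checksum, urgent pointer
    )

def build_ipip_syn(outer_src, outer_dst, inner_src, inner_dst, dport):
    """Outer IP (proto 4) wrapping inner IP (proto 6) + TCP SYN."""
    tcp = tcp_syn(54321, dport)
    inner = ipv4_header(inner_src, inner_dst, socket.IPPROTO_TCP, len(tcp)) + tcp
    outer = ipv4_header(outer_src, outer_dst, IPPROTO_IPIP, len(inner))
    return outer + inner

# Hypothetical probe: outer dst is the realserver, inner dst is the service VIP.
pkt = build_ipip_syn("10.0.0.1", "10.0.0.2", "10.0.0.1", "198.51.100.1", 443)
```

Observing the SYN+ACK would then mean reading from a raw socket and matching the response to the inner destination/port; this also illustrates why the probe has to run off-host, since the kernel won't decapsulate this over loopback.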
[18:20:07] it can be done centrally as well... so a puppetserver could test ALL realservers in any DC
[18:20:18] or an alerts host
[18:21:51] I'd say we start with the push gateway, since it seems like prometheus-check-ipip is most of the way there already, and if we run into issues with that down the line, spend more time and build it out into an exporter
[18:22:42] and yeah, an alerts host probably makes sense
[18:24:05] yeah.. the check is basically done
[18:24:37] I need to hack the puppet queries to fetch all the realservers, their service IPs, and so on
[18:25:18] that would have been way easier if self-reporting worked
[18:25:25] but I'm not that lucky
[18:25:28] thx herron <3
[18:26:08] haha classic. there might be some hints in modules/grafana/files/grafana-datasource-exporter.py or modules/prometheus/files/pdb_resource_exporter.py re: push gateway plumbing
[18:26:53] the current port probe runs in a centralized fashion?
[18:27:23] if that's the case.. what's the relevant puppet manifest to fetch all the servers behind a service?
[18:29:30] yeah, central from the prom hosts; prometheus::targets::service_catalog should be a good place to branch out from
[18:30:01] back in a few, school pickup time
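Since the conversation settled on the push gateway, a rough sketch of the plumbing a short-lived check script would need: render results in the Prometheus text exposition format and PUT them to the gateway's job-grouped endpoint. The metric name, label names, and gateway address below are made up for illustration, not taken from prometheus-check-ipip:

```python
import urllib.request

def format_metrics(results):
    """Render probe results ({(realserver, vip): bool}) in the
    Prometheus text exposition format."""
    lines = [
        "# HELP ipip_probe_success Whether the IPIP-encapsulated SYN got a SYN+ACK.",
        "# TYPE ipip_probe_success gauge",
    ]
    for (host, vip), ok in sorted(results.items()):
        lines.append(
            f'ipip_probe_success{{realserver="{host}",vip="{vip}"}} {int(ok)}'
        )
    return "\n".join(lines) + "\n"

def push(gateway, job, body):
    """PUT replaces all metrics in the push gateway's group for this job,
    so stale series from removed realservers don't linger."""
    req = urllib.request.Request(
        f"http://{gateway}/metrics/job/{job}",
        data=body.encode(),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    urllib.request.urlopen(req)

# Hypothetical run from an alerts host; hostnames/VIP are placeholders.
body = format_metrics({
    ("ncredir6001", "198.51.100.10"): False,
    ("ncredir6002", "198.51.100.10"): True,
})
# push("pushgateway.example.org:9091", "ipip_probe", body)
```

Using PUT (rather than POST) per probe cycle matches the "short-lived script" model: each run atomically replaces the previous run's series for the job group.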