[17:13:00] godog: is there a way to alert on staleness of data? Should I be worried about that when creating alerts?
[17:32:45] dcaro: staleness of what kind of data? you mean in Prometheus itself?
[17:38:41] cdanis: yes, as in, I'm alerting on the stat ceph_health_status being a certain value, but what happens if that stat has not been reported for a time?
[17:38:58] will that value eventually still be present, or will it be missing for those scraped hosts?
[17:39:09] there are also separate metrics about whether or not the expected targets are there
[17:39:16] a Prometheus-internal one that is literally just called `up`, for instance
[17:39:53] what if the name of the stat changed?
[17:40:08] (so the target is scraped, but that stat is not being reported, or the name changed)
[17:40:27] (also, in case the target was not scraped, would I get an alert?)
[17:44:27] I guess that my goal is to get alerted when that stat hits a value (that I got :) ), and when that stat has not been reported in a while (be that because the scrape failed, or the name of the stat changed, or anything else unexpected)
[17:44:49] you could check that the cardinality of the metric matches the expected one, I suppose
[17:45:02] count(ceph_health_status)
[17:47:47] hmm, interesting, I can add something like "and count(metric1) != 0 and count(metric2) != 0" to each alert xd, though I'd probably create a new one for each set of alerts to warn about staleness instead, does that make sense?
[18:53:01] dcaro: I'm not sure I understand what you're worried about here, honestly
[18:53:18] if Prometheus/Alertmanager isn't evaluating alert rules in a timely fashion, there are other alerts for that
[18:53:36] if the metric doesn't exist or a scrape hasn't been successful in a while, then the count will evaluate to 0
[19:03:18] cdanis: the concern is mostly that some alerts should be acted on asap, as in paging (for example), so if that alert is not triggering because the metric is stale or changed names, we want to page too (as that means that we are actually not really alerting on what we thought we were) instead of waiting to get that sorted out whenever the underlying issue is sorted out (or if it's
[19:03:20] ever even; as in metric name changes, nobody would be alerted if I understood correctly)
[19:05:13] so, that's not something we're generally in the habit of doing, but the count() thing I outlined would allow you to page on the lack of a scrape or on the metric having changed names
[19:05:31] I think there is also some confusion arising from how 'stale' doesn't work the way you think it does in Prometheus, though :)
[19:06:13] I'm interested in exploring that :), can you elaborate?
[19:07:03] there's no persistence of values
[19:07:10] the timeseries exists within the query window, or it doesn't
[19:10:00] ok, not sure I understand the implications of that, how is that different from what I think it means?
[19:16:10] gtg, but I'm really interested in exploring this, thanks!
[19:31:18] it has already been mentioned a few times here, but I'll repeat it for the sake of context. To solve most of the underlying issues (wrong metric name, no data, etc.) we could (and IMHO we should) set up CF's Pint: https://cloudflare.github.io/pint/ it's relatively new but seems to address most of the concerns around alerting with Prometheus
[19:59:00] volans: thanks for sharing, I'll take a look at it.
[20:00:06] relevant: https://phabricator.wikimedia.org/T309182
[20:01:19] volans: I'll add it as a topic for the next team meeting.
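For illustration, a minimal sketch of the "page on the value, and also page when the metric goes missing" idea discussed above, written as Prometheus alerting rules. The alert names, the `severity: page` label, the 15m/5m durations, the expected series count of 3, and the convention that `ceph_health_status == 2` means HEALTH_ERR are all assumptions for the example, not the actual rules; `absent()` is one idiomatic way to catch the metric disappearing, alongside the `count()` cardinality check suggested in the chat.

```yaml
groups:
  - name: ceph_health_example
    rules:
      # Page on the value itself (assumes the common ceph exporter convention
      # 0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR).
      - alert: CephHealthError
        expr: ceph_health_status == 2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Ceph is reporting HEALTH_ERR"

      # Page when the metric is missing entirely (scrape failing, metric
      # renamed, exporter broken). absent() returns a 1-valued series only
      # when no ceph_health_status series exists in the query window.
      - alert: CephHealthStatusMissing
        expr: absent(ceph_health_status)
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "ceph_health_status has not been reported recently"

      # Variant of the count() cardinality check from the chat: alert if fewer
      # series than expected are present (3 is a placeholder for the expected
      # number). Note: if no series exist at all, count() returns an empty
      # result rather than 0, so the absent() rule above covers that case.
      - alert: CephHealthStatusCardinalityLow
        expr: count(ceph_health_status) < 3
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "Fewer ceph_health_status series than expected"
```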
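And a companion sketch for the `up` metric mentioned at 17:39:16, which Prometheus sets to 1 for every successful scrape of a target and 0 for a failed one. The job label value and the alert name are placeholders, not the real job name.

```yaml
groups:
  - name: ceph_scrape_example
    rules:
      # Page when an expected target has not been scraped successfully.
      - alert: CephExporterScrapeFailed
        expr: 'up{job="ceph"} == 0'
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Ceph exporter target {{ $labels.instance }} is not being scraped"
```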