[08:19:19] we have two hosts down ATM (discovered indirectly by deploying a fleet-wide security update, which errored on them): irc2001 and orespoolcounter2004. both are shown as completely unavailable for > 4 days when I check the host entry in icinga
[08:19:45] but there's no alert for this on either icinga.wikimedia.org/alerts or alerts.wikimedia.org?
[08:20:35] there is an indirect ProbeDown one for irc2001, but I'd expect both to also alert on the hosts being unavailable?
[08:41:22] I've just fixed both hosts (irc2001 is needed as a failover host for today's switch main), but that seems like a notable gap in alerting worth debugging to me?
[15:06:54] It's unusual for the host-down alert to not appear on icinga.wikimedia.org/alerts unless the host had downtime or an ack applied. A host-down alert would not appear on alerts.wikimedia.org because host alerts don't go to Alertmanager. They definitely fired an alert to IRC, though.
[18:29:54] Hey o11y, I could use some help trying to get `avg_over_time` working on a range vector where I need to ignore null values within the range
[18:30:49] `1 - avg_over_time(job_backend:trafficserver_backend_requests:avail5m{backend=~"wdqs.discovery.wmnet",site=~"codfw"}[45d])` works, but if I expand the range further (say 60d) I get `NaN` as the result instead of a scalar. I assume this is due to the presence of null values near the beginning of that range
[18:56:04] It's definitely because there are nulls in the graph. Out of curiosity, how come the computation window needs to be so large?
[19:53:00] cwhite: this will be the recording rule for the metric used in https://grafana.wikimedia.org/d/l-3CMlN4z/wdqs-uptime-slo?orgId=1 , so I think it needs to be calculated over a 90d window.
[19:53:15] cwhite: here's what that currently looks like *before* my patch: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/thanos/recording_rules.yaml#21
[19:58:41] sorry, specifically I meant it's the recording rule for the metric used in https://grafana.wikimedia.org/d/l-3CMlN4z/wdqs-uptime-slo?orgId=1&viewPanel=9
[19:59:30] The hope is that using job_backend:trafficserver_backend_requests:avail5m, averaged over 90d without nulls, will be more performant than the previous approach; `job_backend:trafficserver_backend_requests:avail5m` is itself already a recorded metric, so computationally it should just be summing all the values and dividing
[20:00:12] This has the drawback that the resulting SLI will be weighted by time instead of by volume (e.g. every 5-minute window's % availability contributes equally regardless of how many requests fall in that window), but if it fixes the performance problem that tradeoff is okay with us
[20:37:16] afaik, there is no good way to assign a value to null values. The guidance I've heard is "always have a value".
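For reference, a minimal sketch of the kind of Thanos recording rule being discussed, assuming the existing job_backend:trafficserver_backend_requests:avail5m series and a 90d SLO window; the group name, record name, and label selectors below are illustrative assumptions, not the actual rule from operations/puppet:

```yaml
groups:
  - name: wdqs_slo_sketch
    rules:
      # Time-weighted 90d availability, precomputed from the existing 5m
      # recording rule. Any NaN sample inside the window propagates through
      # avg_over_time and makes the whole result NaN, which matches the
      # behaviour described in the log above.
      - record: job_backend:trafficserver_backend_requests:avail90d
        expr: >
          avg_over_time(
            job_backend:trafficserver_backend_requests:avail5m{backend="wdqs.discovery.wmnet", site="codfw"}[90d]
          )
```

One way to keep NaN points from poisoning the window is to drop them with a subquery first, e.g. `avg_over_time((job_backend:trafficserver_backend_requests:avail5m{backend="wdqs.discovery.wmnet"} >= 0)[90d:5m])`, since comparisons against NaN are always false and those samples are filtered out; whether evaluating a 90d subquery is cheap enough to meet the performance goal here is a separate question.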