[08:19:19] we have two hosts down ATM (discovered indirectly by deploying a fleet-wide security update, which errored on them): irc2001 and orespoolcounter2004. both are shown as completely unavailable for > 4 days when I check the host entry in icinga
[08:19:45] but there's no alert for this on either icinga.wikimedia.org/alerts or alerts.wikimedia.org?
[08:20:35] there is an indirect ProbeDown one for irc2001, but I'd expect both to also alert on the hosts being unavailable?
[08:41:22] I've just fixed both hosts (irc2001 is needed as a failover host for today's switch main), but that seems like a notable gap in alerting worth debugging to me?
[15:06:54] It's unusual for the host-down alert to not appear on icinga.wikimedia.org/alerts unless the host had downtime or an ack applied. A host-down alert would not appear on alerts.wikimedia.org because host alerts don't go to Alertmanager. They definitely fired an alert to IRC, though.
[18:29:54] Hey o11y, I could use some help trying to get `avg_over_time` working on a range vector where I need to ignore null values within the range
[18:30:49] `1 - avg_over_time(job_backend:trafficserver_backend_requests:avail5m{backend=~"wdqs.discovery.wmnet",site=~"codfw"}[45d])` works, but if I expand the range further (say 60d) I get `NaN` as the result instead of a scalar. I assume this is due to the presence of null values near the beginning of that range
[18:56:04] It's definitely because there are nulls in the graph. Out of curiosity, how come the computation window needs to be so large?
[19:53:00] cwhite: this will be the recording rule for the metric used in https://grafana.wikimedia.org/d/l-3CMlN4z/wdqs-uptime-slo?orgId=1 , so I think it needs to be calculated over a 90d window.
[19:53:15] cwhite: here's what that currently looks like *before* my patch: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/thanos/recording_rules.yaml#21
[19:58:41] sorry, specifically I meant it's the recording rule for the metric used in https://grafana.wikimedia.org/d/l-3CMlN4z/wdqs-uptime-slo?orgId=1&viewPanel=9
[19:59:30] The hope is that using job_backend:trafficserver_backend_requests:avail5m, averaged over 90d without nulls, will be more performant than the previous approach; `job_backend:trafficserver_backend_requests:avail5m` is itself already a recorded metric, so computationally it should just be summing all the values and dividing
[20:00:12] This has the drawback that the resulting SLI will be weighted by time instead of by volume (e.g. every 5-minute window's % availability contributes equally regardless of how many requests fall in that window), but if it fixes the performance problem that tradeoff is okay with us
[20:37:16] afaik, there is no good way to assign a value to null values. The guidance I've heard is "always have a value".
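For reference, a minimal sketch of the kind of Thanos recording rule being discussed, assuming the existing job_backend:trafficserver_backend_requests:avail5m series and a 90d SLO window; the group name, record name, and label selectors below are illustrative assumptions, not the actual rule from operations/puppet:

```yaml
groups:
  - name: wdqs_slo_sketch
    rules:
      # Time-weighted 90d availability, precomputed from the existing 5m
      # recording rule. Any NaN sample inside the window propagates through
      # avg_over_time and makes the whole result NaN, which matches the
      # behaviour described in the log above.
      - record: job_backend:trafficserver_backend_requests:avail90d
        expr: >
          avg_over_time(
            job_backend:trafficserver_backend_requests:avail5m{backend="wdqs.discovery.wmnet", site="codfw"}[90d]
          )
```

One way to keep NaN points from poisoning the window is to drop them with a subquery first, e.g. `avg_over_time((job_backend:trafficserver_backend_requests:avail5m{backend="wdqs.discovery.wmnet"} >= 0)[90d:5m])`, since comparisons against NaN are always false and those samples are filtered out; whether evaluating a 90d subquery is cheap enough to meet the performance goal here is a separate question.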