[13:25:29] (PuppetFailure) resolved: Puppet has failed on prometheus1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:37:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[13:42:13] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[13:48:15] that's me ^ working on https://phabricator.wikimedia.org/T351179
[14:32:33] (LogstashNoLogsIndexed) firing: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=eqiad%20prometheus/ops - https://alerts.monitoring.wmflabs.org/?q=alertname%3DLogstashNoLogsIndexed
[14:50:28] godog will it mess up your work if I run puppet on prometheus1005? Added a blackbox check in https://gerrit.wikimedia.org/r/c/operations/puppet/+/974281 and wanted to check it
[14:51:26] inflatador: go for it! thank you for the heads up
[14:51:54] godog np, running it now ;)
[15:12:26] for `prometheus::blackbox::check::http`, do the check names need to be unique? I'm not seeing a new check on prometheus1005. Re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/974281/9/modules/profile/manifests/microsites/query_service.pp
[15:15:36] looks like yes
[15:15:45] based on modules/prometheus/manifests/blackbox/check/http.pp
[15:27:17] inflatador: did puppet run (in general, but also did it run successfully?) on the host that uses profile::microsites::query_service ?
[15:27:34] that needs to happen first for prometheus to consider the check, because exported resources
[15:40:12] godog unsure, let me check which host/hosts use that. I was expecting a bb check to appear on prom1005 in `/etc/prometheus/blackbox.yml.d` but don't see it yet
[15:42:06] looks like miscweb servers...running now
[15:42:29] Y, getting a puppet error ` Duplicate declaration: Prometheus::Blackbox::Check::Http[query.wikidata.org] is already declared at`
[15:49:15] theres-your-problem-jamie-hyneman.gif
[16:27:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:32:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:06:03] \o Hi all I'm trying to set up grizzly dashboards for our (in-the-process-of-being-defined) search SLOs. I could use some help composing the graphite functions to do what I want. I've got a metric `Search.FullTextResults.p95` (see https://grafana-rw.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?forceLogin&orgId=1&viewPanel=14), and I'm trying to calculate the % of datapoints that are below a certain threshold (4000 in this
[19:06:03] case)
[19:06:38] So basically this is a combined latency/uptime SLO, I want uptime to basically be `% of datapoints in a 90day window that are below the latency threshold of 4000`
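
(Editor's sketch, not part of the log.) The SLI described above, the share of datapoints in a window that fall below a latency threshold, can be illustrated offline in Python. This is only a rough sketch under stated assumptions: the datapoints are taken to be `(value, timestamp)` pairs with `None` for gaps, which is the shape of Graphite's JSON render output; the function name `percent_below_threshold` and the `count_missing_as_bad` switch are hypothetical; and whether a missing datapoint should count against a combined latency/uptime SLO is a policy choice the log does not settle. It does not show how to express this with Graphite functions.

```python
from typing import Iterable, Optional, Tuple

# (value, unix_timestamp); value is None where the series has a gap.
Datapoint = Tuple[Optional[float], int]


def percent_below_threshold(
    datapoints: Iterable[Datapoint],
    threshold: float = 4000.0,
    count_missing_as_bad: bool = True,
) -> float:
    """Return the percentage of datapoints whose value is below `threshold`.

    Hypothetical helper: `count_missing_as_bad` decides whether gaps (None)
    count as failures (True) or are excluded from the total (False).
    """
    good = 0
    total = 0
    for value, _timestamp in datapoints:
        if value is None:
            if count_missing_as_bad:
                total += 1  # a gap counts as a failed datapoint
            continue
        total += 1
        if value < threshold:
            good += 1
    if total == 0:
        raise ValueError("no datapoints to evaluate")
    return 100.0 * good / total


# Made-up sample values for illustration only, not real measurements.
sample = [(1200.0, 0), (3900.0, 60), (4500.0, 120), (None, 180)]
print(percent_below_threshold(sample))  # 50.0
print(percent_below_threshold(sample, count_missing_as_bad=False))
```

With gaps counted as failures, two of the four sample points are good, giving 50%. With gaps excluded, the result is two of three. The 90-day window would be applied by whatever query selects the datapoints before they reach this function.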