[13:25:29] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on prometheus1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:37:13] <jinxer-wm>	 (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[13:42:13] <jinxer-wm>	 (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[13:48:15] <godog>	 that's me ^ working on https://phabricator.wikimedia.org/T351179
[14:32:33] <jinxer-wm-test>	 (LogstashNoLogsIndexed) firing: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=eqiad%20prometheus/ops - https://alerts.monitoring.wmflabs.org/?q=alertname%3DLogstashNoLogsIndexed
[14:50:28] <inflatador>	 godog will it mess up your work if I run puppet on prometheus1005? Added a blackbox check in https://gerrit.wikimedia.org/r/c/operations/puppet/+/974281 and wanted to check it
[14:51:26] <godog>	 inflatador: go for it! thank you for the heads up
[14:51:54] <inflatador>	 godog np, running it now ;)
[15:12:26] <inflatador>	 for `prometheus::blackbox::check::http`, do the check names need to be unique? I'm not seeing a new check on prometheus1005. Re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/974281/9/modules/profile/manifests/microsites/query_service.pp
[15:15:36] <inflatador>	 looks like yes
[15:15:45] <inflatador>	 based on modules/prometheus/manifests/blackbox/check/http.pp
[15:27:17] <godog>	 inflatador: did puppet run (in general, but also did it run successfully?) on the host that uses profile::microsites::query_service ?
[15:27:34] <godog>	 that needs to happen first for prometheus to consider the check, because exported resources
[15:40:12] <inflatador>	 godog unsure, let me check which host/hosts use that. I was expecting a bb check to appear on prom1005 in `/etc/prometheus/blackbox.yml.d` but don't see it yet
[15:42:06] <inflatador>	 looks like miscweb servers...running now
[15:42:29] <inflatador>	 Y, getting a puppet error ` Duplicate declaration: Prometheus::Blackbox::Check::Http[query.wikidata.org] is already declared at`
[15:49:15] <godog>	 theres-your-problem-jamie-hyneman.gif
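(Editor's sketch of the failure discussed above.) Puppet requires each resource in a catalog to be unique by type and title, which is why a second `Prometheus::Blackbox::Check::Http[query.wikidata.org]` aborts the run on the miscweb hosts before the check can be exported to the Prometheus hosts. The toy model below only illustrates that uniqueness rule. It is not Puppet's implementation, and the `Catalog` class and the manifest names passed to it are hypothetical.

```python
class DuplicateDeclaration(Exception):
    """Raised when the same (type, title) pair is declared twice."""


class Catalog:
    """Toy stand-in for a compiled catalog: one entry per (type, title)."""

    def __init__(self):
        self._resources = {}

    def declare(self, rtype, title, source):
        key = (rtype, title)
        if key in self._resources:
            # Mirrors the shape of the error seen in the log; the wording is illustrative.
            raise DuplicateDeclaration(
                f"Duplicate declaration: {rtype}[{title}] "
                f"is already declared at {self._resources[key]}"
            )
        self._resources[key] = source


catalog = Catalog()
# Hypothetical source locations, for illustration only.
catalog.declare("Prometheus::Blackbox::Check::Http", "query.wikidata.org", "existing_manifest.pp")
try:
    catalog.declare("Prometheus::Blackbox::Check::Http", "query.wikidata.org", "new_manifest.pp")
except DuplicateDeclaration as err:
    print(err)
```

Giving the new check a title that is not already in use on that host would avoid the collision in this model; whether that is the right fix for the actual change is for the patch author to decide.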
[16:27:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:32:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:06:03] <ryankemper>	 \o Hi all I'm trying to set up grizzly dashboards for our (in-the-process-of-being-defined) search SLOs. I could use some help composing the graphite functions to do what I want. I've got a metric `Search.FullTextResults.p95` (see https://grafana-rw.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?forceLogin&orgId=1&viewPanel=14), and I'm trying to calculate the % of datapoints that are below a certain threshold (4000 in this case)
[19:06:38] <ryankemper>	 So basically this is a combined latency/uptime SLO, I want uptime to basically be `% of datapoints in a 90day window that are below the latency threshold of 4000`
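(Editor's sketch of the arithmetic being asked for.) The SLI described above is the share of datapoints in the window whose value is below the threshold. The Python below shows only that calculation, not the Graphite expression. The function name `percent_below` and the `count_missing_as_bad` switch are assumptions made for illustration; in particular, whether missing datapoints should count against uptime is a policy choice the log does not settle.

```python
from typing import Optional, Sequence


def percent_below(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    count_missing_as_bad: bool = False,
) -> Optional[float]:
    """Return the percentage of datapoints strictly below `threshold`.

    `None` entries stand for missing datapoints. By default they are ignored;
    with count_missing_as_bad=True they stay in the denominator, so they
    lower the result.
    """
    present = [v for v in datapoints if v is not None]
    total = len(datapoints) if count_missing_as_bad else len(present)
    if total == 0:
        return None  # nothing to measure in this window
    good = sum(1 for v in present if v < threshold)
    return 100.0 * good / total


# Made-up sample values, not real Search.FullTextResults.p95 data.
window = [1000.0, 3999.0, 4000.0, 5000.0]
print(percent_below(window, 4000))
```

With the sample window, two of the four values are strictly below 4000. A value equal to the threshold does not count as "below" in this sketch.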