[00:17:37] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[00:22:37] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[10:43:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:48:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:26:10] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[11:31:10] (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[11:31:23] good morning, I have a question regarding downtimes/silences. We currently have a single failing systemd unit that is creating 6 alerts: 2 on icinga and 4 on alert.w.o. What's the best way to silence them without having to silence each of them individually?
[11:32:31] see https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=geoip
[11:33:14] and https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=geoip
[12:25:05] volans: for systemdunitfailed (i.e. the prometheus/alertmanager one) what I do is hit the date dropdown, then "silence this alert", and remove the "unique" labels, in this case site and instance
[12:26:05] for the icinga alerts I'd say the usual icinga procedure; looking at the bigger picture, I believe both the unit-specific and the "check systemd state" icinga checks could be removed
[12:30:24] off to lunch, bbiab
[12:54:45] godog: yeah, my main question is why a single failed unit triggers 6 alerts
[12:56:50] also, spicerack's icinga module has support for downtiming services, but the alertmanager one doesn't, hence it's not exposed in the downtime cookbook, which could otherwise easily do that and match them all
[13:19:03] volans: good point re: spicerack and alertmanager!
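[Editor's note] A minimal sketch of the "remove the unique labels" approach described above, done against the standard Alertmanager v2 silences API rather than the karma UI. The Alertmanager URL, the unit name, and the assumption that the alert carries the unit in a 'name' label are placeholders/assumptions, not taken from this log; adjust them to the real deployment.

```python
"""Create one Alertmanager silence that covers a failing systemd unit on every
host, by matching only the labels shared by all duplicates (alertname + unit)
and deliberately omitting per-host labels such as 'instance' and 'site'.
"""
from datetime import datetime, timedelta, timezone

import requests

# Placeholder endpoint, not the real WMF Alertmanager.
ALERTMANAGER_URL = "https://alertmanager.example.org"


def silence_unit(unit: str, hours: float = 2, author: str = "ops", comment: str = "") -> str:
    """POST a silence to /api/v2/silences and return its ID."""
    now = datetime.now(timezone.utc)
    payload = {
        # Only shared labels: no 'instance', no 'site', so one silence matches all hosts.
        "matchers": [
            {"name": "alertname", "value": "SystemdUnitFailed", "isRegex": False},
            {"name": "name", "value": unit, "isRegex": False},  # assumed label for the unit
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment or f"silencing {unit} on all hosts",
    }
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["silenceID"]


if __name__ == "__main__":
    # Illustrative unit name only.
    print(silence_unit("generate_geoip_data.service", hours=4, author="volans"))
```

The point is that the matcher list carries only labels common to every duplicate alert, so a single silence covers all sites and instances at once.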
[13:20:10] re: the systemdunitfailed duplication, that's due to the bug with @receiver in karma; I don't have the gh issue link handy, though I think you remember it
[13:21:05] within a systemdunitfailed alert, one is for puppetserver and the other is for puppetmaster
[13:30:16] yeah sorry, I meant 5 per host :D (due to the duplication bug)
[13:31:34] I don't recall, but I thought there had been some work in the past to improve the failed systemd unit check. Ideally it should ignore all the units that are alerting on their own (I think there is a puppet parameter for it)
[13:33:45] I can't recall that work atm, though yes, systemdunitfailed now honors the 'team' parameter when it is set in puppet for the unit
[13:34:39] after the holidays I think we should move systemdunitfailed to critical and ditch the per-unit and systemd-wide icinga checks
[13:52:29] herron: o/ lemme know if you have time to brainbounce https://phabricator.wikimedia.org/T352756
[17:16:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:21:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:57:54] godog: "(CertAlmostExpired) firing: Certificate for service kubestagemaster:6443 is about to expire" - I suspect that is from the blackbox probe I just added. Is there a way to tune the parameters for the check? Staging certs have a very short expiry time
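[Editor's note] Context for the CertAlmostExpired question: blackbox_exporter publishes the probed certificate's expiry as the probe_ssl_earliest_cert_expiry metric (a Unix timestamp), and expiry alerts are typically built by comparing it against the current time plus a threshold. A hedged sketch of checking the remaining validity through the Prometheus query API follows; the Prometheus URL is a placeholder, and the actual threshold/tuning knob for the WMF check lives in puppet/alerting rules and is not reproduced here.

```python
"""Report how many days remain before the certificate seen by a blackbox probe
expires, using the standard probe_ssl_earliest_cert_expiry metric via the
Prometheus HTTP API.
"""
import time

import requests

# Placeholder endpoint, not the real WMF Prometheus.
PROMETHEUS_URL = "https://prometheus.example.org"


def days_until_expiry(instance: str) -> float:
    """Return days until the probed certificate expires, per the exporter."""
    query = f'probe_ssl_earliest_cert_expiry{{instance="{instance}"}}'
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    if not results:
        raise ValueError(f"no probe result for {instance}")
    expiry_ts = float(results[0]["value"][1])  # metric value is a Unix timestamp
    return (expiry_ts - time.time()) / 86400


if __name__ == "__main__":
    print(f"{days_until_expiry('kubestagemaster:6443'):.1f} days left")
```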