[01:54:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-mw-accesslog-metrics - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[10:19:47] MichaelG_WMF: thank you for investigating re: forceLogin, that definitely rang a bell and I think it might be an instance of the underlying bug in T350108
[10:19:47] T350108: Grafana login fails to redirect back from an alert page - https://phabricator.wikimedia.org/T350108
[10:20:46] godog: ah yes, looks related!
[10:23:48] MichaelG_WMF: if you could add your findings there, I think that'll be helpful as a data point too
[10:24:02] will do 👍
[10:29:49] ✅ done
[10:31:44] thank you!
[13:23:09] * kamila_ filing a ticket for the benthos lag, will look into it once sufficiently caffeinated
[14:59:38] hey folks!
[14:59:50] I noticed this alert for kafka-logging1004: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=instance%3Dkafka-logging1004
[15:00:40] mediawiki.httpd.accesslog seems really big, with a lot of partitions on the same node
[15:00:50] I also don't see the MOTD, which is strange
[15:08:46] elukey: interesting, I'll take a look
[15:49:13] filed https://phabricator.wikimedia.org/T384233 re: the space increase; looks like udp_localhost-info did increase quite a bit since Jan 6th
[15:50:12] cc swfrench-wmf, maybe the above rings a bell ^ ?
[16:05:13] How do you monitor that prometheus is sending alerts to alertmanager correctly? (we just had an instance in cloud where the network got borked, prometheus was unable to connect to alertmanager, and we got no alerts at all xd)
[16:14:57] godog: hmmm ... very strange timing for the sudden break in slope on the 6th in T384233 (e.g., no changes that plausibly correlate in SAL). I can take a closer look tomorrow, unless other service ops folks get there first.
[16:14:58] T384233: Unexpected utilization increase in udp_localhost-info kafka-logging topic - https://phabricator.wikimedia.org/T384233
[16:15:35] swfrench-wmf: doh, please excuse the ping, I totally forgot about MLK day :( enjoy your time off, this can totally wait until tomorrow
[16:16:04] no worries at all! :)
[16:17:36] dcaro: I'm heading out shortly, but tl;dr is that we do have "prometheus can't send alerts" alerts, which have an obvious flaw; there's also redundancy at the prometheus/alertmanager level, and of course the nail in the coffin is an end-to-end/watchdog alert plus external checking, which we're planning to get to
[16:21:24] fwiw we do externally monitor icinga though :)
[16:32:07] godog: I've dropped a couple of suspicions on the task, but they're all posterior to 01:00:00 UTC on the 6th
[16:45:14] godog: thanks!
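
For context on godog's 16:17 answer, here is a minimal sketch of the two layers described there: a "prometheus can't send alerts" rule based on the stock prometheus_notifications_errors_total metric (which has the obvious flaw of relying on the very delivery path it checks), plus an always-firing watchdog alert that an external checker pages on if it ever stops arriving. Rule names, thresholds, and labels are illustrative assumptions, not the actual WMF rules.

  groups:
    - name: alerting-pipeline-health
      rules:
        # Fires when this Prometheus fails to deliver notifications to
        # Alertmanager; flaw: it depends on the same path it is checking.
        - alert: PrometheusNotificationErrors
          expr: rate(prometheus_notifications_errors_total[5m]) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.instance }} is failing to send alerts to Alertmanager"
        # Always-firing watchdog ("dead man's switch"): routed to an external
        # service that pages if the alert stops arriving, covering the case
        # where the whole Prometheus -> Alertmanager path is down.
        - alert: Watchdog
          expr: vector(1)
          labels:
            severity: none
          annotations:
            summary: "End-to-end alerting pipeline check; should always be firing"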
[22:09:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:19:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:29:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:31:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[22:34:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:36:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[22:39:40] RESOLVED: [4x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
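
For reference, a rough sketch of what a consumer-lag rule like the LogstashKafkaConsumerLag alerts above could look like, assuming a Burrow-style exporter that exposes per-partition lag as kafka_burrow_partition_lag; the metric name, threshold, and labels are assumptions and may not match the actual production rule.

  groups:
    - name: kafka-consumer-lag
      rules:
        - alert: LogstashKafkaConsumerLag
          # Sum lag across partitions per cluster and consumer group;
          # substitute whatever lag metric your exporter actually exposes.
          expr: sum by (cluster, group) (kafka_burrow_partition_lag) > 1e6
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Too many messages in {{ $labels.cluster }} for group {{ $labels.group }}"
            runbook: "https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag"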