[22:07:16] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:37:16] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:47:16] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:08:10] ^ I'm taking a look.
[23:20:24] The Kafka consumer graphs look healthy; I'm taking a look at Logstash's ingestion.
[23:22:16] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:26:35] Logstash's ingestion rate seems healthy. I'm taking a look at the disk usage of the Logstash instances in eqiad to check their health.
[23:27:16] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:39:41] I've ACK'd the alerts; this graph is relevant to the issue: https://grafana.wikimedia.org/goto/BSEJn9JIz
[23:44:01] The Logstash clusters are healthy.
[23:51:51] I think this issue is related to T337818
[23:51:52] T337818: apache2 cpu-stuck on logstash hosts causes kafka logging lag - https://phabricator.wikimedia.org/T337818
[23:57:38] I can see a drop in the Logstash input rate for logstash1023 that correlates with an increase in 'tripped in_flight_requests' errors on that host. I'm taking a deeper look at it. https://grafana.wikimedia.org/goto/cHYbV9JIz?orgId=1
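
(For context: the lag this alert measures is, per partition, the gap between the broker's log end offset and the offset the apifeatureusage consumer group has committed. Below is a minimal sketch of checking that gap by hand with the kafka-python client, not the on-call tooling used above; the broker address and topic name are placeholders, not the real logging-eqiad values.)

    # Sketch: per-partition lag for a consumer group, using kafka-python.
    # Assumptions: BROKERS and TOPIC are hypothetical placeholders.
    from kafka import KafkaConsumer, TopicPartition

    BROKERS = ["kafka-logging-broker.example:9092"]  # placeholder broker
    TOPIC = "example-logging-topic"                  # placeholder topic
    GROUP = "apifeatureusage"

    # Join as the group (without committing) so we can read its committed offsets.
    consumer = KafkaConsumer(bootstrap_servers=BROKERS, group_id=GROUP,
                             enable_auto_commit=False)
    partitions = [TopicPartition(TOPIC, p)
                  for p in consumer.partitions_for_topic(TOPIC)]
    end_offsets = consumer.end_offsets(partitions)   # broker-side log end offsets
    for tp in partitions:
        committed = consumer.committed(tp) or 0      # None if nothing committed yet
        print(f"{tp.topic}[{tp.partition}] lag = {end_offsets[tp] - committed}")
    consumer.close()

(Summing that lag across partitions gives the per-group number the Grafana kafka-consumer-lag dashboard plots; a healthy group trends toward zero, while a stuck consumer, as in T337818, shows lag growing monotonically.)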