[10:57:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:02:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:22:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:04:15] looks like eventgate-analytics-external is logging incredibly hard (13,000 msg/sec). /me setting up a filter
[17:10:45] ottomata: ^^
[17:13:48] also cc tchin, there was an eventgate-analytics-external deploy earlier
[17:14:10] lines up more or less directly with it
[17:18:31] mitigation is rolling out now
[17:27:14] Mitigation applied. I doubt codfw lag will decrease due to the increased latency that artificially caps its consumer potential, but eqiad is keeping up.
[17:32:01] Submitted a 24h silence for the LogstashKafkaConsumerLag alert just for the logstash7-codfw topic. Will look again in a bit.
[17:45:23] o/ looking
[17:46:06] there is a known issue with the train breaking mediawiki.ip_reputation_score stream, but I can't imagine that would cause this
[17:48:01] o/
[17:48:13] the spike lines up very closely with the eventgate-analytics-external deploy at 16:17
[17:48:38] might just be a case of a rollback if that's safe
[17:49:36] indeed.
[17:49:39] yes i think so
[17:49:46] i think this is the new service-utils based eventgate
[18:04:19] tchin: is rolling back now