[00:47:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:51:55] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[04:47:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:46:41] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:51:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:47:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:16:41] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:18:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:43:40] FIRING: [3x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:53:12] I'm taking a look at the consumer lag alerts
[10:21:14] while doing the bullseye reboots I think rebooting centrallog "unstuck" benthos mw-accesslog which is now processing and causing lag downstream in kafka -> logstash processing -.0
[10:21:18] -.-
[12:22:16] lag will continue until the http access log topic is drained
[12:47:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
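(For reference while waiting for the topic to drain: a minimal sketch, using the kafka-python client, of how per-partition lag for a consumer group can be checked directly against the brokers. The bootstrap address below is a placeholder, not the actual logging-eqiad broker.)

# Sketch: report per-partition and total lag for one consumer group (kafka-python).
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "kafka-logging-broker.example.org:9092"  # placeholder, not a real host
GROUP = "benthos-mw-accesslog-metrics"

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
committed = admin.list_consumer_group_offsets(GROUP)   # {TopicPartition: OffsetAndMetadata}

consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
end_offsets = consumer.end_offsets(list(committed))    # {TopicPartition: latest offset}

total = 0
for tp, meta in sorted(committed.items()):
    lag = end_offsets[tp] - meta.offset
    total += lag
    print(f"{tp.topic}[{tp.partition}]: lag={lag}")
print(f"total lag for {GROUP}: {total}")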
[12:59:39] nope lag on rsyslog-notice keeps actually increasing
[13:01:32] still investigating
[13:07:21] I was wrong earlier re: benthos-mw-accesslog-metrics lag, of course that doesn't impact logstash
[13:13:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:14:31] ok got it https://phabricator.wikimedia.org/T366596
[13:14:53] I've pinged Ben on -sre
[13:18:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:21:34] godog: I'm going to add a spam filter for that container.
[13:21:53] sounds wise, thank you cwhite
[13:57:52] godog: quick sanity check when you get a chance? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038787
[13:58:34] cwhite: yeah! LGTM
[14:04:59] thanks! hoping the backlog will start dropping soon. still seeing an increase of ~1 million/minute to the notice topic
[14:11:59] mmhh yeah we're not back yet to usual levels of input messages in kafka https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=thanos&var-kafka_cluster=logging-eqiad&var-cluster=logstash&var-kafka_broker=All&var-disk_device=All&from=1717488699290&to=1717510299290&viewPanel=54
[16:47:40] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:12:25] RESOLVED: SystemdUnitFailed: curator_actions_cluster_wide.service on logstash2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:18:55] FIRING: [3x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:33:40] FIRING: [3x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:33:55] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
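(On the ~1 million messages/minute still arriving on the notice topic mentioned at 14:04:59: the backlog only starts dropping once consumer throughput exceeds the produce rate, and the time to drain is roughly backlog / (consume rate - produce rate). A tiny sketch of that estimate; the backlog and consume-rate figures below are illustrative placeholders, not values taken from the dashboards.)

# Sketch: rough backlog drain-time estimate.
def drain_minutes(backlog_msgs: float, consume_per_min: float, produce_per_min: float) -> float:
    """Minutes to clear the backlog; infinite if consumers can't outpace producers."""
    net = consume_per_min - produce_per_min
    return float("inf") if net <= 0 else backlog_msgs / net

# Illustrative numbers only: 50M message backlog, consuming 1.5M/min against ~1M/min input.
print(drain_minutes(backlog_msgs=50e6, consume_per_min=1.5e6, produce_per_min=1.0e6))  # -> 100.0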