[02:33:56] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:33:56] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:33:56] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:39:52] there's also growing lag in consuming from codfw, will take a look
[12:44:03] godog: for the codfw growing lag: https://phabricator.wikimedia.org/T366657
[12:44:24] cwhite: hah! thank you, makes sense
[12:46:24] hello, quick naive question: I know that label cardinality is a real issue on prometheus and nothing to joke about for data retention's sake. Given that preamble, I'm curious about the extent of the exceptions we allow to exist in that domain. I'm trying to figure out a way to reproduce the behaviour that allowed
[12:46:24] https://wm-bot.wmcloud.org/logs/%23wikimedia-operations/20240602.txt (ctrl+f 1034) a mariadb error message to be attached to the alert payload. It seems obvious with icinga, but as prometheus goes, outside of that hacky way of misusing labels, I'm a bit sceptical
[12:57:16] arnaudb: good question, we've been wondering the same wrt providing some compatibility with icinga/nrpe scripts. We don't have anything firm yet, though I've arrived at basically the same conclusion as you did so far: i.e. output in a label
[12:57:34] which might be fine as long as it doesn't change very frequently, i.e. cardinality is fine
[12:57:39] I feel dirty just thinking about it haha
[12:57:57] heheh I hear you
[12:58:10] do we have a loki instance? this could be a job for a recording rule
[12:58:34] I don't think I'm following, what do you have in mind?
[12:59:17] I'm not sure I'm not hallucinating, let me double check myself
[13:03:22] in that case I was thinking about using the annotations field, but that boils down to using the label indeed
[13:04:15] that's right yeah
[13:08:49] arnaudb: I'm happy to brainstorm together on what a solution to this looks like, also because we have to implement it anyway as a compatibility shim
[13:09:36] hello folks!
[13:09:41] I noticed https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus%2Fops&orgId=1&from=now-2d&to=now
[13:09:55] not sure if already known, dropped a link in here for visibility
[13:10:21] elukey: yes known :( https://phabricator.wikimedia.org/T366657
[13:10:33] thank you for the heads up tho
[13:10:39] :( ack
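A minimal sketch of the "output in a label" idea discussed above, not an actual production rule: it assumes a hypothetical compatibility shim that exports a legacy check's exit code as a metric named legacy_check_status, carrying the plugin output in an output label (every name here is invented for illustration). The alerting rule then templates that label into the alert payload via an annotation, which is how a mariadb error message could ride along with the alert the way it does under icinga:

    # Hypothetical Prometheus rule file. Assumes a shim exports, e.g.:
    #   legacy_check_status{check="mariadb_replication", output="Error 1045: ..."} 2
    # with NRPE-style values: 0=OK, 1=WARNING, 2=CRITICAL.
    groups:
      - name: nrpe_compat_sketch
        rules:
          - alert: LegacyCheckCritical
            # Alert on the numeric status; the output label rides along
            # with the series and is available to the templates below.
            expr: legacy_check_status >= 2
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: 'Legacy check {{ $labels.check }} is critical'
              # The icinga-like part: the check output lands in the
              # notification payload.
              description: '{{ $labels.output }}'

The caveat raised at 12:57:34 is the whole trade-off: every distinct output string mints a new time series, so this stays safe only while the output is stable (a fixed error message, say); a check whose output embeds timestamps or counters would blow up cardinality.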
[13:12:41] godog: happy to brainstorm as well! I'm currently working on the different aspects of moving our alerting to alertmanager; I encountered that wall in the process. I'll send you a link to it to have a basis when I'm done writing!
[13:12:57] arnaudb: great, cheers
[14:33:56] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:33:56] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:53:41] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[23:53:56] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
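For context on the LogstashKafkaConsumerLag notifications that bracket this log: the actual rule is not shown here, but a consumer-lag alert of this shape typically sums per-partition lag by consumer group and compares it to a threshold. The sketch below is a generic reconstruction, assuming the kafka_consumergroup_lag metric and consumergroup label of the widely used kafka_exporter (the real rule may use different names, e.g. Burrow's); the threshold and durations are invented:

    # Hypothetical reconstruction, not the real LogstashKafkaConsumerLag rule.
    groups:
      - name: kafka_consumer_lag_sketch
        rules:
          - alert: LogstashKafkaConsumerLag
            # Sum partition-level lag per cluster, consumer group and topic.
            expr: >
              sum by (cluster, consumergroup, topic) (
                kafka_consumergroup_lag{cluster="logging-eqiad"}
              ) > 1000000
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: >-
                Too many messages in {{ $labels.cluster }} for group
                {{ $labels.consumergroup }}
              dashboard: 'https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag'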