[00:44:25] RESOLVED: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:21:51] morning folks.. query question for you guys: I'm having issues rendering `haproxykafka_socket_circuit_breaker_active_seconds_total` in a way that's easy to understand. The metric tracks the total time a circuit breaker has been active. Last night on cp3072 the circuit breaker was active for 3.4 seconds in a 30-second window: the value at 20:22:30 was 52.4 and at 20:23:00 it was 55.8. Using rate() shows a constant rate of 62ms/s, and increase() (even using a window of 2 minutes) shows data points of 6.94 seconds... I was thinking of showing this as a percentage of time in the healthy state (circuit breaker off) with `(1 - rate(haproxykafka_socket_circuit_breaker_active_seconds_total[2m])) * 100`, but I'm open to suggestions here :)
[14:24:29] vgutierrez: for a simple option, increasing the increase() window to, say, 30m displays the times when the circuit breaker is active, for instance https://w.wiki/EwQt
[14:25:18] as a boolean (on/off) that works, but I'd like to see how many seconds the circuit breaker is enabled
[14:25:48] the data is there, I need a clear way of representing it
[14:27:59] vgutierrez: are you hoping to break it down per-node or per-cluster or?
[14:29:34] probably this time interval is better to play with: https://grafana.wikimedia.org/goto/eUQkAjQHg?orgId=1
[14:31:56] for the drilldown dashboard, per node
[14:32:04] and for the overall one, per cluster
[14:32:47] https://grafana.wikimedia.org/goto/rQlI0CwHg?orgId=1
[14:39:33] why count() - rate()?
[14:41:49] i could have just done 1 - (rate()/count())
[14:42:14] you need a denominator if you want a whole-cluster seconds/second ratio
[14:44:01] vgutierrez: fwiw that'll give you a count of seconds by node, and you can aggregate as needed. or yeah, it can be visualized as a percentage of time in the circuit breaker. the counter doesn't seem to change very often
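A minimal PromQL sketch of the options discussed above, assuming the counter is exported once per instance; the 30m window comes from the suggestion in the conversation, and the bare metric selector (no label matchers, no `by` clauses) is an illustrative assumption, not taken from the linked dashboards:

```
# Per-node drilldown: seconds the circuit breaker was active during each 30m window.
increase(haproxykafka_socket_circuit_breaker_active_seconds_total[30m])

# Cluster-wide ratio: active seconds-per-second summed across nodes, divided by
# the number of nodes reporting the counter -- the denominator mentioned above.
  sum(rate(haproxykafka_socket_circuit_breaker_active_seconds_total[30m]))
/ count(haproxykafka_socket_circuit_breaker_active_seconds_total)

# Percentage of healthy time (circuit breaker off), cluster-wide.
(1 - sum(rate(haproxykafka_socket_circuit_breaker_active_seconds_total[30m]))
   / count(haproxykafka_socket_circuit_breaker_active_seconds_total)) * 100
```

Note that both rate() and increase() average over the whole window, so a 3.4s activation inside a 2m window gets smeared across that window; widening the window trades time resolution for a smoother, more readable series, which is why the 30m variant reads better on the dashboard.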
[19:54:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[20:19:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[20:23:45] hmm, ~13k msg/s of "Overriding meta.dt in event <> of schema at <> from <> to <>" from eventgate
[20:24:07] ! here
[20:24:15] i waited for a patch to go out with the train to avoid that.
[20:24:23] i just deployed that to eventgate main shortly ago
[20:24:51] cwhite: quick link to the logstash doc?
[20:24:55] i want to look at the culprit
[20:25:23] thanks ottomata! https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2025.08.04?id=Gim6dpgBAlpnrixJuB7B
[20:25:54] ah, i did not expect that... i think that is in mw core then...
[20:26:00] might have to roll back
[20:26:03] ...
[20:26:47] yup i found it.
[20:26:48] hm
[20:29:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[20:29:59] I will roll back eventgate-analytics
[20:43:02] cwhite: i'm having trouble rolling back now, because of k8s weirdness. asking in -serviceops
[20:43:10] but i have to run to do day care duties soon!
[20:43:13] the patch is merged.
[20:43:26] eventgate-analytics in eqiad just needs to be deployed
[20:47:25] thanks, ottomata! ttyl :)
[21:19:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag