[12:32:41] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:37:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:08:19] godog was looking at adding hostname to ProbeDown alerts ( re: https://phabricator.wikimedia.org/T356140#9527892 ). Do you have any more feedback/examples on how to do this?
[13:11:25] inflatador: not really, sorry, making such things easier is one of the things we'll be doing this quarter
[13:12:36] godog np. /me did a lot of transactional/ticket routing optimization stuff in a past life, so if I can help in any way feel free to reach out
[13:12:50] for sure, thank you inflatador
[13:43:10] folks, logstash1036 is in eqiad rack E1 where we're doing a switch upgrade later today
[13:43:26] https://phabricator.wikimedia.org/T365993
[13:43:44] the JunOS upgrade is going to mean ~20 mins downtime, is that likely to be an issue?
[13:43:54] (sry should have touched base earlier on this)
[13:47:15] topranks: afaik not a problem no, my recollection is that we might want to do some operation on the opensearch cluster beforehand to be nice (cc cwhite)
[13:47:58] godog: ok thanks, if it seems like it will be a problem let me know, it's possible to delay/postpone if needed
[13:48:29] thanks for the ping. topranks, remind me what time the maint will begin?
[13:48:48] scheduled for 15:00 UTC, so 72 minutes from now
[13:49:52] no trouble at all. I'll make the necessary arrangements
[13:49:54] thanks!
[13:49:58] cwhite: super, thanks!
[13:50:03] thank you cwhite topranks
[13:50:45] FYI we also have logstash1037 and titan1001 coming up on Thur Jul 11th
[13:51:12] https://phabricator.wikimedia.org/T365996
[13:51:27] cwhite: although we may need to postpone today's work, I will let you know
[13:51:34] for now I guess take no action?
[13:52:54] ok, no problem either way. thanks for letting us know :)
[13:54:50] indeed, thank you
[13:55:09] sorry guys yeah, we're gonna postpone today's work until next Wednesday July 10th at the same time
[13:58:15] ok np topranks
[14:23:55] o/, looks like centrallog1001's NIC is getting a tad hot during spikes and I'm about to make it slightly worse, not urgent but I figured I should mention, do you want a ticket? https://grafana-rw.wikimedia.org/d/000000377/host-overview?forceLogin=&from=1719325279278&orgId=1&to=1719930079278&var-cluster=syslog&var-datasource=thanos&var-server=centrallog1002&refresh=5m&viewPanel=11
[14:24:09] o/ when migrating an icinga alert based on monitoring::graphite_threshold that uses separate warning/critical thresholds (e.g. https://gerrit.wikimedia.org/g/operations/puppet/+/b1bf8255e7b8ba9efcf27641530e54ee4dc6e7c5/modules/icinga/manifests/monitor/elasticsearch/cirrus_cluster_checks.pp#44)
[14:24:46] can we do something similar with alertmanager? or does it need 2 separate alerts?
[14:26:27] kamila_: thank you for the heads up, unless it goes up to full saturation I think we're good
[14:26:31] dcausse: checking
[14:26:39] godog: ok, thanks :-)
[14:27:18] (I'm not adding more volume, just more batching, so I won't saturate it)
[14:27:52] dcausse: yeah, two separate alerts (same name though) is how we do it, personally I'd recommend ditching the warning and keeping only the critical
[14:28:09] or at least start with only a critical one, and add back the warning later
[14:28:17] godog: sounds good, thanks!
[14:28:43] sure np dcausse, thanks for reaching out
[14:39:25] FIRING: SystemdUnitFailed: burrow-jumbo-eqiad.service on kafkamon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:45] huh Jul 02 14:34:21 kafkamon1003 Burrow[2449]: fatal error: runtime: out of memory
[14:44:57] bounced the service just now
[14:49:25] RESOLVED: SystemdUnitFailed: burrow-jumbo-eqiad.service on kafkamon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:26:28] when submitting a TimingMetric from MW StatsFactory, is there anything I should do to get the related _bucket series? context is mediawiki_cirrus_search_request_time (https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/44639a2b42af8b04903c94c20c157f563acca211/includes/ElasticsearchIntermediary.php#277) but I'm not finding it in thanos or the grafana explore view
[16:29:41] Something's not right. I see quantiles, but no histograms.
[16:33:47] this metric is quite new, following the train this week and thus only emitted from group0 at the moment
[16:35:05] That shouldn't be a problem. Statsd-exporter might not be configured correctly in k8s though.
[16:37:12] cwhite: thanks for looking into it! please let me know if you want me to file a ticket for this
[16:54:12] I'll file one. It's clear to me now that statsd-exporter is not configured to look at its configuration.
[16:55:29] cwhite: thanks!
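
Re the [14:24:09]–[14:28:09] exchange: a minimal sketch, under assumed names and values, of the "two separate alerts (same name though)" pattern godog describes: two Prometheus alerting rules that share one alert name but carry different thresholds and severity labels, so Alertmanager routing (typically on the severity label) can treat them differently. The metric name, expressions, and thresholds below are placeholders, not the actual CirrusSearch rules.

```yaml
groups:
  - name: cirrussearch_example
    rules:
      # Hypothetical warning-level rule; metric name and threshold are placeholders.
      - alert: CirrusSearchRequestLatencyHigh
        expr: cirrussearch_request_latency_p95_seconds > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CirrusSearch p95 latency above the warning threshold"
      # Same alert name, higher threshold, critical severity; Alertmanager routes
      # on the severity label to decide where each notification goes.
      - alert: CirrusSearchRequestLatencyHigh
        expr: cirrussearch_request_latency_p95_seconds > 2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "CirrusSearch p95 latency above the critical threshold"
```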
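Re the [16:26:28]–[16:54:12] thread: one plausible reading of "quantiles, but no histograms" is that statsd_exporter exposes timers as summaries unless a mapping rule asks for a histogram, and only a histogram produces _bucket series. Below is a minimal statsd_exporter mapping sketch; the statsd metric path, Prometheus metric name, and bucket boundaries are assumptions for illustration and not the actual MediaWiki-on-k8s configuration.

```yaml
# statsd_exporter mapping config sketch; paths and buckets are assumptions.
mappings:
  - match: "mediawiki.cirrussearch.search_request_time"
    name: "mediawiki_cirrus_search_request_time"
    # Without an explicit observer_type, timers default to a summary
    # (quantiles only), which yields no _bucket series.
    observer_type: histogram
    histogram_options:
      buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
```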