[12:32:41] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:37:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:08:19] godog was looking at adding hostname to ProbeDown alerts ( re: https://phabricator.wikimedia.org/T356140#9527892 ). Do you have any more feedback/examples on how to do this?
[13:11:25] inflatador: not really, sorry, making such things easier is one of the things we'll be doing this quarter
[13:12:36] godog np. /me did a lot of transactional/ticket routing optimization stuff in a past life, so if I can help in any way feel free to reach out
[13:12:50] for sure, thank you inflatador
[13:43:10] folks, logstash1036 is in eqiad rack E1 where we're doing a switch upgrade later today
[13:43:26] https://phabricator.wikimedia.org/T365993
[13:43:44] the JunOS upgrade is going to mean ~20 mins downtime, is that likely to be an issue?
[13:43:54] (sry should have touched base earlier on this)
[13:47:15] topranks: afaik not a problem no, my recollection is that we might want to do some operation on the opensearch cluster beforehand to be nice (cc cwhite)
[13:47:58] godog: ok thanks, if it seems like it will be a problem let me know, it's possible to delay/postpone if needed
[13:48:29] thanks for the ping. topranks, remind me what time the maint will begin?
[13:48:48] scheduled for 15:00 UTC, so 72 minutes from now
[13:49:52] no trouble at all. I'll make the necessary arrangements
[13:49:54] thanks!
[13:49:58] cwhite: super, thanks!
[13:50:03] thank you cwhite topranks
[13:50:45] FYI we also have logstash1037 and titan1001 coming up on Thur Jul 11th
[13:51:12] https://phabricator.wikimedia.org/T365996
[13:51:27] cwhite: although we may need to postpone today's work, I will let you know
[13:51:34] for now I guess take no action?
[13:52:54] ok, no problem either way. thanks for letting us know :)
[13:54:50] indeed, thank you
[13:55:09] sorry guys yeah, we're gonna postpone today's work until next Wednesday July 10th at the same time
[13:58:15] ok np topranks
[14:23:55] o/, looks like centrallog1001's NIC is getting a tad hot during spikes and I'm about to make it slightly worse, not urgent but I figured I should mention, do you want a ticket? https://grafana-rw.wikimedia.org/d/000000377/host-overview?forceLogin=&from=1719325279278&orgId=1&to=1719930079278&var-cluster=syslog&var-datasource=thanos&var-server=centrallog1002&refresh=5m&viewPanel=11
[14:24:09] o/ when migrating an icinga alert based on monitoring::graphite_threshold that uses separate warning/critical thresholds (e.g. https://gerrit.wikimedia.org/g/operations/puppet/+/b1bf8255e7b8ba9efcf27641530e54ee4dc6e7c5/modules/icinga/manifests/monitor/elasticsearch/cirrus_cluster_checks.pp#44)
[14:24:46] can we do something similar with alertmanager? or does it need 2 separate alerts?
[14:26:27] kamila_: thank you for the heads up, unless it goes up to full saturation I think we're good
[14:26:31] dcausse: checking
[14:26:39] godog: ok, thanks :-)
[14:27:18] (I'm not adding more volume, just more batching, so I won't saturate it)
[14:27:52] dcausse: yeah, two separate alerts (same name though) is how we do it, personally I'd recommend ditching the warning and keeping only the critical
[14:28:09] or at least start with only a critical one, and add back the warning later
[14:28:17] godog: sounds good, thanks!
[14:28:43] sure np dcausse, thanks for reaching out
[14:39:25] FIRING: SystemdUnitFailed: burrow-jumbo-eqiad.service on kafkamon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:45] huh Jul 02 14:34:21 kafkamon1003 Burrow[2449]: fatal error: runtime: out of memory
[14:44:57] bounced the service just now
[14:49:25] RESOLVED: SystemdUnitFailed: burrow-jumbo-eqiad.service on kafkamon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:26:28] when submitting a TimingMetric from MW StatsFactory, is there anything I should do to get the related _bucket series? context is mediawiki_cirrus_search_request_time (https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/44639a2b42af8b04903c94c20c157f563acca211/includes/ElasticsearchIntermediary.php#277) but I'm not finding it in thanos or the grafana explore view
[16:29:41] Something's not right. I see quantiles, but no histograms.
[16:33:47] this metric is quite new, following the train this week and thus only emitted from group0 at the moment
[16:35:05] That shouldn't be a problem. Statsd-exporter might not be configured correctly in k8s though.
[16:37:12] cwhite: thanks for looking into it! please let me know if you want me to file a ticket for this
[16:54:12] I'll file one. It's clear to me now that statsd-exporter is not configured to look at its configuration.
[16:55:29] cwhite: thanks!
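
Re the [14:24:09]–[14:28:09] exchange: a minimal sketch, under assumed names and values, of the "two separate alerts (same name though)" pattern godog describes: two Prometheus alerting rules that share one alert name but carry different thresholds and severity labels, so Alertmanager routing (typically on the severity label) can treat them differently. The metric name, expressions, and thresholds below are placeholders, not the actual CirrusSearch rules.

```yaml
groups:
  - name: cirrussearch_example
    rules:
      # Hypothetical warning-level rule; metric name and threshold are placeholders.
      - alert: CirrusSearchRequestLatencyHigh
        expr: cirrussearch_request_latency_p95_seconds > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CirrusSearch p95 latency above the warning threshold"
      # Same alert name, higher threshold, critical severity; Alertmanager routes
      # on the severity label to decide where each notification goes.
      - alert: CirrusSearchRequestLatencyHigh
        expr: cirrussearch_request_latency_p95_seconds > 2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "CirrusSearch p95 latency above the critical threshold"
```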
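Re the [16:26:28]–[16:54:12] thread: one plausible reading of "quantiles, but no histograms" is that statsd_exporter exposes timers as summaries unless a mapping rule asks for a histogram, and only a histogram produces _bucket series. Below is a minimal statsd_exporter mapping sketch; the statsd metric path, Prometheus metric name, and bucket boundaries are assumptions for illustration and not the actual MediaWiki-on-k8s configuration.

```yaml
# statsd_exporter mapping config sketch; paths and buckets are assumptions.
mappings:
  - match: "mediawiki.cirrussearch.search_request_time"
    name: "mediawiki_cirrus_search_request_time"
    # Without an explicit observer_type, timers default to a summary
    # (quantiles only), which yields no _bucket series.
    observer_type: histogram
    histogram_options:
      buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
```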