[00:14:41] hey. I migrated Popups’ statsv metrics to use Prometheus via the new StatsFactory system. I updated reducer names (mediawiki_Popups_*), verified DogStatsD output locally with statsd-exporter, and added unit tests for the new metrics. Sanity check welcome. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/1147902
[00:51:24] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_corto.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:51:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_corto.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:51:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_corto.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:52:10] ^^ working on it
[08:55:35] no need!
[08:55:48] Borknagar
[08:55:56] sorry, wrong paste
[08:56:02] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148199
[08:56:43] ok, thank you moritzm
[09:02:41] good morning! Is it possible to know exactly when we stopped having data, now that this message has started to show up: problem = prometheus "ops" at http://127.0.0.1:9900/ops didn't have any series for "gnmi_bfd_peer_session_state" metric in the last 1w (that's in codfw)
[09:06:26] hey XioNoX, sure, I'll take a look. Could you tell me which log file you're seeing the message in?
[09:06:57] tappof: I receive it by email ` 1 alert for alertname=AlertLintProblem team=netops `
[09:06:59] XioNoX: I don't see any data in thanos via grafana explore in the last 90 days
[09:07:24] while I can see it for other DCs
[09:07:26] volans: yeah, that's what I'm trying to figure out, did we never have the data?
[09:07:42] ok ok XioNoX, thanks
[09:08:04] possibly not :)
[09:08:43] haha, "21.2R3-S2.9" and I have a note "Possibly only starting in Junos 22.3"
[09:08:57] XioNoX: https://w.wiki/EEVt
[09:09:11] yeah
[09:09:42] So we never received data for this metric from codfw
[09:09:59] I'm curious why I only started to get notifications recently, but it doesn't matter much
[09:10:29] tappof: another thing, the email says: https://www.irccloud.com/pastebin/yOG0lsTe/
[09:10:43] but it didn't open a task, only emails
[09:10:47] Did you add an alert for this metric recently?
[09:13:04] tappof: long time ago (in April) https://github.com/wikimedia/operations-alerts/commit/95c3eed95d38749b74e4442cc9ba87ba28243cb6
[09:18:35] tappof: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1148281
[09:18:39] XioNoX: From Thanos's point of view, the alert has been present since 2024-04-16 https://w.wiki/EEW8. Maybe we should check the notifications.
[09:27:58] tappof: thanks!
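
(A minimal sketch, in Python, of how one might answer the question above by asking the Prometheus/Thanos HTTP API for the last sample of gnmi_bfd_peer_session_state; the query endpoint URL, the lookback window, and the `site` label filter are illustrative assumptions, not the actual production values.)

```python
# Sketch: find when a metric last had samples, via the Prometheus HTTP API.
# Assumptions: THANOS_URL and the `site="codfw"` label are placeholders.
import datetime

import requests

THANOS_URL = "https://thanos-query.example.org"  # assumed query endpoint
METRIC = "gnmi_bfd_peer_session_state"
LOOKBACK_DAYS = 90

now = datetime.datetime.now(datetime.timezone.utc)
start = now - datetime.timedelta(days=LOOKBACK_DAYS)

# /api/v1/query_range returns the samples per series over the window;
# the timestamp of the newest sample tells us when data stopped arriving.
resp = requests.get(
    f"{THANOS_URL}/api/v1/query_range",
    params={
        "query": f'{METRIC}{{site="codfw"}}',  # label name is an assumption
        "start": start.timestamp(),
        "end": now.timestamp(),
        "step": "1h",
    },
    timeout=30,
)
resp.raise_for_status()
series_list = resp.json()["data"]["result"]

if not series_list:
    print(f"No series for {METRIC} in the last {LOOKBACK_DAYS} days")
else:
    last_ts = max(series["values"][-1][0] for series in series_list)
    last_seen = datetime.datetime.fromtimestamp(last_ts, datetime.timezone.utc)
    print(f"Last sample for {METRIC} seen at {last_seen.isoformat()}")
```

An empty result over the whole window is consistent with what Grafana Explore showed here, i.e. the metric never had data from codfw rather than data that stopped at some point.
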
[09:36:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_corto.service on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:15:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-mw-accesslog-sampler - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[12:20:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-sampler - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-mw-accesslog-sampler - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[12:25:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:32:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[12:42:40] RESOLVED: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[12:45:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:50:40] RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:36:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:36:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:29:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:32:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[19:37:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[19:39:40] RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:36:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed