[00:49:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:49:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:25] FIRING: SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:06:44] godog: I hope it makes sense like this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128798
[10:13:25] vgutierrez: neat, I'll take a closer look today
[10:54:25] FIRING: [2x] SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[13:46:59] taking a look at the alerts
[13:49:37] sure enough, mtail hogging the cpu again on centrallog2002 ;_;
[13:55:33] re: curator, looks like this has timed out POST http://127.0.0.1:9200/logstash-k8s-1-7.0.0-1-2025.03.16/_forcemerge?max_num_segments=1
[13:56:12] * cwhite kicks curator off again
[13:56:33] thank you
[13:56:58] in the meantime I'm opening a task to investigate if we really need mtail on centrallog, I suspect not
[13:59:25] RESOLVED: [2x] SystemdUnitFailed: curator_actions_cluster_wide.service on logging-sd1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:09:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[14:42:24] heads up that we merged a bad Puppet patch in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128547 . We're fixing it now, but FYI in case y'all see puppet errors on your opensearch hosts
[15:00:14] Looks like it's fixed. I ran puppet on logstash1033 and didn't see any errors... LMK if y'all see anything though
[17:23:52] FIRING: ThanosRuleHighRuleEvaluationFailures: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[17:28:52] RESOLVED: ThanosRuleHighRuleEvaluationFailures: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
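
Editor's note: the curator failure discussed at 13:55:33 was a forcemerge request against the local OpenSearch instance that timed out. A minimal sketch of re-issuing that same call with a longer client-side timeout is shown below, assuming the endpoint and index name copied verbatim from the log; the retry approach and the 3600-second timeout are illustrative assumptions, not the curator implementation itself.

    # Sketch only, not curator's own code: retry the forcemerge that timed out,
    # giving the merge a long client-side timeout instead of failing the run.
    import requests

    # Index and endpoint taken from the 13:55:33 message in the log above.
    index = "logstash-k8s-1-7.0.0-1-2025.03.16"
    url = f"http://127.0.0.1:9200/{index}/_forcemerge"

    # max_num_segments=1 matches the curator action; timeout value is an assumption.
    resp = requests.post(url, params={"max_num_segments": 1}, timeout=3600)
    resp.raise_for_status()
    print(resp.json())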