[00:04:25] (SystemdUnitFailed) firing: prune_old_srv_syslog_directories.service on centrallog2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:25] (SystemdUnitFailed) firing: (2) prune_old_srv_syslog_directories.service on centrallog1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:48:19] ^ taking a look
[00:56:03] I'm also filing a task for this, as we've already seen this alert and it requires manual intervention.
[00:56:12] I have a fix in mind, I'll send a patch in a moment.
[04:21:48] (PuppetFailure) firing: Puppet has failed on logstash2026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:37:21] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:49:51] jayme: ^ likely related to the k8s audit logs, I'm guessing because the files are ingested from the start initially?
[07:50:22] godog: yeah... although it should not be much
[07:50:39] godog: it's below a megabyte of logs
[07:51:22] ack, ok, I'll dig around to confirm
[07:51:28] to confirm the lag is indeed the audit logs
[07:54:33] well, or a side effect of that anyways
[07:58:25] mmhh, looks like drop/webrequest from kubernetes_docker is doing a whole lot of filtering https://grafana.wikimedia.org/d/VCK8-FpZz/cwhite-logstash?orgId=1&refresh=1m&from=1712905078040&to=1712908678040&viewPanel=56
[07:58:37] so indeed related, but not exactly that, I'll investigate
[08:08:49] ah ok, got it, so the rsyslog.d file name change probably triggered rsyslog
[08:09:56] not 100% sure yet what triggered it, though
[08:10:34] anyways, other than investigating I don't think there's anything we can do; the backlog should be clearing though
[08:15:43] godog: maybe it's just a bunch of rsyslog restarts because of the file name change, causing some of the rsyslog instances to come out of their deadlock state
[08:16:20] quite possible jayme, yeah
[08:17:08] so I should do something like this more often :-p
[08:17:29] hide-the-pain-harold.gifv
[08:18:40] or we get your new team member to take a look at replacing all the k8s rsyslog mess with fluent-bit ;-)
[08:19:29] I recently read that it can even use the kubelet's API instead of the kube-apiserver to gather pod metadata - which is actually a pretty nice feature on its own
[08:20:29] interesting
[08:21:48] (PuppetFailure) firing: Puppet has failed on logstash2026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:34:32] godog: do you know if/when the mapping for new fields in opensearch happens (https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.04.12?id=1Ahr0Y4B1Aouzw__Iw0d)? Only on index roll-over?
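Regarding the mapping question just above: the current mapping of an index and the templates a future roll-over would apply can be compared directly via the OpenSearch API. A minimal sketch, assuming shell access to a host that can reach the cluster's HTTP endpoint (the localhost address is a placeholder, and auth/TLS handling is omitted):

```
# Mapping currently applied to the index holding the linked document:
curl -s 'http://localhost:9200/logstash-k8s-1-7.0.0-1-2024.04.12/_mapping?pretty'
# Templates that would be applied to a freshly rolled-over index:
curl -s 'http://localhost:9200/_index_template?pretty'
curl -s 'http://localhost:9200/_template?pretty'   # legacy-style templates, if that is what the cluster uses
```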
[08:52:21] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:53:27] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:58:27] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:00:11] jayme: yes indeed, it'll be tomorrow
[12:21:48] (PuppetFailure) firing: Puppet has failed on logstash2026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:06:48] (PuppetFailure) resolved: Puppet has failed on logstash2026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:37:21] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group benthos-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:38:27] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group benthos-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:47:01] I haven't written this up in a task yet, but ^^ may be a symptom of something I observed when we initially rolled out Benthos to handle webrequest drops. In short: adding benthos consumers drastically reduced the maximum rate of event consumption.
[16:10:37] Getting a puppet failure on datahubsearch1002, it doesn't want to update the opensearch fork of curator... has anyone seen this before? ref: https://phabricator.wikimedia.org/P60468
[16:13:00] `'5.8.5-1~wmf3+deb11u1' to '5.8.5-1~wmf4+deb11u1' failed`... per apt-browser, the latest version is `elasticsearch-curator: 5.8.5-1~wmf3+deb11u1`... wonder why it thinks there's a new one
[17:05:27] inflatador: We rolled out ~wmf4 to the opensearch2 component but not to opensearch1. There's no reason it can't go there too, though. I'll deploy the package.
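A minimal sketch of checking the curator package situation from the affected host; the package name is the one quoted above, the rest is plain apt, and it assumes the ~wmf4 build has been published to the host's repository component:

```
apt-cache policy elasticsearch-curator       # installed vs. candidate version, per repository component
sudo apt update                              # refresh package indices so the new candidate becomes visible
sudo apt-get install --only-upgrade elasticsearch-curator
```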
[17:06:55] should be fixed by a `sudo apt update` now, or given time :)
[19:04:08] cwhite: thanks for the update, looks like Puppet is working now
[21:58:27] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:12:21] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:22:21] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
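For consumer-lag alerts like the ones above, the linked Grafana dashboard is the usual view; the same figures can also be pulled straight from the brokers. A minimal sketch using the stock Kafka CLI, with a placeholder broker address (the exact tooling or wrapper on the logging hosts may differ):

```
kafka-consumer-groups.sh --bootstrap-server BROKER:9092 \
  --describe --group apifeatureusage
# The LAG column per partition is the distance between the group's committed offset
# and the log end offset, i.e. how far behind the consumer is.
```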