[00:04:25] (SystemdUnitFailed) firing: prune_old_srv_syslog_directories.service on centrallog2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:25] (SystemdUnitFailed) firing: (2) prune_old_srv_syslog_directories.service on centrallog1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:48:19] ^ taking a look
[00:56:03] I'm also filing a task for this, as we've already seen this alert and it requires manual intervention.
[00:56:12] I have a fix in mind, I'll send a patch in a moment.
[04:21:48] (PuppetFailure) firing: Puppet has failed on logstash2026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:37:21] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:49:51] jayme: ^ likely related to the k8s audit logs, I'm guessing because the files are ingested from the start initially?
[07:50:22] godog: yeah... although it should not be much
[07:50:39] godog: it's below a megabyte of logs
[07:51:22] ack, ok, I'll dig around to confirm
[07:51:28] to confirm the lag is indeed the audit logs
[07:54:33] well, or a side effect of that anyways
[07:58:25] mmhh, looks like drop/webrequest from kubernetes_docker is doing a whole lot of filtering https://grafana.wikimedia.org/d/VCK8-FpZz/cwhite-logstash?orgId=1&refresh=1m&from=1712905078040&to=1712908678040&viewPanel=56
[07:58:37] so indeed related, but not exactly that, I'll investigate
[08:08:49] ah ok, got it, so the rsyslog.d file name change probably triggered rsyslog
[08:09:56] not 100% sure yet what triggered it, though
[08:10:34] anyways, other than investigating I don't think there's anything we can do; the backlog should be clearing though
[08:15:43] godog: maybe it's just a bunch of rsyslog restarts because of the file name change, causing some of the rsyslog instances to come out of their deadlock state
[08:16:20] quite possible jayme, yeah
[08:17:08] so I should do something like this more often :-p
[08:17:29] hide-the-pain-harold.gifv
[08:18:40] or we get your new team member to take a look at replacing all the k8s rsyslog mess with fluent-bit ;-)
[08:19:29] I recently read that it can even use the kubelet's API instead of the kube-apiserver to gather pod metadata - which is actually a pretty nice feature on its own
[08:20:29] interesting
[08:21:48] (PuppetFailure) firing: Puppet has failed on logstash2026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:34:32] godog: do you know if/when the mapping for new fields in opensearch happens (https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.04.12?id=1Ahr0Y4B1Aouzw__Iw0d)? Only on index roll-over?
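Regarding the mapping question just above: the current mapping of an index and the templates a future roll-over would apply can be compared directly via the OpenSearch API. A minimal sketch, assuming shell access to a host that can reach the cluster's HTTP endpoint (the localhost address is a placeholder, and auth/TLS handling is omitted):

```
# Mapping currently applied to the index holding the linked document:
curl -s 'http://localhost:9200/logstash-k8s-1-7.0.0-1-2024.04.12/_mapping?pretty'
# Templates that would be applied to a freshly rolled-over index:
curl -s 'http://localhost:9200/_index_template?pretty'
curl -s 'http://localhost:9200/_template?pretty'   # legacy-style templates, if that is what the cluster uses
```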
[08:52:21] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:53:27] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:58:27] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:00:11] jayme: yes indeed, it'll be tomorrow
[12:21:48] (PuppetFailure) firing: Puppet has failed on logstash2026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:06:48] (PuppetFailure) resolved: Puppet has failed on logstash2026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:37:21] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group benthos-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:38:27] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group benthos-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:47:01] I haven't written this up in a task yet, but ^^ may be a symptom of something I observed when we initially rolled out Benthos to handle webrequest drops. In short: adding benthos consumers drastically reduced the maximum rate of event consumption.
[16:10:37] Getting a puppet failure on datahubsearch1002, it doesn't want to update the opensearch fork of curator... has anyone seen this before? ref: https://phabricator.wikimedia.org/P60468
[16:13:00] `'5.8.5-1~wmf3+deb11u1' to '5.8.5-1~wmf4+deb11u1' failed`... per apt-browser, the latest version is `elasticsearch-curator: 5.8.5-1~wmf3+deb11u1`... wonder why it thinks there's a new one
[17:05:27] inflatador: We rolled out ~wmf4 to the opensearch2 component but not to opensearch1. There's no reason it can't go there too, though. I'll deploy the package.
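A minimal sketch of checking the curator package situation from the affected host; the package name is the one quoted above, the rest is plain apt, and it assumes the ~wmf4 build has been published to the host's repository component:

```
apt-cache policy elasticsearch-curator       # installed vs. candidate version, per repository component
sudo apt update                              # refresh package indices so the new candidate becomes visible
sudo apt-get install --only-upgrade elasticsearch-curator
```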
[17:06:55] should be fixed by a `sudo apt update` now, or given time :)
[19:04:08] cwhite: thanks for the update, looks like Puppet is working now
[21:58:27] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:12:21] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:22:21] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in logging-eqiad for group apifeatureusage - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
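For consumer-lag alerts like the ones above, the linked Grafana dashboard is the usual view; the same figures can also be pulled straight from the brokers. A minimal sketch using the stock Kafka CLI, with a placeholder broker address (the exact tooling or wrapper on the logging hosts may differ):

```
kafka-consumer-groups.sh --bootstrap-server BROKER:9092 \
  --describe --group apifeatureusage
# The LAG column per partition is the distance between the group's committed offset
# and the log end offset, i.e. how far behind the consumer is.
```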