[02:13:25] <jinxer-wm>	 (SystemdUnitFailed) firing: vo-escalate.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:18:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: vo-escalate.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:58:08] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:58:08] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:27:26] <jinxer-wm>	 (SystemdUnitFailed) firing: vo-escalate.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:32:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: vo-escalate.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:09:22] <godog>	 claime: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/997819/2/modules/profile/manifests/mediawiki/php/monitoring.pp my rationale is basically that removing nrpe::monitor_systemd_unit_state is ok because it overlaps with SystemdUnitFailed; systemd::service / systemd::monitor are a different story, since they default to "no monitoring"
[10:10:01] <godog>	 hope that makes it a little clearer what I'm after
[10:15:00] <claime>	 ok
[10:15:05] <claime>	 thanks
[10:15:36] <claime>	 basically they'll be reported by the catchall prometheus alert now
[10:19:00] <godog>	 that's the idea yeah
[10:19:53] <godog>	 with some goodies on top of that like crashloop detection, which we'll be upgrading to critical too I think
[10:52:53] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:56:38] <claime>	 godog: We're going to have the same problem jayme pointed out with the mw and confd changes
[10:56:44] <claime>	 some of these hosts are still buster
[10:57:53] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:00:41] <godog>	 claime: yeah I've updated the related task, tbh I don't think it is going to be a problem because crashloop detection isn't an icinga feature now
[11:00:46] <godog>	 also ... buster
[11:00:59] <claime>	 I agree
[11:01:03] <claime>	 (re: buster)
[11:01:49] <claime>	 but I'm not too keen on migrating mw nodes that are going to be re-imaged to k8s nodes over the next few months either
[11:02:15] <godog>	 yeah that makes total sense to me
[11:02:53] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:08:49] <jayme>	 I've added "crashloop detection isn't an icinga feature" to the task as well to make it more prominent
[11:19:13] <claime>	 sorry for the message mess on gerrit godog, it posted my "waiting" comment when I ran PCC
[11:23:51] <claime>	 Ah right so we would just not benefit from the new crashloop detection on buster nodes, but the unit_state_failed alert will still work
[11:24:08] <jayme>	 yes
[11:24:34] <claime>	 so might as well go ahead then
[11:28:32] <godog>	 yeah that's right claime 
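The distinction discussed above (a unit sitting in the failed state versus a unit that keeps restarting) can be illustrated with a minimal sketch. Everything in it is hypothetical and for illustration only: the UnitSample type, its fields, and the window/threshold values are assumptions, not the actual SystemdUnitFailed rule or the crashloop alert used in production.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class UnitSample:
    """One observation of a systemd unit (hypothetical shape)."""
    timestamp: float    # seconds since some epoch
    active_state: str   # e.g. "active", "failed", "activating"
    restart_count: int  # cumulative number of restarts seen so far


def is_failed(samples: Sequence[UnitSample]) -> bool:
    """Catch-all check: the most recent sample is in the failed state."""
    return bool(samples) and samples[-1].active_state == "failed"


def is_crashlooping(
    samples: Sequence[UnitSample],
    window_s: float = 600.0,  # assumed look-back window
    max_restarts: int = 3,    # assumed threshold
) -> bool:
    """Crashloop check: too many restarts inside the look-back window.

    A unit that restarts over and over may never stay in "failed" long
    enough to be caught by is_failed(), which is why this is a separate
    check. Samples are expected in chronological order.
    """
    if not samples:
        return False
    latest = samples[-1]
    in_window = [s for s in samples if latest.timestamp - s.timestamp <= window_s]
    restarts = in_window[-1].restart_count - in_window[0].restart_count
    return restarts >= max_restarts


# A unit that restarts every minute but always comes back up.
flapping = [UnitSample(60.0 * i, "active", i) for i in range(5)]
# A unit that failed once and stayed down.
dead = [UnitSample(0.0, "active", 0), UnitSample(60.0, "failed", 0)]
```

In this sketch the flapping unit is invisible to the failed-state check and only the restart-based check flags it, while the unit that stays down is caught by the failed-state check alone.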
[12:06:25] <jinxer-wm>	 (SystemdUnitFailed) firing: vo-escalate.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:11:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: vo-escalate.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:48] <godog>	 these are temp 504s from VO API btw, I'll file a task
[13:33:47] <godog>	 or timeouts
[15:34:26] <topranks>	 godog: sorry to bug you, do you know about kafka-logging?
[15:34:39] <topranks>	 specifically kafka-logging2001, which is in a network rack we are moving to new switches today
[15:56:56] <topranks>	 lmata: perhaps you might know about the kafka-logging box? I should have reached out earlier, apologies
[15:56:59] <godog>	 topranks: yup, was in a meeting, and afaik it should tolerate a quick network blip
[15:57:05] <godog>	 cc herron ^
[15:57:16] <topranks>	 ha ok thanks!
[15:57:25] <jinxer-wm>	 (SystemdUnitFailed) firing: vo-escalate.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:57:26] <topranks>	 they've been very quick thus far, just a few seconds 
[15:57:28] <lmata>	 Thanks topranks!
[15:57:53] <herron>	 topranks: yes ready to go!
[15:58:04] <topranks>	 great stuff thanks guys :)
[16:02:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) statograph_post.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:03:25] <jinxer-wm>	 (SystemdUnitFailed) firing: statograph_post.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:07:40] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) statograph_post.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:12:17] <cdanis>	 > Log details prior to the request and when the request completes - Pre request log is made at debug level.
[16:12:33] <cdanis>	 don't exactly love the idea of enabling debug-level logging for production thanos but perhaps we need to
[16:22:40] <jinxer-wm>	 (SystemdUnitFailed) resolved: statograph_post.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:23:41] <herron>	 error, success!
[16:40:35] <godog>	 yeah I'm in meetings now, will take a closer look tomorrow
[16:41:25] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-loki-data.service Failed on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:46:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: rsync-loki-data.service Failed on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:57:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) grafana-loki.service Failed on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:57:55] <jinxer-wm>	 (SystemdUnitFailed) firing: grafana-loki.service Failed on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed