[02:13:25] (SystemdUnitFailed) firing: vo-escalate.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:18:25] (SystemdUnitFailed) resolved: vo-escalate.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:58:08] (LogstashKafkaConsumerLag) firing: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:58:08] (LogstashKafkaConsumerLag) firing: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:27:26] (SystemdUnitFailed) firing: vo-escalate.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:32:25] (SystemdUnitFailed) resolved: vo-escalate.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:09:22] claime: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/997819/2/modules/profile/manifests/mediawiki/php/monitoring.pp my thought/rationale is basically that removing nrpe::monitor_systemd_unit_state is ok because that's basically overlapping with SystemdUnitFailed, different story is systemd::service / systemd::monitor which default to "no monitoring"
[10:10:01] hope that's a little more clear what I'm after
[10:15:00] ok
[10:15:05] thanks
[10:15:36] basically they'll be reported by the catchall prometheus alert now
[10:19:00] that's the idea yeah
[10:19:53] with some goodies on top of that like crashloop detection, which we'll be upgrading to critical too I think
[10:52:53] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:56:38] godog: We're going to have the same problem jayme pointed out with the mw and confd changes
[10:56:44] some of these hosts are still buster
[10:57:53] (LogstashKafkaConsumerLag) firing: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:00:41] claime: yeah I've updated the related task, tbh I don't think it is going to be a problem because crashloop detection isn't an icinga feature now
[11:00:46] also ... buster
[11:00:59] I agree
[11:01:03] (re: buster)
[11:01:49] but I'm not too keen on migrating mw nodes that are going to be re-imaged to k8s nodes over the next few months either
[11:02:15] yeah that makes total sense to me
[11:02:53] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:08:49] I've added "crashloop detection isn't an icinga feature" to the task as well to make it more prominent
[11:19:13] sorry for the message mess on gerrit godog, it posted my "waiting" comment when I ran PCC
[11:23:51] Ah right so we would just not benefit from the new crashloop detection on buster nodes, but the unit_state_failed alert will still work
[11:24:08] yes
[11:24:34] so might as well go ahead then
[11:28:32] yeah that's right claime
[12:06:25] (SystemdUnitFailed) firing: vo-escalate.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:11:25] (SystemdUnitFailed) resolved: vo-escalate.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:48] these are temp 504s from VO API btw, I'll file a task
[13:33:47] or timeouts
[15:34:26] godog: sorry to bug you do you know about kafka-logging?
[15:34:39] specifically kafka-logging2001 which is in network rack we are moving to new switches today
[15:56:56] lmata: perhaps you might know about the kafka-logging box? I should have reached out earlier apologies
[15:56:59] topranks: yup, was in meeting and afaik should tolerate a quick network blip
[15:57:05] cc herron ^
[15:57:16] ha ok thanks!
[15:57:25] (SystemdUnitFailed) firing: vo-escalate.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:57:26] they've been very quick thus far, just a few seconds
[15:57:28] Thanks topranks!
[15:57:53] topranks: yes ready to go!
[15:58:04] great stuff thanks guys :)
[16:02:25] (SystemdUnitFailed) resolved: (2) statograph_post.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:03:25] (SystemdUnitFailed) firing: statograph_post.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:07:40] (SystemdUnitFailed) firing: (2) statograph_post.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:12:17] > Log details prior to the request and when the request completes - Pre request log is made at debug level.
[16:12:33] don't exactly love the idea of enabling debug-level logging for production thanos but perhaps we need to
[16:22:40] (SystemdUnitFailed) resolved: statograph_post.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:23:41] error, success!
[16:40:35] yeah I'm in meetings now, will take a closer look tomorrow
[16:41:25] (SystemdUnitFailed) firing: rsync-loki-data.service Failed on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:46:25] (SystemdUnitFailed) resolved: rsync-loki-data.service Failed on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:57:55] (SystemdUnitFailed) firing: (2) grafana-loki.service Failed on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:57:55] (SystemdUnitFailed) firing: grafana-loki.service Failed on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed