[02:23:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest_live - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest_live - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[02:33:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest_live - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest_live - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[13:26:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest_live - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest_live - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[13:41:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest_live - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest_live - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[14:04:48] FIRING: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:14:49] RESOLVED: PuppetFailure: Puppet has failed on logging-hd2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:23:32] Hey 0lly, we had a problem yesterday with IRC alert routing (T402418), what's the best way to troubleshoot that? I've been poking on alert1002 but don't really know what to look for
[17:23:32] T402418: Figure out why some alerts aren't making it to #wikimedia-data-platform-alerts IRC - https://phabricator.wikimedia.org/T402418
[17:28:58] inflatador_: The alert LGTM, my first guess would be that the bot could've been down at the time?
[17:36:06] denisse ACK, thanks for taking a look...it does look like one of the services on `alert1002` was restarted fairly recently
[17:36:15] `for n in $(systemctl list-units | grep irc | grep service | awk '{print $1}'); do systemctl status ${n} | grep since; done`
[17:36:25] but I have no real idea, I'm just throwin' stuff out there ;P
[20:54:35] FIRING: [2x] DiskSpace: Disk space prometheus1007:9100:/srv/prometheus/k8s-dse 0.7785% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:39:35] RESOLVED: [2x] DiskSpace: Disk space prometheus1007:9100:/srv/prometheus/k8s-dse 0.7025% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace