[12:41:49] Thanks!
[15:02:41] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:32:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:34:35] FIRING: [2x] ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[17:39:36] RESOLVED: [2x] ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[17:48:35] FIRING: [2x] ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[17:53:35] RESOLVED: [2x] ThanosSidecarNoConnectionToStartedPrometheus: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[19:19:46] i think grafana is down
[19:20:00] it's also possible i gave it a query of death somehow, sorry if so
[19:20:28] yeah slow to load for me...
[19:20:41] cdanis: taking a look.
[19:21:10] ssh slow as well hmm
[19:22:26] just reported the same, ack. it appears ultra busy
[19:23:16] maybe reboot via mgmt is easiest/quickest
[19:23:24] just got a shell, heavy swapping yeah
[19:23:36] going to reboot via ganeti
[19:23:58] Rebooting SGTM.
[19:24:33] is it a 4GB instance?
[19:25:00] cdanis: yes 4G, spared no expense on the ram!
[19:25:55] https://grafana.wikimedia.org back up 😅
[19:27:02] kinda wondering if we should have swap enabled on that host anyway
[19:27:17] it would have been waaay better if it had gotten oom-killed instead of doing that
[19:27:27] replicated network swap? what could possibly go wrong
[19:27:31] yeah agree
[19:27:32] yeah also, that
[19:30:33] swapoff'd manually, and we can think about adding some ram too if it happens again
[19:31:10] herron: Should we disable swap in the Puppet config too?
[19:31:51] I disabled it and ran puppet afterwards and there was no change, I think it's ok
[19:32:39] I suspect swap or not is an artifact of the partman recipe
[19:32:53] Ah, that makes sense.
[19:33:00] Thanks.
[19:35:11] I haven't managed to re-break it doing what I was doing before, at least
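The swap discussion ends with swap turned off by hand and a Puppet run that left it off. The following is a minimal sketch for confirming that state on a Linux host. It assumes only the standard /proc/swaps and /proc/meminfo interfaces. It is not Wikimedia tooling, and the function names (active_swap_devices, swap_used_kib) are hypothetical.

```python
from pathlib import Path


def active_swap_devices(swaps_text: str) -> list[str]:
    """Return the swap areas listed in /proc/swaps-formatted text.

    The first line of /proc/swaps is a header
    ("Filename Type Size Used Priority"); every following
    non-empty line describes one active swap area.
    """
    lines = swaps_text.strip().splitlines()
    return [line.split()[0] for line in lines[1:] if line.strip()]


def swap_used_kib(meminfo_text: str) -> int:
    """Compute swap in use (in kB) from /proc/meminfo-formatted text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # Values look like "975868 kB"; keep only the number.
            fields[key.strip()] = int(parts[0])
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


def main() -> None:
    swaps_path = Path("/proc/swaps")
    meminfo_path = Path("/proc/meminfo")
    # Only meaningful on Linux; do nothing elsewhere.
    if not (swaps_path.exists() and meminfo_path.exists()):
        return
    devices = active_swap_devices(swaps_path.read_text())
    used = swap_used_kib(meminfo_path.read_text())
    if devices:
        print(f"swap still enabled on: {', '.join(devices)} ({used} kB in use)")
    else:
        print("no active swap")


if __name__ == "__main__":
    main()
```

A manual swapoff does not persist across a reboot if the swap partition is still listed in /etc/fstab. If this check shows swap again after the next reboot, that would be consistent with the suspicion above that swap comes from the partman recipe rather than from Puppet.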