[08:20:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[08:25:35] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[13:04:40] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:00:01] FYI I've investigated the kafka lag above, see -sre
[16:02:01] Ben is taking a look
[16:03:33] Thanks, taking a look.
[16:34:55] Hello team, mutante and I are working on migrating the Prometheus hosts to discovery and we have a question regarding PoP hosts URLs.
[16:35:10] What would be the difference between https://prometheus-eqiad.wikimedia.org and https://prometheus-eqsin.wikimedia.org ?
[16:38:37] Is it necessary/useful for SREs to be able to access separate Prometheus URLs for each DC?
[16:45:19] denisse: the TLDR is yes, background is in https://phabricator.wikimedia.org/T301944
[16:45:56] Thanks, I'm taking a look.
[17:04:40] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:05:09] The alert is triggering again, I'm looking at the graphs.
[17:07:00] Looking at the graphs the lag is going down steadily, it may self resolve. I'll be monitoring it.
[17:27:23] The lag continues to decrease.
[17:44:53] created: https://phabricator.wikimedia.org/T363856
[19:23:13] looks like staging k8s logs aren't in logstash? https://logstash.wikimedia.org/goto/386d707856fd6f4d94043299c77c67d5 . I've tried several namespaces and they seem to disappear around 1745 UTC?
[19:23:24] going to ping in serviceops as well
[19:31:46] inflatador: logs may be delayed because of T363856
[19:31:47] T363856: datahub-mae-consumer producing logs at excessive rate - https://phabricator.wikimedia.org/T363856
[19:32:38] cwhite ACK, sounds like it's a known issue then
[20:09:30] cwhite my team (Data Platform SRE) owns datahub, if this is an emergency LMK. I can try and roust one of my Euro colleagues who knows more about the service
[20:11:37] actually it looks like btullis is already on it...he's uninstalled the service from staging
[20:12:07] inflatador: it looks like datahub-mae-consumer stopped misbehaving at 1610Z (~4h ago). we're ingesting the backlog still
[21:17:25] FIRING: SystemdUnitFailed: grafana-loki.service on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:17:42] ^ That's me, I'm working on it.
[21:30:22] except it seems to be so unrelated to the change that was merged.
[21:30:25] too many open files\nerror creating index client\ngithub.com/grafana/loki/pkg/storage/chunk/storage.NewStore
[22:59:55] RESOLVED: SystemdUnitFailed: grafana-loki.service on grafana2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed