[00:16:47] The network probes issue seems to have resolved around 17:07Z, coinciding with the idp Tomcat restart.
[00:52:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[00:57:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:19:29] hey, people, I am observing some weird behaviour
[08:20:14] I see prometheus100[56] attempting to connect to tcp ganeti1017:1811, but that port is not open
[08:20:44] https://logstash.wikimedia.org/goto/6101c60e2e292ab1520cd9091b534c42
[08:21:26] This was causing probe alerts - but I see those are not happening anymore (?)
[08:24:23] maybe this is not obs, but a service owner's misconfiguration
[08:24:40] but maybe you can give me some pointers
[08:25:48] jynus: https://phabricator.wikimedia.org/rOPUPde21a79eedbba78093a37d71f9574aa44a53029a it's being decommissioned or worked on by moritzm
[08:25:57] I see
[08:26:09] same for cassandra: https://phabricator.wikimedia.org/T380236
[08:26:16] (restbase)
[08:26:55] I'm going to downtime those probes for a few hours
[08:29:09] ganeti1017 is some alert spam which happens if a Ganeti node is removed from the active cluster(s) for decom; they resolve within a few minutes
[08:30:31] ok, no worries. I was just checking those because yesterday other probes had worse consequences
[08:30:58] So I thought they could be false alerts
[13:20:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[14:30:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
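For reference, the kind of failing probe discussed at 08:20 (Prometheus connecting to a closed TCP port) can be confirmed directly against the Prometheus HTTP query API; probe_success is the standard blackbox-exporter metric. The sketch below is illustrative only: the prometheus.example.org endpoint and the instance pattern are placeholder assumptions, not the actual production hosts.

import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.example.org:9090"  # placeholder endpoint, not the real host

def failing_probes(instance_pattern: str) -> list:
    """Return label sets of blackbox probes that are currently failing (probe_success == 0)."""
    query = 'probe_success{instance=~"%s"} == 0' % instance_pattern
    url = PROMETHEUS + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)
    return [series["metric"] for series in result["data"]["result"]]

# Example: list failing probes for the host discussed above.
for labels in failing_probes("ganeti1017.*"):
    print(labels.get("instance"), labels.get("module"))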
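Likewise, "downtiming those probes for a few hours" (08:26:55) amounts to creating a temporary silence in Alertmanager. A minimal sketch against the standard Alertmanager v2 silences API follows; the alertmanager.example.org endpoint, the ProbeDown alert name, and the matcher labels are assumptions for illustration, not the exact production values.

import datetime
import json
import urllib.request

ALERTMANAGER = "http://alertmanager.example.org:9093"  # placeholder endpoint, not the real host

def silence_probes(instance_pattern: str, hours: int, author: str, comment: str) -> str:
    """Create a temporary Alertmanager silence for probe alerts matching the given instances."""
    now = datetime.datetime.now(datetime.timezone.utc)
    payload = {
        "matchers": [
            {"name": "alertname", "value": "ProbeDown", "isRegex": False},    # assumed alert name
            {"name": "instance", "value": instance_pattern, "isRegex": True}, # assumed label
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + datetime.timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }
    req = urllib.request.Request(
        ALERTMANAGER + "/api/v2/silences",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["silenceID"]

# Example: silence probe alerts for the hosts discussed above for a few hours.
print(silence_probes("ganeti1017.*", hours=4, author="jynus", comment="decom in progress / T380236"))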