[00:02:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[00:02:52] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[00:06:41] (PrometheusRuleEvaluationFailures) firing: (9) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[00:06:52] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[00:07:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (13) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[00:11:41] (PrometheusRuleEvaluationFailures) firing: (37) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[00:12:41] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[00:15:15] The 'ThanosSidecarNoConnectionToStartedPrometheus' alerts should be resolved now. I had to manually stop the Prometheus processes on prometheus1005 after 'systemctl stop prometheus@k8s' was unresponsive.
[00:16:09] Then I backed up the corrupted WAL so Prometheus could start with a fresh one. I'm taking a look at the 'PrometheusRuleEvaluationFailures' alerts now.
[00:23:05] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (10) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[00:27:50] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: (10) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
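(Editor's note: a minimal sketch of the WAL reset described at 00:15-00:16 above, not taken verbatim from the log. The instance name and the /srv/prometheus/k8s/metrics data layout are assumptions.)

    # stop the affected Prometheus instance; if the unit hangs, kill the process as a last resort
    systemctl stop prometheus@k8s
    pkill -f 'prometheus.*k8s'
    # set the corrupted WAL aside so Prometheus starts with a fresh one, keeping a copy for inspection
    mv /srv/prometheus/k8s/metrics/wal /srv/prometheus/k8s/metrics/wal.bak-$(date +%F)
    systemctl start prometheus@k8s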
[00:33:11] (PrometheusRuleEvaluationFailures) resolved: (37) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[00:39:08] The issues should be resolved now; I'll keep monitoring to see if the alerts trigger again.
[00:39:22] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[00:39:33] Additionally, while running the Puppet agent on the prometheus2* hosts I noticed this warning message: Warning: The directory '/srv/prometheus' contains 16904 entries, which exceeds the default soft limit 1000 and may cause excessive resource consumption and degraded performance. To remove this warning set a value for `max_files` parameter or consider using an alternate method to manage large directory trees
[00:45:18] One important thing to note: because of the WAL backup and the stop/start of the Prometheus services, we lost data for about 11 minutes, from 2024-03-21 00:04:43 to 00:15:56 UTC.
[00:45:33] Aside from that, the graphs and services look healthy now.
[09:45:05] hello, I have filed https://phabricator.wikimedia.org/T360595 because Elasticsearch on logging-logstash-02.logging.eqiad1.wikimedia.cloud yields a different response than expected by a deployment script (the Elasticsearch response lacks an "aggregate" field)
[09:45:10] the host is logging-logstash-02.logging.eqiad1.wikimedia.cloud
[09:45:27] and that currently blocks the automatic updates of the beta cluster, so I have marked the task as Unbreak Now
[09:45:38] it started failing on March 12 at ~ 00:46 UTC
[09:45:42] reason unknown :/
[09:55:08] hashar: the _shards section is suspicious (total:0)
[09:55:45] apparently there is nothing in there
[09:56:00] and looking over Kibana via https://beta-logs.wmcloud.org/ , there is no log showing up
[09:56:10] so maybe the backend ES got wiped last night
[09:56:18] and / or the beta cluster does not emit anything to it
[10:23:51] thank you de.nisse for taking a look at prometheus, I'm also looking at what was going on
[10:28:40] hashar: I don't know very much about that setup unfortunately, I can take a look this afternoon though
[11:42:14] hashar, godog: disk is full and logs aren't being ingested. I suspect SSL on the Kafka hosts in deployment-prep but I can't look closer right now.
[11:43:56] cwhite: <3 <3 thank you! that's a good hint and starting point, I'll take a look after lunch
[12:31:30] Cole is the hero of the day :]
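(Editor's note: a rough sketch of the kind of checks behind the "no logs showing up" / "disk is full" findings above; the local Elasticsearch port 9200 and running the commands on the logstash host itself are assumptions.)

    # root filesystem usage on the suspect host
    df -h /
    # overall Elasticsearch cluster state
    curl -s 'http://localhost:9200/_cluster/health?pretty'
    # biggest indices first, to spot what is eating the disk
    curl -s 'http://localhost:9200/_cat/indices?v&s=store.size:desc' | head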
[12:33:30] Warning: Permanently added 'deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud' (ED25519) to the list of known hosts.
[12:33:30] /dev/sda1 20G 19G 339M 99% /
[12:33:40] because of course it does not have a `/srv`
[12:37:07] Mar 21 12:36:45 deployment-kafka-logging01 kafka-server-start[9710]: [2024-03-21 12:36:45,778] ERROR [Controller id=1001, targetBrokerId=1001] Connection to node 1001 failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)
[12:37:07] Mar 21 12:36:45 deployment-kafka-logging01 kafka-server-start[9710]: [2024-03-21 12:36:45,778] WARN Failed to send SSL Close message (org.apache.kafka.common.network.SslTransportLayer)
[12:37:09] it is magic
[12:54:30] and that host is missing some Debian packages
[12:55:23] kafka-kit and kafka-kit-prometheus-metricsfetcher
[15:12:57] I'm looking at fixing beta logs btw
[15:16:50] I'm nuking a bunch of old indices
[15:21:40] actually never mind, those are small and not the culprit
[15:24:49] but yeah, now on to discovering why the SSL certs for Kafka didn't get auto-renewed
[15:27:55] in prod the brokers need a manual restart to pick up the renewed certs on disk, maybe that's the case there too?
[15:29:32] I wish, there are no new certs I could find
[15:29:49] oh that's fun
[15:30:12] and moving /etc/kafka/ssl out of the way doesn't recreate the certs
[15:30:36] ok, I've put /etc/kafka/ssl back to the state it was in
[15:32:06] that's about the extent of the debugging I'm able to do rn
[15:32:16] did y'all merge the changes for the k8s metrics?
[15:32:50] claime: this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1012995?usp=dashboard yes
[15:33:12] I'm missing the last 10 minutes of https://grafana.wikimedia.org/goto/JqIxba1Iz?orgId=1
[15:33:17] But that may just be on my side
[15:33:36] checking
[15:37:00] the timing does line up
[15:37:24] Yeah I'm missing more or less all envoy telemetry right now, idk if it's due to the change, or if it's because the change needs a prom restart and it's not totally back up yet?
[15:38:34] the prometheus part should be done, in the sense that it is a reload
[15:38:58] I've been checking https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry and didn't see changes, what else is missing claime ?
[15:39:33] https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s
[15:40:21] doh, of course, what I pasted was for 'ops'
[15:40:22] my bad
[15:40:57] I'll revert
[15:41:20] ty <3
[15:41:54] there is currently all this work to replace the puppet CA with cfssl to create certs, fwiw
[15:42:09] was wondering for a sec if it could be related here
[15:43:08] also profile::prometheus::blackbox_exporter was deleted and recreated today
[15:43:17] and " # puppet agent certs exported in profile::prometheus::blackbox_exporter
[15:43:40] refers to /etc/prometheus/ssl/ though
[15:47:07] yeah, doubt that's it mutante
[15:47:21] so it looks like the deployment puppetmaster switched from 03 to 04
[15:47:39] I see the expired cert defined in cergen, but going to regenerate it...
[15:47:54] on deployment-puppetmaster04: -bash: cergen: command not found
[15:48:50] "cergen is currently only installed on puppetmaster1001 by means of the cergen Puppet class. Even building cergen for Buster proved to be challenging back then, as it needs python-networkx 1 and even back then needed python3-lib2to3"
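(Editor's note: a sketch of how the broker certificate expiry can be verified, in the spirit of the 15:24-15:47 investigation above; the certificate filename under /etc/kafka/ssl and the TLS listener port 9093 are assumptions.)

    # check the on-disk certificate's validity dates and subject
    openssl x509 -noout -enddate -subject -in /etc/kafka/ssl/deployment-kafka-logging01.crt.pem
    # or ask the running broker directly for the certificate it is actually serving
    echo | openssl s_client -connect localhost:9093 2>/dev/null | openssl x509 -noout -enddate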
[15:48:55] https://phabricator.wikimedia.org/T357750
[15:49:19] herron: buster or newer I guess
[15:49:48] thanks, hmm
[15:50:08] this is all today https://phabricator.wikimedia.org/T360598
[15:50:15] about kafka-main certs
[15:50:47] ah yeah, those are running shiny new cfssl certs
[15:51:00] I wish that were the case here too
[15:54:39] godog: it's back, ty
[15:55:03] claime: indeed, sorry about not catching/reverting earlier
[15:55:19] happens, no worries :)
[15:55:19] turns out I need to look at the right dashboards, and it was a little too good to be true
[15:55:22] lol
[15:55:39] Spend your day looking at dashboards and by the end you don't know which is the right one
[15:55:48] quite true
[15:57:42] that change alone reduced ingested samples/s by 60%, hence "too good to be true"
[16:06:35] and I goofed the change altogether: that's not how you pass query strings, so ok, fair enough, no envoy metrics were being scraped
[16:07:25] very (very) effective at limiting metrics
[16:12:06] yes indeed
[17:17:15] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:22:15] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:57:15] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:07:15] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:17:15] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:22:15] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
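(Editor's note: a sketch of inspecting the consumer lag behind the LogstashKafkaConsumerLag alerts above using the stock Kafka CLI; the broker address and plaintext port are assumptions, the group name comes from the alert.)

    # per-partition CURRENT-OFFSET, LOG-END-OFFSET and LAG for the logstash consumer group
    kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
      --describe --group logstash7-eqiad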