[00:02:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[00:02:52] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[00:06:41] (PrometheusRuleEvaluationFailures) firing: (9) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[00:06:52] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[00:07:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (13) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[00:11:41] (PrometheusRuleEvaluationFailures) firing: (37) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[00:12:41] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: (4) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[00:15:15] The 'ThanosSidecarNoConnectionToStartedPrometheus' alerts should be resolved now. I had to manually stop the Prometheus processes on prometheus1005 after 'systemctl stop prometheus@k8s' was unresponsive.
[00:16:09] Then I backed up the corrupted WAL so Prometheus could start with a fresh one. I'm taking a look at the 'PrometheusRuleEvaluationFailures' alerts now.
[00:23:05] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (10) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[00:27:50] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: (10) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
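(Editor's note: a minimal sketch of the WAL reset described at 00:15-00:16 above, not taken verbatim from the log. The instance name and the /srv/prometheus/k8s/metrics data layout are assumptions.)

    # stop the affected Prometheus instance; if the unit hangs, kill the process as a last resort
    systemctl stop prometheus@k8s
    pkill -f 'prometheus.*k8s'
    # set the corrupted WAL aside so Prometheus starts with a fresh one, keeping a copy for inspection
    mv /srv/prometheus/k8s/metrics/wal /srv/prometheus/k8s/metrics/wal.bak-$(date +%F)
    systemctl start prometheus@k8s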
[00:33:11] (PrometheusRuleEvaluationFailures) resolved: (37) Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[00:39:08] The issues should be resolved now; I'll keep monitoring to see if the alerts trigger again.
[00:39:22] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[00:39:33] Additionally, while running the Puppet agent on the prometheus2* hosts I noticed this warning message: Warning: The directory '/srv/prometheus' contains 16904 entries, which exceeds the default soft limit 1000 and may cause excessive resource consumption and degraded performance. To remove this warning set a value for `max_files` parameter or consider using an alternate method to manage large directory trees
[00:45:18] One important thing to note: because of the WAL backup and the stop/start of the Prometheus services, we lost data for about 11 minutes, from 2024-03-21 00:04:43 to 00:15:56 UTC.
[00:45:33] Aside from that, the graphs and services look healthy now.
[09:45:05] hello, I have filed https://phabricator.wikimedia.org/T360595 because Elasticsearch on logging-logstash-02.logging.eqiad1.wikimedia.cloud yields a different response than expected by a deployment script (the Elasticsearch response lacks an "aggregate" field)
[09:45:10] the host is logging-logstash-02.logging.eqiad1.wikimedia.cloud
[09:45:27] and that currently blocks the automatic updates of the beta cluster, so I have marked the task as Unbreak Now
[09:45:38] it started failing on March 12 at ~ 00:46 UTC
[09:45:42] reason unknown :/
[09:55:08] hashar: the _shards section is suspicious (total:0)
[09:55:45] apparently there is nothing in there
[09:56:00] and looking over Kibana via https://beta-logs.wmcloud.org/ , there is no log showing up
[09:56:10] so maybe the backend ES got wiped last night
[09:56:18] and / or the beta cluster does not emit anything to it
[10:23:51] thank you de.nisse for taking a look at prometheus, I'm also looking at what was going on
[10:28:40] hashar: I don't know very much about that setup unfortunately, I can take a look this afternoon though
[11:42:14] hashar, godog: disk is full and logs aren't being ingested. I suspect SSL on the Kafka hosts in deployment-prep but I can't look closer right now.
[11:43:56] cwhite: <3 <3 thank you! that's a good hint and starting point, I'll take a look after lunch
[12:31:30] Cole is the hero of the day :]
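(Editor's note: a rough sketch of the kind of checks behind the "no logs showing up" / "disk is full" findings above; the local Elasticsearch port 9200 and running the commands on the logstash host itself are assumptions.)

    # root filesystem usage on the suspect host
    df -h /
    # overall Elasticsearch cluster state
    curl -s 'http://localhost:9200/_cluster/health?pretty'
    # biggest indices first, to spot what is eating the disk
    curl -s 'http://localhost:9200/_cat/indices?v&s=store.size:desc' | head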
[12:33:30] Warning: Permanently added 'deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud' (ED25519) to the list of known hosts.
[12:33:30] /dev/sda1 20G 19G 339M 99% /
[12:33:40] because of course it does not have a `/srv`
[12:37:07] Mar 21 12:36:45 deployment-kafka-logging01 kafka-server-start[9710]: [2024-03-21 12:36:45,778] ERROR [Controller id=1001, targetBrokerId=1001] Connection to node 1001 failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)
[12:37:07] Mar 21 12:36:45 deployment-kafka-logging01 kafka-server-start[9710]: [2024-03-21 12:36:45,778] WARN Failed to send SSL Close message (org.apache.kafka.common.network.SslTransportLayer)
[12:37:09] it is magic
[12:54:30] and that host is missing some Debian packages
[12:55:23] kafka-kit and kafka-kit-prometheus-metricsfetcher
[15:12:57] I'm looking at fixing beta logs btw
[15:16:50] I'm nuking a bunch of old indices
[15:21:40] actually never mind, those are small and not the culprit
[15:24:49] but yeah, now on to discovering why the SSL certs for Kafka didn't get auto-renewed
[15:27:55] in prod the brokers need a manual restart to pick up the renewed certs on disk, maybe that's the case there too?
[15:29:32] I wish, there are no new certs I could find
[15:29:49] oh that's fun
[15:30:12] and moving /etc/kafka/ssl out of the way doesn't recreate the certs
[15:30:36] ok, I've put /etc/kafka/ssl back to the state it was in
[15:32:06] that's about the extent of the debugging I'm able to do rn
[15:32:16] did y'all merge the changes for the k8s metrics?
[15:32:50] claime: this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1012995?usp=dashboard yes
[15:33:12] I'm missing the last 10 minutes of https://grafana.wikimedia.org/goto/JqIxba1Iz?orgId=1
[15:33:17] But that may just be on my side
[15:33:36] checking
[15:37:00] the timing does line up
[15:37:24] Yeah I'm missing more or less all envoy telemetry right now, idk if it's due to the change, or if it's because the change needs a prom restart and it's not totally back up yet?
[15:38:34] the prometheus part should be done, in the sense that it is a reload
[15:38:58] I've been checking https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry and didn't see changes, what else is missing claime ?
[15:39:33] https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s
[15:40:21] doh, of course, what I pasted was for 'ops'
[15:40:22] my bad
[15:40:57] I'll revert
[15:41:20] ty <3
[15:41:54] there is currently all this work to replace the puppet CA with cfssl to create certs, fwiw
[15:42:09] was wondering for a sec if it could be related here
[15:43:08] also profile::prometheus::blackbox_exporter was deleted and recreated today
[15:43:17] and " # puppet agent certs exported in profile::prometheus::blackbox_exporter
[15:43:40] refers to /etc/prometheus/ssl/ though
[15:47:07] yeah, doubt that's it mutante
[15:47:21] so it looks like the deployment puppetmaster switched from 03 to 04
[15:47:39] I see the expired cert defined in cergen, but going to regenerate it...
[15:47:54] on deployment-puppetmaster04: -bash: cergen: command not found
[15:48:50] "cergen is currently only installed on puppetmaster1001 by means of the cergen Puppet class. Even building cergen for Buster proved to be challenging back then, as it needs python-networkx 1 and even back then needed python3-lib2to3"
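(Editor's note: a sketch of how the broker certificate expiry can be verified, in the spirit of the 15:24-15:47 investigation above; the certificate filename under /etc/kafka/ssl and the TLS listener port 9093 are assumptions.)

    # check the on-disk certificate's validity dates and subject
    openssl x509 -noout -enddate -subject -in /etc/kafka/ssl/deployment-kafka-logging01.crt.pem
    # or ask the running broker directly for the certificate it is actually serving
    echo | openssl s_client -connect localhost:9093 2>/dev/null | openssl x509 -noout -enddate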
[15:48:55] https://phabricator.wikimedia.org/T357750
[15:49:19] herron: buster or newer I guess
[15:49:48] thanks, hmm
[15:50:08] this is all today https://phabricator.wikimedia.org/T360598
[15:50:15] about kafka-main certs
[15:50:47] ah yeah, those are running shiny new cfssl certs
[15:51:00] I wish that were the case here too
[15:54:39] godog: it's back, ty
[15:55:03] claime: indeed, sorry about not catching/reverting earlier
[15:55:19] happens, no worries :)
[15:55:19] turns out I need to look at the right dashboards, and it was a little too good to be true
[15:55:22] lol
[15:55:39] Spend your day looking at dashboards and by the end you don't know which is the right one
[15:55:48] quite true
[15:57:42] that change alone reduced ingested samples/s by 60%, hence "too good to be true"
[16:06:35] and I goofed the change altogether: that's not how you pass query strings, so ok, fair enough, no envoy metrics were being scraped
[16:07:25] very (very) effective at limiting metrics
[16:12:06] yes indeed
[17:17:15] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:22:15] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[21:57:15] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:07:15] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:17:15] (LogstashKafkaConsumerLag) firing: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:22:15] (LogstashKafkaConsumerLag) resolved: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
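(Editor's note: a sketch of inspecting the consumer lag behind the LogstashKafkaConsumerLag alerts above using the stock Kafka CLI; the broker address and plaintext port are assumptions, the group name comes from the alert.)

    # per-partition CURRENT-OFFSET, LOG-END-OFFSET and LAG for the logstash consumer group
    kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
      --describe --group logstash7-eqiad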