[17:23:03] FIRING: ErrorBudgetBurn: logstash-availability eqiad - https://slo.wikimedia.org/?search=logstash-availability - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:26:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest_live - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest_live - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[17:41:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest_live - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest_live - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[17:48:03] RESOLVED: ErrorBudgetBurn: logstash-availability eqiad - https://slo.wikimedia.org/?search=logstash-availability - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[18:17:33] FIRING: ErrorBudgetBurn: logstash-availability eqiad - https://slo.wikimedia.org/?search=logstash-availability - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[18:30:54] Something's not quite adding up about that ErrorBudgetBurn alert. One of six logstash hosts wasn't ingesting, but we didn't experience a service outage.
[18:53:39] cwhite: I'm thinking we should wrap that recording rule expression in a sum by (site); current expr is on the top and proposed on the bottom: https://w.wiki/E3fW
[19:02:33] RESOLVED: ErrorBudgetBurn: logstash-availability eqiad - https://slo.wikimedia.org/?search=logstash-availability - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:04:10] herron: that sounds reasonable. Should be better than allowing a single host to hit the SLO so hard :)
[19:35:31] hi folks - is there anything amiss with grafana at the moment? folks are reporting that it's unavailable, and I'm getting an envoy 503 when attempting to access grafana.wm.o
[19:36:27] Hi swfrench-wmf, I recently updated it to Grafana v11. I tested and everything looked fine, but let me take a look again.
[19:36:42] ack, thank you!
[19:51:30] swfrench-wmf: Could you please try to access grafana again?
[19:51:44] denisse: up for me!
[19:51:53] swfrench-wmf: Thank you!
[19:51:58] out of curiosity, what was it? :)
[19:52:02] thank _you_!
[19:53:44] It seems like envoy was still trying to use the previous grafana version: after the Grafana upgrade the grafana server was restarted, but envoyproxy was not. Grafana introduced new headers in version 10 which envoy didn't know about until it was restarted.
[19:55:07] I tested grafana-rw and it works fine. Now both grafana hosts have v11.6.1.
[19:55:26] I'm puzzled as to why I didn't have to reboot envoy after upgrading the first host tho...
[19:55:31] restart*
[19:57:45] denisse: Looks like it's working, thanks for your help!
[19:58:08] inflatador: Awesome, thank you!
[20:01:15] swfrench-wmf: Just FYI, I'm still trying to figure out the root cause. Looking at the grafana logs I see there's an envoyproxy restarter service, so maybe it wasn't that envoy didn't restart. It seems like it did; I'm wondering if it could be related to the Grafana service not reloading correctly after the package upgrade. I'll keep you posted.
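(For reference, the recording-rule change proposed at 18:53 would have roughly the following shape. The real current and proposed expressions are in the linked paste at https://w.wiki/E3fW; the metric names below are placeholders, not the actual logstash-availability SLI, so this is only a sketch of the sum by (site) wrapping.)

    # current shape: the ratio is computed per host, so a single host that
    # stops ingesting reports ~0% availability and burns the error budget
    rate(logstash_events_ingested_total[5m])
      / rate(logstash_events_received_total[5m])

    # proposed shape: aggregate across hosts per site before taking the ratio,
    # so the SLI reflects the service as a whole rather than any one host
    sum by (site) (rate(logstash_events_ingested_total[5m]))
      / sum by (site) (rate(logstash_events_received_total[5m]))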
[20:01:21] > May 08 19:51:01 grafana1002 envoyproxy-hot-restarter
[20:04:25] FIRING: SystemdUnitFailed: stunnel4.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:04:48] ^ On it.
[20:07:25] It should resolve soon.
[20:09:25] RESOLVED: SystemdUnitFailed: stunnel4.service on grafana1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:22:16] Okay, so looking at the logs it seems like envoyproxy-hot-restarter didn't trigger on time after the grafana upgrade. Grafana was upgraded at `19:47:02` and performed migrations successfully, but the oldest envoyproxy log from that time period is `May 08 19:47:30 grafana1002 systemd[1]: Stopping envoyproxy.service - Envoy proxy...`, which is when I manually restarted the `envoyproxy` service.
[20:23:01] So that's why the issue happened. Now I'll investigate why `envoyproxy-hot-restarter` didn't trigger after the upgrade; it seems like it didn't detect it.
[21:41:40] interesting! I take it there was a configuration change that needed to be picked up by envoy in conjunction with the grafana upgrade?
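(A rough way to reconstruct that timeline on grafana1002, along the lines of the check described at 20:22. The unit names here are assumptions: envoyproxy.service appears in the journal excerpt above, and grafana-server.service is the stock unit name shipped with the Grafana package; adjust if the hosts use different names.)

    # compare when grafana and envoy (re)started around the upgrade window
    # (bare times are interpreted as "today"; add the date when run later)
    sudo journalctl -u grafana-server.service -u envoyproxy.service \
        --since "19:40" --until "20:10" -o short-precise

    # when did the running envoy instance last enter the active state?
    systemctl show envoyproxy.service -p ActiveEnterTimestamp

    # any hot-restarter activity logged in that window?
    sudo journalctl --since "19:40" --until "20:10" | grep -i hot-restarter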