[00:32:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[04:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[05:21:42] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:17:54] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:51:45] btullis: there is a long-running query (41 days) running on dbstore1003 (s7) - do you want me to intervene before the server explodes?
[07:52:12] it is from a research user
[08:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[09:19:34] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:23:14] I pinged D.E. on the other channel and killed the query
[09:23:20] with their permission
[09:32:45] Not sure how to respond to the swift_ring_manager thingy, though
[09:33:19] ^ godog may know more about that / expected issue maybe?
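The long-running-query intervention above (spot a 41-day query, get the owner's permission, then kill it) can be sketched roughly as follows. This is purely illustrative: the function names and the processlist snapshot are made up for this example, and in practice the data would come from `SHOW PROCESSLIST` or `information_schema.processlist` on dbstore1003, not from a hard-coded list.

```python
# Sketch of the long-running-query triage described in the log: given a
# snapshot of the server's processlist, flag active queries older than a
# threshold so an operator can decide whether to KILL them.
# All names and numbers here are illustrative, not a real WMF tool.

FORTY_ONE_DAYS = 41 * 24 * 3600

def long_running_queries(processlist, threshold_seconds=3600):
    """Return (id, user, age_seconds) for active queries older than the threshold."""
    flagged = []
    for row in processlist:
        # 'Sleep' threads are idle connections, not running queries.
        if row["Command"] == "Sleep":
            continue
        if row["Time"] >= threshold_seconds:
            flagged.append((row["Id"], row["User"], row["Time"]))
    return flagged

snapshot = [
    {"Id": 12, "User": "research", "Command": "Query", "Time": FORTY_ONE_DAYS},
    {"Id": 13, "User": "wikiadmin", "Command": "Query", "Time": 4},
    {"Id": 14, "User": "research", "Command": "Sleep", "Time": 90000},
]

for qid, user, age in long_running_queries(snapshot):
    # With the owner's permission, the operator would then run: KILL <qid>;
    print(f"KILL candidate: id={qid} user={user} age={age // 86400} days")
```

As in the log, the kill itself stays a human decision: the sketch only surfaces candidates, and the actual `KILL <id>` is issued manually after checking with the query's owner.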
[09:33:58] jynus: mmhh not really, Emperor would know I think
[09:34:03] ah
[09:34:29] I think I may ask both of you for more insight into swift, thanos, and other storage things
[09:34:51] at least to understand all the components we have for monitoring and storage
[09:35:23] as in a 101, because I feel like an idiot now
[09:36:58] easy to say but "don't", there's a lot of stuff going on!
[09:37:20] he he
[09:38:18] for example, the blackbox exporter was very simple but your presentation was useful to contextualize it
[09:39:03] so looking forward to more of those!
[09:40:14] hehe thank you, yeah definitely we should do more of those presos
[09:40:35] it failed 'cos swift-dispersion-report failed
[09:41:20] (though that now works OK, so...)
[09:53:50] Emperor: unrelated, I have https://gerrit.wikimedia.org/r/c/operations/puppet/+/806166 out for your eyes, sort-of urgent as centrallog is quickly filling up, I've added more context to the related task too
[09:54:25] ah, I was looking at the centrallog host now
[09:57:00] I thought at first it was due to some logging issues from aqs that happened yesterday, I can see it's not
[09:57:19] yeah
[10:09:42] sigh, I forgot the rsyslog-swift config is *after* sending to centrallog
[10:22:10] sorry if I am too bossy today; when all DBAs are out, I become hungry with power (aka worried about everything that can go wrong on my own)
[10:24:53] ok, fix is at https://gerrit.wikimedia.org/r/c/operations/puppet/+/806173
[10:24:59] gotta go to lunch, bbiab
[11:10:57] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:00:36] jynus mutante I am leaving db2078 stopped till the phab bug is confirmed fixed by the patch
[12:14:44] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[13:20:01] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:03:09] btullis: Thanks for all of your work on T306181
[15:03:09] T306181: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181
[15:05:23] Wrong channel. I'll switch to -analytics :D
[15:12:23] I will extend the m3 codfw downtime until Monday, just in case
[15:27:47] +1
[16:16:24] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[17:12:58] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:23:30] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:47:27] phab bug fixed, restarting replication
[18:48:02] (we can always roll back to the dump, but that now seems very unlikely)
[18:51:26] it should take a couple of hours to catch up - I have acked the 2 lag alerts for now
[18:56:33] +1
[18:56:42] thanks
[19:17:44] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:23:00] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[20:37:09] thanks jynus
[22:16:08] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
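The "couple of hours to catch up" estimate for db2078's replication restart is simple arithmetic: a replica that is N seconds behind and replays events R times faster than the primary produces them gains (R - 1) seconds of progress per wall-clock second. The numbers below are illustrative, not taken from the actual incident.

```python
# Back-of-the-envelope catch-up estimate for a restarted replica:
# gain per wall-clock second is (replay_speedup - 1), so the time to
# reach zero lag is lag / (replay_speedup - 1). Illustrative only.

def catchup_seconds(lag_seconds, replay_speedup):
    """Estimated wall-clock time for the replica to reach zero lag."""
    if replay_speedup <= 1:
        raise ValueError("replica never catches up unless it replays faster than 1x")
    return lag_seconds / (replay_speedup - 1)

# e.g. ~6h of accumulated lag, replaying at 4x primary speed -> ~2h,
# consistent with the "couple of hours" estimate in the log.
hours = catchup_seconds(6 * 3600, 4) / 3600
print(f"estimated catch-up: {hours:.1f}h")
```

This is why acking the two lag alerts for a bounded window (rather than downtiming indefinitely) is reasonable: the estimate gives an expected recovery time, and the alerts re-fire if the replica falls further behind instead.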