[00:32:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[04:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[05:21:42] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:17:54] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:51:45] btullis: there is a long-running query (41 days) running on dbstore1003 (s7) - do you want me to intervene before the server explodes?
[07:52:12] it is from a research user
[08:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[09:19:34] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:23:14] I pinged D.E. on the other channel and killed the query
[09:23:20] with their permission
[09:32:45] Not sure how to respond to the swift_ring_manager thingy, though
[09:33:19] ^ godog may know more about that / expected issue maybe?
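The long-running-query intervention above (spot a 41-day query, get the owner's permission, then kill it) can be sketched roughly as follows. This is purely illustrative: the function names and the processlist snapshot are made up for this example, and in practice the data would come from `SHOW PROCESSLIST` or `information_schema.processlist` on dbstore1003, not from a hard-coded list.

```python
# Sketch of the long-running-query triage described in the log: given a
# snapshot of the server's processlist, flag active queries older than a
# threshold so an operator can decide whether to KILL them.
# All names and numbers here are illustrative, not a real WMF tool.

FORTY_ONE_DAYS = 41 * 24 * 3600

def long_running_queries(processlist, threshold_seconds=3600):
    """Return (id, user, age_seconds) for active queries older than the threshold."""
    flagged = []
    for row in processlist:
        # 'Sleep' threads are idle connections, not running queries.
        if row["Command"] == "Sleep":
            continue
        if row["Time"] >= threshold_seconds:
            flagged.append((row["Id"], row["User"], row["Time"]))
    return flagged

snapshot = [
    {"Id": 12, "User": "research", "Command": "Query", "Time": FORTY_ONE_DAYS},
    {"Id": 13, "User": "wikiadmin", "Command": "Query", "Time": 4},
    {"Id": 14, "User": "research", "Command": "Sleep", "Time": 90000},
]

for qid, user, age in long_running_queries(snapshot):
    # With the owner's permission, the operator would then run: KILL <qid>;
    print(f"KILL candidate: id={qid} user={user} age={age // 86400} days")
```

As in the log, the kill itself stays a human decision: the sketch only surfaces candidates, and the actual `KILL <id>` is issued manually after checking with the query's owner.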
[09:33:58] jynus: mmhh not really, Emperor would know I think
[09:34:03] ah
[09:34:29] I think I may ask both of you for more insight into swift, thanos, and other storage things
[09:34:51] at least to understand all the components we have for monitoring and storage
[09:35:23] as in a 101, because I feel like an idiot now
[09:36:58] easy to say but "don't", there's a lot of stuff going on!
[09:37:20] he he
[09:38:18] for example, the blackbox exporter was very simple but your presentation was useful to contextualize it
[09:39:03] so looking forward to more of those!
[09:40:14] hehe thank you, yeah definitely we should do more of those presos
[09:40:35] it failed 'cos swift-dispersion-report failed
[09:41:20] (though that now works OK, so...)
[09:53:50] Emperor: unrelated, I have https://gerrit.wikimedia.org/r/c/operations/puppet/+/806166 out for your eyes, sort-of urgent as centrallog is quickly filling up, I've added more context to the related task too
[09:54:25] ah, I was looking at the centrallog host now
[09:57:00] I thought at first it was due to some logging issues from aqs that happened yesterday, I can see it's not
[09:57:19] yeah
[10:09:42] sigh, I forgot the rsyslog-swift config is *after* sending to centrallog
[10:22:10] sorry if I am too bossy today; when all DBAs are out, I become hungry with power (aka worried about everything that can go wrong on my own)
[10:24:53] ok, fix is at https://gerrit.wikimedia.org/r/c/operations/puppet/+/806173
[10:24:59] gotta go to lunch, bbiab
[11:10:57] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:00:36] jynus mutante I am leaving db2078 stopped till the phab bug is confirmed fixed by the patch
[12:14:44] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[13:20:01] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:03:09] btullis: Thanks for all of your work on T306181
[15:03:09] T306181: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181
[15:05:23] Wrong channel. I'll switch to -analytics :D
[15:12:23] I will extend the m3 codfw downtime until Monday, just in case
[15:27:47] +1
[16:16:24] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[17:12:58] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:23:30] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:47:27] phab bug fixed, restarting replication
[18:48:02] (we can always roll back to the dump, but that now seems very unlikely)
[18:51:26] it should take a couple of hours to catch up - I have acked the 2 lag alerts for now
[18:56:33] +1
[18:56:42] thanks
[19:17:44] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:23:00] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:32:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[20:37:09] thanks jynus
[22:16:08] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
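The "couple of hours to catch up" estimate for db2078's replication restart is simple arithmetic: a replica that is N seconds behind and replays events R times faster than the primary produces them gains (R - 1) seconds of progress per wall-clock second. The numbers below are illustrative, not taken from the actual incident.

```python
# Back-of-the-envelope catch-up estimate for a restarted replica:
# gain per wall-clock second is (replay_speedup - 1), so the time to
# reach zero lag is lag / (replay_speedup - 1). Illustrative only.

def catchup_seconds(lag_seconds, replay_speedup):
    """Estimated wall-clock time for the replica to reach zero lag."""
    if replay_speedup <= 1:
        raise ValueError("replica never catches up unless it replays faster than 1x")
    return lag_seconds / (replay_speedup - 1)

# e.g. ~6h of accumulated lag, replaying at 4x primary speed -> ~2h,
# consistent with the "couple of hours" estimate in the log.
hours = catchup_seconds(6 * 3600, 4) / 3600
print(f"estimated catch-up: {hours:.1f}h")
```

This is why acking the two lag alerts for a bounded window (rather than downtiming indefinitely) is reasonable: the estimate gives an expected recovery time, and the alerts re-fire if the replica falls further behind instead.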