[00:11:52] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:32:32] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[01:22:00] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:18:36] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:14:12] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:32:32] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[07:10:59] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:59:57] Emperor godog what should we do with ms-be1039? It has an alert 25 days old; should we ACK it?
[08:00:40] 👀
[08:02:15] marostegui: fixed
[08:02:22] <3
[08:02:25] thanks!
[08:02:53] Emperor: if you can check this too at some point: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cumin1001&service=Ensure+hosts+are+not+performing+a+change+on+every+puppet+run
[08:03:12] some of the ms are involved
[08:11:59] Hm, I think that _may_ be a bug in rsync::server::module
[08:13:04] the changing hosts are the nodes that are not ring_manager, so they are passing absent to rsync::server::module's "ensure" parameter, but it's still installing an rsync server without /etc/rsync.conf, so systemd won't start it even though puppet keeps trying to
[08:13:38] I wonder if that's because rsync::server has ensure_service rather than ensure.
[08:20:09] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:23:43] I've asked the -foundations folks who actually know about puppet :)
[08:32:32] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[08:33:21] jynus_: you mentioned you created a ticket for ^ or should I?
[08:34:38] I didn't; I reported it to btullis and the analytics IRC chat when he was unavailable, got acknowledgment from him and someone else from the team, that's all
[08:35:11] they promised to look into it
[08:35:59] Thanks both. Looking into it now. I checked last time but all metrics appeared to be flowing, so perhaps I misunderstood the issue.
[08:36:43] jynus: ah cool! thank you
[08:36:46] btullis: thank you!
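
A minimal, hypothetical sketch of the shape Emperor describes at 08:11-08:13: a define that pulls in the server class unconditionally while only its own config file honours $ensure, and a server class that exposes only ensure_service. The class/define bodies below are illustrative assumptions, not the actual Wikimedia manifests (see T311066 later in the log).

```puppet
# Hypothetical, simplified manifests illustrating the suspected bug;
# names and parameters are illustrative, not the real Wikimedia puppet code.

class rsync::server (
  # Only the service *state* is parameterised; there is no top-level
  # $ensure that would keep the rsync server off a host entirely.
  String $ensure_service = 'running',
) {
  package { 'rsync':
    ensure => present,
  }

  service { 'rsync':
    ensure  => $ensure_service,
    enable  => true,
    require => Package['rsync'],
  }
}

define rsync::server::module (
  String           $ensure = 'present',
  Optional[String] $path   = undef,
) {
  # The parent class is pulled in unconditionally, so even callers that
  # pass ensure => absent still get the rsync package and service managed.
  include rsync::server

  # Only the module's config file honours $ensure, which would leave an
  # rsync service with no configuration for systemd to start.
  file { "/etc/rsyncd.d/${title}.conf":
    ensure  => $ensure,
    content => "[${title}]\n  path = ${path}\n",
  }
}
```

With that shape, hosts passing ensure => absent would still see Puppet retry starting a service that cannot start on every run, which would match the "Ensure hosts are not performing a change on every puppet run" Icinga check linked above.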
[08:55:25] looks like it restarted 3m29s ago
[09:05:49] Yes, I've restarted the prometheus-mysqld-exporter. Is there a SAL !log function in this channel? I logged it via #wikimedia-analytics but hadn't got around to writing about it here too.
[09:06:26] btullis: no, we don't have SAL here, you can do it in #wikimedia-operations
[09:11:04] OK, thanks. Noted for future reference.
[09:12:14] thank you
[09:22:59] Emperor marostegui thank you for the heads up and for looking into the alerts!
[10:12:01] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:18:10] (sorry for missing the meeting)
[13:01:18] would tomorrow morning be a good time to reboot cumin2002 as far as DB/backup activity is concerned?
[13:02:33] good from my side
[13:02:36] jynus: ^
[13:09:24] same
[13:09:33] I have nothing running
[13:41:33] marostegui: FYI, T311066 is about the always-changing thing with some of the swift frontends
[13:41:33] T311066: rsync::server::module installs an rsync server even when $ensure is absent - https://phabricator.wikimedia.org/T311066
[13:42:28] thanks :)
[13:59:29] PROBLEM - MariaDB sustained replica lag on x2 on db2144 is CRITICAL: 99.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2144&var-port=9104
[14:00:01] PROBLEM - MariaDB sustained replica lag on x2 on db2142 is CRITICAL: 174.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2142&var-port=9104
[14:02:23] RECOVERY - MariaDB sustained replica lag on x2 on db2142 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2142&var-port=9104
[14:04:09] RECOVERY - MariaDB sustained replica lag on x2 on db2144 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2144&var-port=9104
[15:08:12] Emperor: wrt rebooting cumin2002 tomorrow, you have a couple of tmux instances running, can you move them to cumin1001 for today?
[15:14:41] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:35:17] moritzm: ack, have switched my reboot tmuxen over to cumin1001
[16:21:00] sigh, ms-be1055 now on its 6th reboot and not getting its drives in order :-/
[16:22:24] keep trying, there are probably only factorial(number_of_drives) possible orders? XD
[16:24:26] you need to think of it like a Poisson distribution - there's a 25-50% chance it's right on any given reboot, but that means it's possible it'll never come up correctly in a finite number of reboots
[16:26:19] I got a C in statistics, so I barely remember anything about normal distributions... :-(
[16:28:55] it's like flipping a coin - you expect tails about 50% of the time, but you can model how likely you are to not get tails in any number of flips
[16:36:31] ...now reboot 12
[16:36:58] which is getting really tedious 'cos I'd like to stop for the day sometime round about now...
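
Making the coin-flip model at 16:24-16:28 explicit, as an editorial illustration: treating each reboot as an independent trial with the quoted 25-50% per-boot chance of the drives coming up in order, the chance of still being wrong after n reboots is

```latex
P(\text{still misordered after } n \text{ reboots}) = (1-p)^{n},
\qquad 0.75^{12} \approx 0.032,
\qquad 0.50^{12} \approx 2.4\times 10^{-4}.
```

So after the 12 reboots mentioned above the residual chance is a few percent at worst: small, but never exactly zero, which is the point being made about a finite number of reboots.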
[16:39:16] ah, we may be there now
[17:13:51] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:31:01] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 33.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[17:32:35] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 17.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[17:33:45] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[17:34:03] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[18:19:13] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:15:51] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:23:03] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers