[00:11:52] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:32:32] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[01:22:00] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:18:36] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:14:12] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:32:32] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[07:10:59] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:59:57] Emperor godog what should we do with ms-be1039? It has an alert 25 days old; should we ACK it?
[08:00:40] 👀
[08:02:15] marostegui: fixed
[08:02:22] <3
[08:02:25] thanks!
[08:02:53] Emperor: if you can check this too at some point: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cumin1001&service=Ensure+hosts+are+not+performing+a+change+on+every+puppet+run
[08:03:12] some of the ms are involved
[08:11:59] Hm, I think that _may_ be a bug in rsync::server::module
[08:13:04] the changing hosts are the nodes that are not ring_manager, so they are passing absent to rsync::server::module's "ensure" parameter, but it's still installing an rsync server without /etc/rsync.conf, so systemd won't start it even though puppet keeps trying to
[08:13:38] I wonder if that's because rsync::server has ensure_service rather than ensure.
[08:20:09] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:23:43] I've asked the -foundations folks who actually know about puppet :)
[08:32:32] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[08:33:21] jynus_: you mentioned you created a ticket for ^ or should I?
[08:34:38] I didn't; I reported it to btullis and the analytics IRC chat when he was unavailable, got acknowledgment from him and someone else from the team, that's all
[08:35:11] they promised to look into it
[08:35:59] Thanks both. Looking into it now. I checked last time but all metrics appeared to be flowing, so perhaps I misunderstood the issue.
[08:36:43] jynus: ah cool! thank you
[08:36:46] btullis: thank you!
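
A minimal, hypothetical sketch of the shape Emperor describes at 08:11-08:13: a define that pulls in the server class unconditionally while only its own config file honours $ensure, and a server class that exposes only ensure_service. The class/define bodies below are illustrative assumptions, not the actual Wikimedia manifests (see T311066 later in the log).

```puppet
# Hypothetical, simplified manifests illustrating the suspected bug;
# names and parameters are illustrative, not the real Wikimedia puppet code.

class rsync::server (
  # Only the service *state* is parameterised; there is no top-level
  # $ensure that would keep the rsync server off a host entirely.
  String $ensure_service = 'running',
) {
  package { 'rsync':
    ensure => present,
  }

  service { 'rsync':
    ensure  => $ensure_service,
    enable  => true,
    require => Package['rsync'],
  }
}

define rsync::server::module (
  String           $ensure = 'present',
  Optional[String] $path   = undef,
) {
  # The parent class is pulled in unconditionally, so even callers that
  # pass ensure => absent still get the rsync package and service managed.
  include rsync::server

  # Only the module's config file honours $ensure, which would leave an
  # rsync service with no configuration for systemd to start.
  file { "/etc/rsyncd.d/${title}.conf":
    ensure  => $ensure,
    content => "[${title}]\n  path = ${path}\n",
  }
}
```

With that shape, hosts passing ensure => absent would still see Puppet retry starting a service that cannot start on every run, which would match the "Ensure hosts are not performing a change on every puppet run" Icinga check linked above.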
[08:55:25] looks like it restarted 3m29s ago
[09:05:49] Yes, I've restarted the prometheus-mysqld-exporter. Is there a SAL !log function in this channel? I logged it via #wikimedia-analytics but hadn't got around to writing about it here too.
[09:06:26] btullis: no, we don't have SAL here, you can do it in #wikimedia-operations
[09:11:04] OK, thanks. Noted for future reference.
[09:12:14] thank you
[09:22:59] Emperor marostegui thank you for the heads up and for looking into the alerts!
[10:12:01] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:18:10] (sorry for missing the meeting)
[13:01:18] would tomorrow morning be a good time to reboot cumin2002 as far as DB/backup activity is concerned?
[13:02:33] good from my side
[13:02:36] jynus: ^
[13:09:24] same
[13:09:33] I have nothing running
[13:41:33] marostegui: FYI, T311066 is about the always-changing thing with some of the swift frontends
[13:41:33] T311066: rsync::server::module installs an rsync server even when $ensure is absent - https://phabricator.wikimedia.org/T311066
[13:42:28] thanks :)
[13:59:29] PROBLEM - MariaDB sustained replica lag on x2 on db2144 is CRITICAL: 99.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2144&var-port=9104
[14:00:01] PROBLEM - MariaDB sustained replica lag on x2 on db2142 is CRITICAL: 174.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2142&var-port=9104
[14:02:23] RECOVERY - MariaDB sustained replica lag on x2 on db2142 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2142&var-port=9104
[14:04:09] RECOVERY - MariaDB sustained replica lag on x2 on db2144 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2144&var-port=9104
[15:08:12] Emperor: wrt rebooting cumin2002 tomorrow, you have a couple of tmux instances running, can you move them to cumin1001 for today?
[15:14:41] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:35:17] moritzm: ack, have switched my reboot tmuxen over to cumin1001
[16:21:00] sigh, ms-be1055 now on its 6th reboot and not getting its drives in order :-/
[16:22:24] keep trying, there are probably only factorial(number_of_drives) possible orders? XD
[16:24:26] you need to think of it like a Poisson distribution - there's a 25-50% chance it's right on any given reboot, but that means it's possible it'll never come up correctly in a finite number of reboots
[16:26:19] I got a C in statistics, so I barely remember anything about normal distributions... :-(
[16:28:55] it's like flipping a coin - you expect tails about 50% of the time, but you can model how likely you are to not get tails in any number of flips
[16:36:31] ...now reboot 12
[16:36:58] which is getting really tedious 'cos I'd like to stop for the day sometime round about now...
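
Making the coin-flip model at 16:24-16:28 explicit, as an editorial illustration: treating each reboot as an independent trial with the quoted 25-50% per-boot chance of the drives coming up in order, the chance of still being wrong after n reboots is

```latex
P(\text{still misordered after } n \text{ reboots}) = (1-p)^{n},
\qquad 0.75^{12} \approx 0.032,
\qquad 0.50^{12} \approx 2.4\times 10^{-4}.
```

So after the 12 reboots mentioned above the residual chance is a few percent at worst: small, but never exactly zero, which is the point being made about a finite number of reboots.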
[16:39:16] ah, we may be there now
[17:13:51] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:31:01] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 33.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[17:32:35] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 17.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[17:33:45] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[17:34:03] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[18:19:13] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:15:51] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:23:03] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers