[00:22:09] FIRING: [3x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:22:09] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:06] FIRING: MysqlReplicationThreadCountTooLow: MySQL instance pc1017:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=pc1017&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[08:22:09] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:06] RESOLVED: MysqlReplicationThreadCountTooLow: MySQL instance pc1017:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=pc1017&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[08:32:28] Anyone doing anything with db1125?
[08:32:31] It seems to be down
[08:32:52] It was rebooted on Thursday, I guess mariadb wasn't brought up?
[08:33:03] I will get it fixed
[08:36:50] pc1017 is fixed (also back in orchestrator as there were some grants issues there)
[08:54:14] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:09] RESOLVED: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:02:47] <_joe_> Genuine question: I see these "pt-heartbeat-wikimedia" service failures constantly
[09:02:58] <_joe_> as-is it seems like an alert that's designed to be ignored
[09:03:13] <_joe_> maybe we need to raise the interval of checks for it?
[09:04:14] Also db1125 is a test host. It shouldn't alert at all. Still alerting as you see
[09:06:55] Amir1: missing downtime, or is there some config not being applied correctly?
[09:07:39] profile::monitoring::notifications_enabled is set to false
[09:10:30] but that doesn't prevent IRC notifications IIRC
[09:12:06] :sadpanda:
[09:13:17] (it probably _ought_ to?)
[09:13:20] no actually my bad
[09:13:25] it does prevent notifications in icinga
[09:13:27] correctly
[09:13:40] notifications_enabled is set to 0 for db1125
[09:13:49] so icinga will not alert you on that host
[09:14:07] if the mechanism of profile::monitoring::notifications_enabled has been properly integrated into AM I'm not sure
[09:15:20] Maybe time for a phab ticket chez observability
[10:50:29] volans: yes, AM always bypasses the puppet notification mechanism
[10:50:36] (at least for DBs)
[11:15:01] <_joe_> for everything
[11:15:05] <_joe_> there is no integration
[11:43:44] :sadpanda:
[12:57:37] i'm considering canceling today's team meeting. let me know if I shouldn't do that in the next 30 minutes...
[13:03:48] ah, yes, we are a bit short-handed today
[16:32:09] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:02:09] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:02:09] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed