[06:01:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1246:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:26:48] FIRING: MysqlReplicationLag: MySQL instance db1246:9104@s2 has too large replication lag (1d 16h 11m 38s). Its replication source is db1162.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1246&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[06:26:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1246:9104 has too large replication lag (1d 16h 11m 38s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1246&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[06:59:52] db1246 is pooled and lagging, I am going to depool it
[07:00:06] ah no, it is depooled
[07:01:36] should I reset-failed the wmf_auto_restart_prometheus-mysqld-exporter.service?
[07:06:13] It is not pooled, it is the host that crashed a few days ago
[07:06:21] jynus: I will do it, thanks
[07:11:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1246:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:47:00] volans: I was starting to get ready for the prepare cookbook for Monday and I was reading everything and ran into: https://phabricator.wikimedia.org/P74216 - line 39 probably isn't expected, is it? :)
[07:48:59] checking
[07:57:15] marostegui: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1127455 that's argparse trying to be too smart with a %section :D
[07:57:23] sorry about that
[07:57:59] haha no problem :)
[07:58:01] BTW the cookbook was changed on purpose so it can be run on test-s4 or test-s1 as a production test
[07:58:32] in case you want to be more familiar with it
[08:28:38] marostegui: fix merged and deployed
[08:30:04] yeah, I just checked :)
[08:30:06] thanks volans
[09:51:48] RESOLVED: MysqlReplicationLag: MySQL instance db1246:9104@s2 has too large replication lag (5m 23s). Its replication source is db1162.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1246&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[09:51:48] RESOLVED: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1246:9104 has too large replication lag (5m 23s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1246&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
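
Note on the 07:57:15 argparse remark: argparse applies %-formatting to help strings, so a literal percent sign in help text generally has to be written as %%. The sketch below only illustrates that general behaviour and is not the cookbook's code or the actual fix; render_help, the --threshold option and the help text are all hypothetical.

```python
import argparse


def render_help(help_text):
    """Return the formatted help for a one-option parser.

    Returns None if argparse rejects help_text. Depending on the Python
    version the failure may surface when the argument is added or when
    the help is formatted, so both steps sit inside the try block.
    """
    try:
        parser = argparse.ArgumentParser(prog="demo")
        # Hypothetical option; argparse runs help_text through % formatting.
        parser.add_argument("--threshold", help=help_text)
        return parser.format_help()
    except (TypeError, ValueError):
        return None


# "%%" is rendered as a single literal "%".
escaped = render_help("alert when lag exceeds 50%% of the limit")

# A bare "%" followed by " o" is parsed as a format specifier and fails.
unescaped = render_help("alert when lag exceeds 50% of the limit")

if __name__ == "__main__":
    print(escaped)
    print("unescaped help rejected:", unescaped is None)
```

Running the help output once (for example with --help) before a scheduled run is a cheap way to catch this kind of formatting surprise.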