[01:07:04] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 5 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:08:08] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:47:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[05:47:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[09:47:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[09:53:15] that's an analytics node, should it be alerting here?
[10:00:30] Emperor: probably not, no
[12:18:23] here is my proposal, as warned: https://gerrit.wikimedia.org/r/c/operations/puppet/+/868072
[12:18:47] (going for lunch, answer on patch if you have comments)
[12:21:28] I will comment there
[13:47:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[14:04:08] marostegui: this should fix the alert https://gerrit.wikimedia.org/r/c/operations/alerts/+/868085
[14:05:05] how can we test that?
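[Editor's note: the alert discussed above is defined in the operations/alerts repo, which isn't quoted in this log. As context, a minimal sketch of what such a Prometheus alerting rule could look like, assuming it fires when the mysqld-exporter scrape target goes down; the expression, job label, and severity here are illustrative guesses, not the actual rule from the patch.]

```yaml
groups:
  - name: mysql
    rules:
      - alert: PrometheusMysqldExporterFailed
        # Assumed expression: the exporter's scrape target reports down.
        # The real rule may use a different metric or label set.
        expr: up{job="mysqld-exporter"} == 0
        for: 30m          # matches the "for: 30m" quoted later in the log
        labels:
          severity: warning
        annotations:
          summary: "Prometheus-mysqld-exporter failed ({{ $labels.instance }})"
          runbook: "TODO"  # the log notes this should become a real runbook link
```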
[14:05:44] there is the unit test being changed there, but on top of that, check https://grafana-rw.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=dbstore&var-shard=All&var-role=All&editPanel=1
[14:05:46] Should we just introduce replication lag somewhere (i.e. db1117:3321)?
[14:06:17] that's not the replication lag, it's the prometheus exporter failing; we can kill the exporter
[14:06:36] oh sorry yes, I just thought about the lag one
[14:06:48] the lag one should only show up for core dbs
[14:06:48] +1ed
[14:06:51] yeah
[14:08:49] Thanks. Gonna merge it and possibly kill the prometheus exporter somewhere to test
[14:09:02] sounds good
[14:14:58] thanks :)
[14:18:54] ^_^
[14:19:10] we should replace the TODO with an actual runbook though
[14:19:16] future me problem
[14:24:52] db1152.yaml:profile::monitoring::notifications_enabled: false and it's an x2 replica, meaning it won't get any traffic. Shall I stop the prometheus exporter there?
[14:25:04] sure
[14:26:02] done
[14:29:36] > for: 30m
[14:29:42] I have to wait half an hour :D
[14:52:16] (PrometheusMysqldExporterFailed) resolved: Prometheus-mysqld-exporter failed (an-coord1001:9104) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[14:53:04] huh
[14:53:08] sigh
[14:53:38] it's possible it's not deployed yet. I'll have to check
[14:55:42] hmm, that was for resolving, good I guess?
[15:59:31] but why didn't db1152 alert yet?
[15:59:47] did puppet start the exporter?
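[Editor's note: the 14:05:44 message mentions a rule unit test in the patch, and the 14:29 messages show the `for: 30m` hold adds a 30-minute wait before a live test fires. A sketch of how such an alert could be unit-tested offline with `promtool test rules` instead, assuming a rule that fires on `up == 0` held for 30m; file names, job labels, and series are hypothetical, not the actual operations/alerts test.]

```yaml
# Hypothetical promtool unit-test file; run with: promtool test rules <this-file>
rule_files:
  - mysql.yaml            # assumed file containing the alert rule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Exporter reported down for 32 consecutive 1m samples (0m..31m),
      # i.e. longer than the rule's 30m hold period.
      - series: 'up{job="mysqld-exporter", instance="db1152:9104"}'
        values: '0x31'
    alert_rule_test:
      - eval_time: 31m
        alertname: PrometheusMysqldExporterFailed
        exp_alerts:
          - exp_labels:
              severity: warning
              job: mysqld-exporter
              instance: db1152:9104
```

This exercises the 30-minute hold without waiting in real time, which is the half-hour delay complained about at 14:29:42.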