[01:07:04] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 5 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:08:08] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:47:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[05:47:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[09:47:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[09:53:15] that's an analytics node, should it be alerting here?
[10:00:30] Emperor: probably not, no
[12:18:23] here is my proposal, as warned: https://gerrit.wikimedia.org/r/c/operations/puppet/+/868072
[12:18:47] (going for lunch, answer on patch if you have comments)
[12:21:28] I will comment there
[13:47:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[14:04:08] marostegui: this should fix the alert https://gerrit.wikimedia.org/r/c/operations/alerts/+/868085
[14:05:05] how can we test that?
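[Editor's note: the alert discussed above is defined in the operations/alerts repo, which isn't quoted in this log. As context, a minimal sketch of what such a Prometheus alerting rule could look like, assuming it fires when the mysqld-exporter scrape target goes down; the expression, job label, and severity here are illustrative guesses, not the actual rule from the patch.]

```yaml
groups:
  - name: mysql
    rules:
      - alert: PrometheusMysqldExporterFailed
        # Assumed expression: the exporter's scrape target reports down.
        # The real rule may use a different metric or label set.
        expr: up{job="mysqld-exporter"} == 0
        for: 30m          # matches the "for: 30m" quoted later in the log
        labels:
          severity: warning
        annotations:
          summary: "Prometheus-mysqld-exporter failed ({{ $labels.instance }})"
          runbook: "TODO"  # the log notes this should become a real runbook link
```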
[14:05:44] there is the unit test being changed there, but on top of that, check https://grafana-rw.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=dbstore&var-shard=All&var-role=All&editPanel=1
[14:05:46] Should we just introduce replication lag somewhere (i.e. db1117:3321)?
[14:06:17] that's not the replication lag, it's the prometheus exporter failing; we can kill the exporter
[14:06:36] oh sorry yes, I just thought about the lag one
[14:06:48] the lag one should only show up for core dbs
[14:06:48] +1ed
[14:06:51] yeah
[14:08:49] Thanks. Gonna merge it and possibly kill the prometheus exporter somewhere to test
[14:09:02] sounds good
[14:14:58] thanks :)
[14:18:54] ^_^
[14:19:10] we should replace the TODO with an actual runbook though
[14:19:16] future me problem
[14:24:52] db1152.yaml:profile::monitoring::notifications_enabled: false and it's an x2 replica, meaning it won't get any traffic. Shall I stop the prometheus exporter there?
[14:25:04] sure
[14:26:02] done
[14:29:36] > for: 30m
[14:29:42] I have to wait half an hour :D
[14:52:16] (PrometheusMysqldExporterFailed) resolved: Prometheus-mysqld-exporter failed (an-coord1001:9104) - TODO - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[14:53:04] huh
[14:53:08] sigh
[14:53:38] it's possible it's not deployed yet. I'll have to check
[14:55:42] hmm, that was for resolving, good I guess?
[15:59:31] but why didn't db1152 alert yet?
[15:59:47] did puppet start the exporter?
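[Editor's note: the 14:05:44 message mentions a rule unit test in the patch, and the 14:29 messages show the `for: 30m` hold adds a 30-minute wait before a live test fires. A sketch of how such an alert could be unit-tested offline with `promtool test rules` instead, assuming a rule that fires on `up == 0` held for 30m; file names, job labels, and series are hypothetical, not the actual operations/alerts test.]

```yaml
# Hypothetical promtool unit-test file; run with: promtool test rules <this-file>
rule_files:
  - mysql.yaml            # assumed file containing the alert rule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Exporter reported down for 32 consecutive 1m samples (0m..31m),
      # i.e. longer than the rule's 30m hold period.
      - series: 'up{job="mysqld-exporter", instance="db1152:9104"}'
        values: '0x31'
    alert_rule_test:
      - eval_time: 31m
        alertname: PrometheusMysqldExporterFailed
        exp_alerts:
          - exp_labels:
              severity: warning
              job: mysqld-exporter
              instance: db1152:9104
```

This exercises the 30-minute hold without waiting in real time, which is the half-hour delay complained about at 14:29:42.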