[07:46:14] I broke the prometheus mysql exporter for a bit, but it should be fixed by now
[08:40:55] Not sure I understand the prometheus job errors for mysql? metrics seem to be still going through?
[08:46:49] Maybe you missed this: [08:46:14] I broke the prometheus mysql exporter for a bit, but it should be fixed by now
[08:47:54] I indeed did miss it
[08:49:37] however, for some reason I think it still thinks jobs are failing (even if they are not)
[08:49:49] uh?
[08:49:51] that's strange
[08:50:03] see: https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1
[08:50:08] everything should be back to normal
[08:50:20] yeah, I see something happening at that time and then coming back
[08:50:38] but monitoring is still complaining, even if the metrics look right
[08:50:55] so maybe there is something else, though not essential
[08:51:15] that's strange because only mysqld-exporter should've broken
[08:51:33] it says availability is 40%, when I haven't found any missing metrics yet
[08:53:10] I can't find anything wrong with zarcillo
[08:53:16] which could trigger something
[08:53:41] yeah, look at: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-3h&to=now&viewPanel=5
[08:54:02] yeah it looks clean
[08:54:25] oh, it doesn't to me
[08:54:45] as in, it says it has a lot of unpollable hosts, but those are working well
[08:54:50] No, what I mean is that 1 error isn't something strange
[08:55:05] And those hosts are reporting data: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=pc1015&var-port=9104
[08:55:12] so they are definitely pollable
[08:55:25] fyi I stopped haproxy and removed its user on dbproxy1017 → https://phabricator.wikimedia.org/T348956
[08:56:44] jynus: maybe another case of https://phabricator.wikimedia.org/T327384 ?
[08:56:46] Don't know
[08:57:12] yeah, I was thinking along the lines of duplicate metrics
[08:57:18] not sure if it's exactly that
[08:57:33] or something like having hosts twice, one with a bad config or something
[08:57:43] can someone +1 my tiny patch please? https://gerrit.wikimedia.org/r/c/operations/puppet/+/970832
[08:57:48] https://phabricator.wikimedia.org/P53127
[08:57:57] arnaudb: I did already 10 minutes ago
[08:57:58] I am checking the prometheus config
[08:58:01] oh thanks
[08:58:10] ah indeed, cached page
[08:58:14] jynus: not sure if those eqiad files are supposed to be there on the codfw host
[08:58:37] According to this: https://phabricator.wikimedia.org/T327384#8540647 they shouldn't be
[08:58:45] so I think I am going to remove them
[08:58:55] ah, that could be it then
[08:59:15] I was checking eqiad and it didn't have duplicates
[09:02:20] https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1&from=now-30m&to=now looks like some of them are getting fixed?
[09:03:49] yeah, that was it
[09:03:54] will bump the ticket
[09:04:33] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-3h&to=now&viewPanel=5 this is still not getting cleaned though
[09:04:41] <+jinxer-wm> (JobUnavailable) resolved: (5) Reduced availability for job mysql-core in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:04:42] \o/
[09:04:59] it may be one of the few hosts?
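What the JobUnavailable alert was reacting to, per the conversation above: the codfw Prometheus host was carrying stale eqiad target files alongside its own, so the mysql-core job listed targets it could never scrape even though the underlying hosts were fine. As a rough way to see the same thing the prometheus-targets dashboard shows, here is a minimal sketch against the standard Prometheus HTTP targets API; the server URL is a placeholder and is not taken from this log:

```python
#!/usr/bin/env python3
"""List unhealthy Prometheus targets per job.

Minimal sketch using the standard Prometheus HTTP API (/api/v1/targets);
the server URL below is a placeholder, not taken from the conversation.
"""
from collections import defaultdict

import requests

PROMETHEUS_URL = "http://prometheus.example.org:9090"  # placeholder


def down_targets_by_job(base_url: str) -> dict[str, list[str]]:
    """Return {job: [instance, ...]} for every target whose health is not 'up'."""
    resp = requests.get(f"{base_url}/api/v1/targets", timeout=10)
    resp.raise_for_status()
    targets = resp.json()["data"]["activeTargets"]

    down = defaultdict(list)
    for target in targets:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            down[labels.get("job", "unknown")].append(labels.get("instance", "unknown"))
    return dict(down)


if __name__ == "__main__":
    for job, instances in sorted(down_targets_by_job(PROMETHEUS_URL).items()):
        # A stale or duplicated service-discovery target file shows up here as a
        # block of "down" instances even though the hosts themselves are healthy.
        print(f"{job}: {len(instances)} down -> {', '.join(sorted(instances))}")
```

A job whose "down" instances are all reachable and exporting metrics points at the service-discovery files rather than at the hosts, which is what the duplicate eqiad files on the codfw host looked like here.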
[09:05:29] This is definitely looking fixed: https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1&from=now-30m&to=now
[09:06:12] so just to be clear, not your fault, but could I ask you to use "systemctl start generate-mysqld-exporter-config" until it is fixed?
[09:06:25] yeah, no worries
[09:06:41] that is simpler and it is run equally on both dcs
[09:07:08] I will make it impossible to run it wrong soon :-)
[09:07:14] or better, ask arnaudb to do it
[09:07:14] haha
[09:07:19] +1!
[09:07:41] as I'd prefer the know-how would stay among the DBAs
[09:07:50] I can if somebody brings me up to speed :D
[09:07:59] I'd be more than happy to
[09:08:21] marostegui: do you think he has the time? I don't want to take time away from other tasks
[09:08:47] the fix would be just 1 day of creating a puppet patch
[09:08:56] I have enough work to keep myself busy for the foreseeable future, but I'd be glad to be of some help :)
[09:09:22] yeah, I think it is good if you can bring him up to speed and then he can prioritize it amongst all the tasks
[09:09:29] ok
[09:09:53] Some of those tasks will get blocked
[09:13:28] sorry about this, when I logged in and saw all the mysql alerts failing I thought we were in a big outage
[22:16:59] PROBLEM - MariaDB sustained replica lag on s5 on db1183 is CRITICAL: 211 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1183&var-port=9104
[22:17:43] PROBLEM - MariaDB sustained replica lag on s5 on db2157 is CRITICAL: 267 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2157&var-port=9104
[22:18:05] PROBLEM - MariaDB sustained replica lag on s5 on db2171 is CRITICAL: 275 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2171&var-port=13315
[22:18:09] PROBLEM - MariaDB sustained replica lag on s5 on db2178 is CRITICAL: 262 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2178&var-port=9104
[22:18:09] PROBLEM - MariaDB sustained replica lag on s5 on db2137 is CRITICAL: 287 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13315
[22:18:21] PROBLEM - MariaDB sustained replica lag on s5 on db2128 is CRITICAL: 281 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2128&var-port=9104
[22:18:33] PROBLEM - MariaDB sustained replica lag on s5 on db2111 is CRITICAL: 314 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2111&var-port=9104
[22:18:49] PROBLEM - MariaDB sustained replica lag on s5 on db2123 is CRITICAL: 290 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2123&var-port=9104
[23:23:01] RECOVERY - MariaDB sustained replica lag on s5 on db2137 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13315
[23:23:55] RECOVERY - MariaDB sustained replica lag on s5 on db2157 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2157&var-port=9104
[23:41:31] RECOVERY - MariaDB sustained replica lag on s5 on db2128 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2128&var-port=9104
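The "MariaDB sustained replica lag" checks above fire at 2 seconds of lag (critical) and 1 second (warning), as the alert text shows. A minimal sketch of reading the same lag figure out of Prometheus follows; it assumes the metric name mysqld-exporter conventionally exposes for Seconds_Behind_Master and a placeholder server URL, and it only takes an instantaneous value, whereas the production check evaluates sustained lag over a window:

```python
#!/usr/bin/env python3
"""Flag replicas whose reported lag exceeds the alert thresholds.

Sketch only: the Prometheus URL is a placeholder, and the metric name is the
one mysqld-exporter conventionally exposes for Seconds_Behind_Master.
"""
import requests

PROMETHEUS_URL = "http://prometheus.example.org:9090"   # placeholder
METRIC = "mysql_slave_status_seconds_behind_master"     # assumed exporter metric name
WARNING, CRITICAL = 1, 2                                 # thresholds from the alerts above


def lag_by_instance(base_url: str) -> dict[str, float]:
    """Instant-query the lag metric and return {instance: seconds_behind}."""
    resp = requests.get(f"{base_url}/api/v1/query", params={"query": METRIC}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return {r["metric"].get("instance", "unknown"): float(r["value"][1]) for r in result}


if __name__ == "__main__":
    for instance, lag in sorted(lag_by_instance(PROMETHEUS_URL).items()):
        if lag >= CRITICAL:
            print(f"CRITICAL {instance}: {lag:.0f}s behind")
        elif lag >= WARNING:
            print(f"WARNING  {instance}: {lag:.0f}s behind")
```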