[07:46:14] I broke the prometheus mysql exporter for a bit, but it should be fixed by now
[08:40:55] Not sure I understand the prometheus job errors for mysql? metrics seem to be still going through?
[08:46:49] Maybe you missed this: [08:46:14] I broke the prometheus mysql exporter for a bit, but it should be fixed by now
[08:47:54] I indeed did miss it
[08:49:37] however, for some reason I think it still thinks jobs are failing (even if they are not)
[08:49:49] uh?
[08:49:51] that's strange
[08:50:03] see: https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1
[08:50:08] everything should be back to normal
[08:50:20] yeah, I see something happening at that time and then coming back
[08:50:38] but monitoring is still complaining, even if the metrics look right
[08:50:55] so maybe there is something else, though not essential
[08:51:15] that's strange because only mysqld-exporter should've broken
[08:51:33] it says availability is 40%, when I haven't found any missing metrics yet
[08:53:10] I can't find anything wrong with zarcillo
[08:53:16] which could trigger something
[08:53:41] yeah, look at: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-3h&to=now&viewPanel=5
[08:54:02] yeah it looks clean
[08:54:25] oh, it doesn't to me
[08:54:45] as in, it says it has a lot of unpollable hosts, but those are working well
[08:54:50] No, what I mean is that 1 error isn't something strange
[08:55:05] And those hosts are reporting data: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=pc1015&var-port=9104
[08:55:12] so they are definitely pollable
[08:55:25] fyi I stopped haproxy and removed its user on dbproxy1017 → https://phabricator.wikimedia.org/T348956
[08:56:44] jynus: maybe another case of https://phabricator.wikimedia.org/T327384 ?
[08:56:46] Don't know
[08:57:12] yeah, I was thinking along the lines of duplicate metrics
[08:57:18] not sure if it's exactly that
[08:57:33] or something like having hosts twice, one with a bad config or something
[08:57:43] can someone +1 my tiny patch please? https://gerrit.wikimedia.org/r/c/operations/puppet/+/970832
[08:57:48] https://phabricator.wikimedia.org/P53127
[08:57:57] arnaudb: I did already 10 minutes ago
[08:57:58] I am checking the prometheus config
[08:58:01] oh thanks
[08:58:10] ah indeed, cached page
[08:58:14] jynus: not sure if those eqiad files are supposed to be there on the codfw host
[08:58:37] According to this: https://phabricator.wikimedia.org/T327384#8540647 they shouldn't be
[08:58:45] so I think I am going to remove them
[08:58:55] ah, that could be it then
[08:59:15] I was checking eqiad and it didn't have duplicates
[09:02:20] https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1&from=now-30m&to=now looks like some of them are getting fixed?
[09:03:49] yeah, that was it
[09:03:54] will bump the ticket
[09:04:33] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-3h&to=now&viewPanel=5 this is still not getting cleaned though
[09:04:41] <+jinxer-wm> (JobUnavailable) resolved: (5) Reduced availability for job mysql-core in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:04:42] \o/
[09:04:59] it may be one of the few hosts?
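What the JobUnavailable alert was reacting to, per the conversation above: the codfw Prometheus host was carrying stale eqiad target files alongside its own, so the mysql-core job listed targets it could never scrape even though the underlying hosts were fine. As a rough way to see the same thing the prometheus-targets dashboard shows, here is a minimal sketch against the standard Prometheus HTTP targets API; the server URL is a placeholder and is not taken from this log:

```python
#!/usr/bin/env python3
"""List unhealthy Prometheus targets per job.

Minimal sketch using the standard Prometheus HTTP API (/api/v1/targets);
the server URL below is a placeholder, not taken from the conversation.
"""
from collections import defaultdict

import requests

PROMETHEUS_URL = "http://prometheus.example.org:9090"  # placeholder


def down_targets_by_job(base_url: str) -> dict[str, list[str]]:
    """Return {job: [instance, ...]} for every target whose health is not 'up'."""
    resp = requests.get(f"{base_url}/api/v1/targets", timeout=10)
    resp.raise_for_status()
    targets = resp.json()["data"]["activeTargets"]

    down = defaultdict(list)
    for target in targets:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            down[labels.get("job", "unknown")].append(labels.get("instance", "unknown"))
    return dict(down)


if __name__ == "__main__":
    for job, instances in sorted(down_targets_by_job(PROMETHEUS_URL).items()):
        # A stale or duplicated service-discovery target file shows up here as a
        # block of "down" instances even though the hosts themselves are healthy.
        print(f"{job}: {len(instances)} down -> {', '.join(sorted(instances))}")
```

A job whose "down" instances are all reachable and exporting metrics points at the service-discovery files rather than at the hosts, which is what the duplicate eqiad files on the codfw host looked like here.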
[09:05:29] This is definitely looking fixed: https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1&from=now-30m&to=now
[09:06:12] so just to be clear, not your fault, but could I ask you to use "systemctl start generate-mysqld-exporter-config" until it is fixed?
[09:06:25] yeah, no worries
[09:06:41] that is simpler and it is run equally on both dcs
[09:07:08] I will make it impossible to run it wrong soon :-)
[09:07:14] or better, ask arnaudb to do it
[09:07:14] haha
[09:07:19] +1!
[09:07:41] as I'd prefer the know-how would stay among the DBAs
[09:07:50] I can if somebody brings me up to speed :D
[09:07:59] I'd be more than happy to
[09:08:21] marostegui: do you think he has the time? I don't want to take time away from other tasks
[09:08:47] the fix would be just 1 day of creating a puppet patch
[09:08:56] I have enough work to keep myself busy for the foreseeable future, but I'd be glad to be of some help :)
[09:09:22] yeah, I think it is good if you can bring him up to speed and then he can prioritize it amongst all the tasks
[09:09:29] ok
[09:09:53] Some of those tasks will get blocked
[09:13:28] sorry about this, when I logged in and saw all the mysql alerts failing I thought we were in a big outage
[22:16:59] PROBLEM - MariaDB sustained replica lag on s5 on db1183 is CRITICAL: 211 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1183&var-port=9104
[22:17:43] PROBLEM - MariaDB sustained replica lag on s5 on db2157 is CRITICAL: 267 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2157&var-port=9104
[22:18:05] PROBLEM - MariaDB sustained replica lag on s5 on db2171 is CRITICAL: 275 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2171&var-port=13315
[22:18:09] PROBLEM - MariaDB sustained replica lag on s5 on db2178 is CRITICAL: 262 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2178&var-port=9104
[22:18:09] PROBLEM - MariaDB sustained replica lag on s5 on db2137 is CRITICAL: 287 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13315
[22:18:21] PROBLEM - MariaDB sustained replica lag on s5 on db2128 is CRITICAL: 281 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2128&var-port=9104
[22:18:33] PROBLEM - MariaDB sustained replica lag on s5 on db2111 is CRITICAL: 314 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2111&var-port=9104
[22:18:49] PROBLEM - MariaDB sustained replica lag on s5 on db2123 is CRITICAL: 290 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2123&var-port=9104
[23:23:01] RECOVERY - MariaDB sustained replica lag on s5 on db2137 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13315
[23:23:55] RECOVERY - MariaDB sustained replica lag on s5 on db2157 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2157&var-port=9104
[23:41:31] RECOVERY - MariaDB sustained replica lag on s5 on db2128 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2128&var-port=9104
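The "MariaDB sustained replica lag" checks above fire at 2 seconds of lag (critical) and 1 second (warning), as the alert text shows. A minimal sketch of reading the same lag figure out of Prometheus follows; it assumes the metric name mysqld-exporter conventionally exposes for Seconds_Behind_Master and a placeholder server URL, and it only takes an instantaneous value, whereas the production check evaluates sustained lag over a window:

```python
#!/usr/bin/env python3
"""Flag replicas whose reported lag exceeds the alert thresholds.

Sketch only: the Prometheus URL is a placeholder, and the metric name is the
one mysqld-exporter conventionally exposes for Seconds_Behind_Master.
"""
import requests

PROMETHEUS_URL = "http://prometheus.example.org:9090"   # placeholder
METRIC = "mysql_slave_status_seconds_behind_master"     # assumed exporter metric name
WARNING, CRITICAL = 1, 2                                 # thresholds from the alerts above


def lag_by_instance(base_url: str) -> dict[str, float]:
    """Instant-query the lag metric and return {instance: seconds_behind}."""
    resp = requests.get(f"{base_url}/api/v1/query", params={"query": METRIC}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return {r["metric"].get("instance", "unknown"): float(r["value"][1]) for r in result}


if __name__ == "__main__":
    for instance, lag in sorted(lag_by_instance(PROMETHEUS_URL).items()):
        if lag >= CRITICAL:
            print(f"CRITICAL {instance}: {lag:.0f}s behind")
        elif lag >= WARNING:
            print(f"WARNING  {instance}: {lag:.0f}s behind")
```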