[08:30:43] is there a task for the noisy PrometheusMysqldExporterFailed errors? I think I've found the issue
[08:31:23] all the failed ones are lacking this specific grant:
[08:31:23] GRANT SELECT ON `heartbeat`.`heartbeat` TO `prometheus`@`localhost`
[08:32:08] that matches the error:
[08:32:11] err="Error 1142: SELECT command denied to user 'prometheus'@'localhost' for table `heartbeat`.`heartbeat`"
[08:32:18] from journalctl
[12:00:37] volans: https://people.wikimedia.org/~ladsgroup/omg/ check prometheus for user + localhost for target
[12:00:48] GRANT SELECT ON `heartbeat`.`heartbeat` TO `prometheus`@`localhost`
[12:00:54] This seems to be missing on six hosts
[12:14:35] LOL @ "Oh My Grants!", I didn't know about it :)
[12:17:48] the reason it has this name is that I loudly said OMG when I first produced the report of our grants. It's much better now but I kept the name :D
[12:22:18] hahaha makes a lot of sense :D
[13:02:25] Amir1: we have 5 instances complaining in alerts.w.o, 2 of which are on the same host
[13:03:17] the question is what's the procedure to add it? :D
[13:17:39] volans: just log in and add the right, just make sure you run "set session sql_log_bin=0;" first to avoid it being replicated (not that it matters in this case, it's just good hygiene)
[13:18:02] I have a script to run changes like this en masse but for five, manually it's easier
[13:18:21] yeah ofc (no replication), ack thx
[13:46:51] Amir1: I've fixed 2, but the other 3 don't have the heartbeat database at all. Is there an easy way to tell the exporter to not check for heartbeat?
[15:53:49] volans: there should be a service, I think it's called pt-heartbeat or something. Maybe check for that on the host?
[15:54:39] no what I meant is that they are probably not supposed to have it
[15:55:11] like on dbstore1009 in the staging instance, on db1208 is on the matomo and analytics_meta instances
[15:55:13] if they don't have replication set up, then pt-heartbeat doesn't make sense, are they RO ES hosts?
[15:55:31] ^^^
[15:55:34] ah, I don't know :( these are really special cases
[15:55:42] let me think a bit about it
[15:55:56] so I think it's correct that they don't have heartbeat, the question is why prometheus is complaining
[15:56:16] or if there is a way to tell it to skip the heartbeat metrics
[15:56:54] we can add a hiera role or variable or something
[15:57:09] "skip_heartbeat"
[15:57:20] that'd make it explicit
[15:57:33] I don't like implicit logic. They don't need it for different reasons
[15:57:43] to skip the --collect.heartbeat you mean? I think I've seen a patch like this passing by recently by arnaud
[15:57:54] I'll have a look
[15:58:01] maybe it's just a missing hiera key
[15:58:31] yeah, something like a hiera value being set in the host's hiera file and when it's set, the prometheus exporter skips it
[15:58:34] or something like that
[15:58:40] I have to go afk for a bit
[16:16:06] as far as I know prometheus did not check pt-heartbeat on /any/ server until a few weeks ago
[16:16:30] arnaud started using it to implement the new prometheus-based alerts that will replace the old icinga ones
[16:16:46] maybe in the phab tasks there are mentions to which server should be included etc.?
[16:16:54] *servers
[16:20:49] yeah I'll have a look shortly at the patch, doing something else right now :)
[18:35:55] I've opened T371049 as a follow up
[18:35:55] T371049: prometheus-mysqld-exporter doesn't take fully support multi-instances for pt-heartbeat - https://phabricator.wikimedia.org/T371049
[23:20:48] FIRING: [20x] MysqlReplicationLagPtHeartbeat: MySQL instance db1160:9104 has too large replication lag (15m 8s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[23:30:58] RESOLVED: [20x] MysqlReplicationLagPtHeartbeat: MySQL instance db1160:9104 has too large replication lag (21m 56s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
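
Note on the grant fix discussed between 08:31 and 13:17: the missing grant can be read straight off the exporter's error line, so a rough sketch of that mapping might look like the following. This is not the script mentioned at 13:18:02. The names `DENIED_RE`, `grant_for_denied_error` and `fix_statements` are hypothetical, and the pattern assumes the error text has exactly the form quoted at 08:32:11 (other server versions may word error 1142 differently). It only builds the statements as strings and does not connect to any database.

```python
import re

# Matches the "command denied" text of error 1142 as quoted in the log.
# Assumption: database and table are backtick-quoted, as in that message.
DENIED_RE = re.compile(
    r"(?P<priv>[A-Z]+) command denied to user "
    r"'(?P<user>[^']+)'@'(?P<host>[^']+)' "
    r"for table `(?P<db>[^`]+)`\.`(?P<table>[^`]+)`"
)


def grant_for_denied_error(message):
    """Return the GRANT statement that would satisfy the denied command, or None."""
    match = DENIED_RE.search(message)
    if match is None:
        return None
    return "GRANT {priv} ON `{db}`.`{table}` TO `{user}`@`{host}`".format(
        **match.groupdict()
    )


def fix_statements(message):
    """Return the statements to run by hand, with binary logging disabled first."""
    grant = grant_for_denied_error(message)
    if grant is None:
        return []
    # sql_log_bin=0 keeps the grant out of the binlog so it is not replicated,
    # as suggested at 13:17:39.
    return ["SET SESSION sql_log_bin=0;", grant + ";"]


if __name__ == "__main__":
    line = (
        "err=\"Error 1142: SELECT command denied to user "
        "'prometheus'@'localhost' for table `heartbeat`.`heartbeat`\""
    )
    for statement in fix_statements(line):
        print(statement)
```

This only helps on instances that actually have the heartbeat database. For the three instances without it, the open question from 15:55 onwards (skipping the heartbeat collector) still applies.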