[08:30:43] <volans>	 is there a task for the noisy PrometheusMysqldExporterFailed errors? I think I've found the issue
[08:31:23] <volans>	 all the failed ones are lacking this specific grant:
[08:31:23] <volans>	 GRANT SELECT ON `heartbeat`.`heartbeat` TO `prometheus`@`localhost`
[08:32:08] <volans>	 that matches the error:
[08:32:11] <volans>	 err="Error 1142: SELECT command denied to user 'prometheus'@'localhost' for table `heartbeat`.`heartbeat`"
[08:32:18] <volans>	 from journalctl
[12:00:37] <Amir1>	 volans: https://people.wikimedia.org/~ladsgroup/omg/ check prometheus for user + localhost for target
[12:00:48] <Amir1>	 GRANT SELECT ON `heartbeat`.`heartbeat` TO `prometheus`@`localhost`
[12:00:54] <Amir1>	 This seems to be missing in six hosts
[12:14:35] <dhinus>	 LOL @ "Oh My Grants!", I didn't know about it :)
[12:17:48] <Amir1>	 the reason it has this name is that I loudly said OMG when I first produced the report of our grants. It's much better now but I kept the name :D
[12:22:18] <dhinus>	 hahaha makes a lot of sense :D
[13:02:25] <volans>	 Amir1: we have 5 instances complaining in alerts.w.o, 2 of which on the same host
[13:03:17] <volans>	 the question is what's the procedure to add it? :D
[13:17:39] <Amir1>	 volans: just log in and add the grant, just make sure you run "set session sql_log_bin=0;" first to avoid it being replicated (not that it matters in this case, it's just good hygiene)
[13:18:02] <Amir1>	 I have a script to run changes like this en masse but for five, manually it's easier
[13:18:21] <volans>	 yeah ofc (no replication), ack thx
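(Note: the manual fix described above can be sketched as a small helper that emits the statements in the order Amir1 suggests. This is a minimal sketch, not the script Amir1 mentions; the function name `grant_statements` and its parameters are hypothetical, and it only builds the SQL text rather than connecting to any server.)

```python
def grant_statements(user="prometheus", host="localhost",
                     database="heartbeat", table="heartbeat"):
    """Return the statements to run, in order, within a single session.

    Hypothetical helper for illustration; defaults mirror the grant
    quoted in the log.
    """
    return [
        # Disable binary logging for this session so the grant is not replicated.
        "SET SESSION sql_log_bin=0;",
        # The grant the failing exporters were missing.
        f"GRANT SELECT ON `{database}`.`{table}` TO `{user}`@`{host}`;",
    ]


if __name__ == "__main__":
    for statement in grant_statements():
        print(statement)
```

(The session-level setting has to come first and in the same session as the grant, otherwise the grant would still be written to the binary log.)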
[13:46:51] <volans>	 Amir1: I've fixed 2, but the other 3 don't have the heartbeat database at all. Is there an easy way to tell the exporter to not check for heartbeat?
[15:53:49] <Amir1>	 volans: there should be a service I think it's called pt-heartbeat or something. Maybe check for that in the host?
[15:54:39] <volans>	 no what I meant is that they are probably not supposed to have it
[15:55:11] <volans>	 like the staging instance on dbstore1009, and the matomo and analytics_meta instances on db1208
[15:55:13] <Amir1>	 if they don't have replication set up, then pt-heartbeat doesn't make sense, are they RO ES hosts?
[15:55:31] <volans>	 ^^^
[15:55:34] <Amir1>	 ah, I don't know :( these are really special cases
[15:55:42] <Amir1>	 let me think a bit about it
[15:55:56] <volans>	 so I think it's correct that they don't have heartbeat, the question is why prometheus is complaining
[15:56:16] <volans>	 or if there is a way to tell it to skip the heartbeat metrics
[15:56:54] <Amir1>	 we can add a hiera role or variable or something
[15:57:09] <Amir1>	 "skip_heartbeat"
[15:57:20] <Amir1>	 that'd make it explicit 
[15:57:33] <Amir1>	 I don't like implicit logic. They don't need it for different reasons
[15:57:43] <volans>	 to skip the --collect.heartbeat you mean? I think I've seen a patch like this passing by recently by arnaud
[15:57:54] <volans>	 I'll have a look
[15:58:01] <volans>	 maybe it's just a missing hiera key
[15:58:31] <Amir1>	 yeah, something like a hiera value being set in the host's hiera file and when it's set, the prometheus exporter skips it
[15:58:34] <Amir1>	 or something like that
[15:58:40] <Amir1>	 I have to go afk for a bit
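(Note: the explicit opt-out floated above could look roughly like this. It is a sketch under assumptions: `skip_heartbeat` is the key name proposed in the chat, not an existing hiera key, and `exporter_args` plus the per-instance dict are hypothetical stand-ins for whatever the puppet code actually does. Only the `--collect.heartbeat` flag comes from the discussion.)

```python
def exporter_args(instance_config):
    """Build the heartbeat-related exporter flags for one MySQL instance.

    `instance_config` is a hypothetical dict of per-instance settings.
    Heartbeat collection stays on unless the instance opts out explicitly.
    """
    args = []
    # `skip_heartbeat` is the proposed, not yet existing, opt-out key.
    if not instance_config.get("skip_heartbeat", False):
        args.append("--collect.heartbeat")
    return args


if __name__ == "__main__":
    # An instance without replication opts out explicitly.
    print(exporter_args({"skip_heartbeat": True}))
    # An instance with no setting keeps heartbeat collection enabled.
    print(exporter_args({}))
```

(Defaulting to collection enabled keeps the opt-out explicit per instance, which matches the preference stated above for avoiding implicit logic.)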
[16:16:06] <dhinus>	 as far as I know prometheus did not check pt-heartbeat on /any/ server until a few weeks ago
[16:16:30] <dhinus>	 arnaud started using it to implement the new prometheus-based alerts that will replace the old icinga ones
[16:16:46] <dhinus>	 maybe in the phab tasks there are mentions to which server should be included etc.?
[16:16:54] <dhinus>	 *servers
[16:20:49] <volans>	 yeah I'll have a look shortly at the patch, doing something else right now :)
[18:35:55] <volans>	 I've opened T371049 as a follow up
[18:35:55] <stashbot>	 T371049: prometheus-mysqld-exporter doesn't take fully support multi-instances for pt-heartbeat - https://phabricator.wikimedia.org/T371049
[23:20:48] <jinxer-wm>	 FIRING: [20x] MysqlReplicationLagPtHeartbeat: MySQL instance db1160:9104 has too large replication lag (15m 8s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[23:30:58] <jinxer-wm>	 RESOLVED: [20x] MysqlReplicationLagPtHeartbeat: MySQL instance db1160:9104 has too large replication lag (21m 56s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica  - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat