[01:09:50] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 8.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:24] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[14:36:57] I see an UNKNOWN on db2151, could it need some zarcillo updates?
[14:51:25] mmmmm maybe
[14:51:29] I'll take a look
[14:51:34] thanks
[14:54:36] no rush, not meaning to tell you to do stuff, just trying to understand the ongoing status - there was recently a flood of alerts due to eqsin, and I was reviewing icinga
[15:00:17] Should be fixed
[15:00:18] now
[15:00:48] I have forced the check to run again
[15:03:01] mmm it is not recovering, but the issue was two entries in the instance_section table on zarcillo
[15:04:23] do you want me to have a second look? maybe it is not zarcillo, but somewhere else
[15:04:39] oh, sorry, I just read the last sentence
[15:04:57] then it may need a faster refresh on the prometheus hosts, I can do that
[15:05:46] I ran "/usr/local/sbin/mysqld_exporter_config.py codfw '/srv/prometheus/ops/targets'" on the prometheus hosts
[15:05:52] let's see if that helped
[15:07:33] (I think it normally only runs every 20 minutes or so)
[15:11:27] I think that did it
[15:14:21] Yeah, I also completely deleted it and re-added it
[15:15:48] he he
[15:16:06] Anyways, looks good, thanks!
[15:18:33] I am going to stop the sanitarium master for s1, so there will be lag on s1 on the wikireplicas
[15:18:36] I will !log it now
[16:30:48] marostegui: size of the table in every wiki https://phabricator.wikimedia.org/P42984
[16:31:43] Amir1: thanks, yeah I checked it too :)
[16:31:49] I am running it with replication enabled
[16:31:55] cool
[16:32:10] I will get the patch merged anyways
[16:32:13] As I already sent it
[16:32:19] It doesn't hurt to have another example there
[16:32:23] in s1 we probably have to run it without replication I guess, 1M rows
[16:32:32] yeah, enwiki will need to be done host by host
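(Editor's note: a minimal sketch of what a host-by-host run "without replication" could look like on an s1 replica, as discussed above. The actual table and ALTER come from the patch mentioned in the log and are not shown here, so the statement below is purely illustrative.)

```sql
-- Hypothetical sketch only: the real ALTER comes from the patch referenced above.
-- Disabling sql_log_bin for the session keeps the change out of the binlog,
-- so it has to be repeated on every s1 host, one by one.
SET SESSION sql_log_bin = 0;
ALTER TABLE example_table ADD COLUMN example_column INT DEFAULT NULL;
SET SESSION sql_log_bin = 1;
```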
[16:37:15] btw. it seems that https://phabricator.wikimedia.org/T321126 hasn't reached labtestwiki, could someone do that? Thanks. :)
[16:42:52] zabe: AFAIK labtestwiki is not a WMF production wiki
[16:43:35] I think it lives in cloud or outside our network; andrew used to handle that, but I'm not sure if he is still doing it
[16:45:49] ok
[16:45:55] (someone in the cloud IRC channel may know more, sorry)
[16:48:34] zabe: I will get it done
[16:48:51] zabe: done
[16:53:32] (MysqlReplicationLag) firing: MySQL instance db1106:9104 has too large replication lag (1h 26m 11s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1106&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[16:53:39] ^ me
[16:58:32] (MysqlReplicationLag) firing: (3) MySQL instance db1106:9104 has too large replication lag (1h 0m 4s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[17:08:32] (MysqlReplicationLag) firing: (3) MySQL instance db1106:9104 has too large replication lag (8m 44s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[17:18:32] (MysqlReplicationLag) resolved: (3) MySQL instance db1106:9104 has too large replication lag (8m 44s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[18:27:36] thanks
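(Editor's note: for the MysqlReplicationLag alerts above, a minimal sketch of how the lag on an affected replica such as db1106 could be inspected and cleared by hand. The named replication connection 's1' is an assumption for a multi-source replica; the actual setup on that host may differ.)

```sql
-- Sketch only: inspect replication lag and, once the maintenance that caused
-- it is finished, resume the stopped connection. The connection name 's1'
-- is assumed for a multi-source replica.
SHOW ALL SLAVES STATUS\G   -- check Seconds_Behind_Master per connection
START SLAVE 's1';          -- resume the connection that was stopped
```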