[01:21:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1223:9104 has too large replication lag (11m 44s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1223&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[02:04:05] FIRING: MysqlReplicationThreadCountTooLow: MySQL instance db1223:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1223&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[05:21:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1223:9104 has too large replication lag (4h 11m 12s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1223&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[06:00:00] PROBLEM - MariaDB sustained replica lag on s3 on db1223 is CRITICAL: 1.696e+04 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1223&var-port=9104
[06:04:06] RESOLVED: MysqlReplicationThreadCountTooLow: MySQL instance db1223:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1223&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[06:09:48] FIRING: MysqlReplicationLag: MySQL instance db1223:9104@s3 has too large replication lag (3h 0m 26s). Its replication source is db1189.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1223&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[06:24:48] RESOLVED: MysqlReplicationLag: MySQL instance db1223:9104@s3 has too large replication lag (10m 23s). Its replication source is db1189.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1223&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[06:26:48] RESOLVED: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1223:9104 has too large replication lag (10m 23s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1223&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[06:29:00] RECOVERY - MariaDB sustained replica lag on s3 on db1223 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1223&var-port=9104
[08:53:52] ^-- expected? dumps? something else?
[08:57:39] Amir1 was working on s3 yesterday, maybe it was a side effect of the schema change?
[09:14:02] over in #wikimedia-sre T381142 was created in response to a village pump report
[09:14:02] T381142: db1223 (s3 eqiad candidate master) replication broken - https://phabricator.wikimedia.org/T381142
[09:17:00] ah ack thanks!
[09:31:53] Emperor: o/ ETA for the ms-be nodes - for codfw I'd say early next week, for eqiad hopefully end of next week.
[09:36:26] elukey: cool, thanks for the update :)
[14:29:34] Emperor: Making slow but steady progress: https://grafana.wikimedia.org/d/000000378/ladsgroup-test?orgId=1&viewPanel=26&from=now-7d&to=now
[14:31:28] when some containers are done, we should eyeball the db sizes (I suspect some sort of complicated VACUUM-related faff may be needed)
[14:32:31] yeah
[18:38:00] PROBLEM - MariaDB sustained replica lag on s7 on db2150 is CRITICAL: 10.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2150&var-port=9104
[18:39:00] RECOVERY - MariaDB sustained replica lag on s7 on db2150 is OK: (C)10 ge (W)5 ge 4.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2150&var-port=9104