[04:58:34] PROBLEM - MariaDB sustained replica lag on s4 on db1252 is CRITICAL: 8905 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1252&var-port=9104
[05:00:48] FIRING: MysqlReplicationLag: MySQL instance db1252:9104@s4 has too large replication lag (2h 3m 39s). Its replication source is db1244.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1252&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[05:02:48] FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db1252:9104 has too large replication lag (1h 57m 10s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1252&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[05:09:24] FYI, it looks like the schema change running on db1252 for T399249 has become stuck, and the downtime expired
[05:09:25] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[05:10:21] I've extended it for another 36h to give you folks time to investigate during business hours (it's already depooled https://sal.toolforge.org/log/LXi52JgB8tZ8Ohr03WP0)
[05:23:36] RECOVERY - MariaDB sustained replica lag on s4 on db1252 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1252&var-port=9104
[05:36:18] RESOLVED: MysqlReplicationLag: MySQL instance db1252:9104@s4 has too large replication lag (5m 32s). Its replication source is db1244.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1252&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
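(Context: a quick way to confirm a replica has actually caught up before repooling is to read Seconds_Behind_Master on the host itself. A minimal sketch using Python and PyMySQL; the host, user, and password below are placeholders, not production credentials, and the 5s/10s thresholds mirror the (W)5/(C)10 values in the alert above.)

    import pymysql

    # Connect to the lagging replica (placeholder credentials).
    conn = pymysql.connect(host="db1252.eqiad.wmnet", user="repl_check",
                           password="...", cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        # SHOW SLAVE STATUS exposes Seconds_Behind_Master on MariaDB replicas;
        # it is NULL (None) while the replication SQL thread is stopped.
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
        lag = status["Seconds_Behind_Master"]
        # Warning threshold is 5s, critical is 10s (per the alert definition).
        if lag is not None and lag < 5:
            print("caught up, OK to repool")
        else:
            print(f"still lagging: {lag}")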
[05:37:22] ^ I've removed the extended downtime, as the host is now back and repooling
[12:00:57] Thanks Scott
[12:35:50] catching up took longer than on the other hosts - just about long enough to page https://grafana-rw.wikimedia.org/d/bd60e6f6-11fc-47f4-a6ba-109c1aed251d/federico-s-mariadb-replication-dash?folderUid=Wagp6Ryik&from=2025-08-22T07%3A20%3A10.725Z&orgId=1&timezone=utc&to=2025-08-24T09%3A22%3A06.710Z - a downtime of 12 hours would have been better, to be on the safe side
[16:45:25] FIRING: SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:00:25] RESOLVED: SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:06:48] FIRING: [3x] MysqlReplicationLagPtHeartbeat: MySQL instance db1242:9104 has too large replication lag (11m 1s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[23:11:48] FIRING: [22x] MysqlReplicationLagPtHeartbeat: MySQL instance db1160:9104 has too large replication lag (15m 25s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[23:16:48] RESOLVED: [22x] MysqlReplicationLagPtHeartbeat: MySQL instance db1160:9104 has too large replication lag (15m 37s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[23:22:53] db1150 - I think that one is the backup being made
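(Context: the PtHeartbeat alerts above measure lag differently from SHOW SLAVE STATUS - pt-heartbeat periodically writes a wall-clock timestamp on the primary, and lag is the gap between the replica's copy of that row and the current time. A minimal sketch of the same check, assuming the default heartbeat.heartbeat table and UTC timestamps; connection details are placeholders.)

    import datetime
    import pymysql

    # Connect to the replica being checked (placeholder credentials).
    conn = pymysql.connect(host="db1160.eqiad.wmnet", user="repl_check", password="...")
    with conn.cursor() as cur:
        # The replica's copy of the heartbeat row trails real time by the
        # replication delay, independent of Seconds_Behind_Master.
        cur.execute("SELECT ts FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1")
        (ts,) = cur.fetchone()
        lag = datetime.datetime.utcnow() - datetime.datetime.fromisoformat(ts)
        print(f"pt-heartbeat lag: {lag.total_seconds():.1f}s")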