[04:58:34] PROBLEM - MariaDB sustained replica lag on s4 on db1252 is CRITICAL: 8905 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1252&var-port=9104
[05:00:48] FIRING: MysqlReplicationLag: MySQL instance db1252:9104@s4 has too large replication lag (2h 3m 39s). Its replication source is db1244.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1252&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[05:02:48] FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db1252:9104 has too large replication lag (1h 57m 10s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1252&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[05:09:24] FYI, it looks like the schema change running on db1252 for T399249 has become stuck, and the downtime expired
[05:09:25] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[05:10:21] I've extended it for another 36h to give you folks time to investigate during business hours (it's already depooled https://sal.toolforge.org/log/LXi52JgB8tZ8Ohr03WP0)
[05:23:36] RECOVERY - MariaDB sustained replica lag on s4 on db1252 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1252&var-port=9104
[05:36:18] RESOLVED: MysqlReplicationLag: MySQL instance db1252:9104@s4 has too large replication lag (5m 32s). Its replication source is db1244.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1252&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
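(Context: a quick way to confirm a replica has actually caught up before repooling is to read Seconds_Behind_Master on the host itself. A minimal sketch using Python and PyMySQL; the host, user, and password below are placeholders, not production credentials, and the 5s/10s thresholds mirror the (W)5/(C)10 values in the alert above.)

    import pymysql

    # Connect to the lagging replica (placeholder credentials).
    conn = pymysql.connect(host="db1252.eqiad.wmnet", user="repl_check",
                           password="...", cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        # SHOW SLAVE STATUS exposes Seconds_Behind_Master on MariaDB replicas;
        # it is NULL (None) while the replication SQL thread is stopped.
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
        lag = status["Seconds_Behind_Master"]
        # Warning threshold is 5s, critical is 10s (per the alert definition).
        if lag is not None and lag < 5:
            print("caught up, OK to repool")
        else:
            print(f"still lagging: {lag}")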
[05:37:22] ^ I've removed the extended downtime, as the host is now back and repooling
[12:00:57] Thanks Scott
[12:35:50] catching up took longer than on the other hosts - just about long enough to page https://grafana-rw.wikimedia.org/d/bd60e6f6-11fc-47f4-a6ba-109c1aed251d/federico-s-mariadb-replication-dash?folderUid=Wagp6Ryik&from=2025-08-22T07%3A20%3A10.725Z&orgId=1&timezone=utc&to=2025-08-24T09%3A22%3A06.710Z - a downtime of 12 hours would have been better, to be on the safe side
[16:45:25] FIRING: SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:00:25] RESOLVED: SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:06:48] FIRING: [3x] MysqlReplicationLagPtHeartbeat: MySQL instance db1242:9104 has too large replication lag (11m 1s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[23:11:48] FIRING: [22x] MysqlReplicationLagPtHeartbeat: MySQL instance db1160:9104 has too large replication lag (15m 25s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[23:16:48] RESOLVED: [22x] MysqlReplicationLagPtHeartbeat: MySQL instance db1160:9104 has too large replication lag (15m 37s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[23:22:53] db1150 - I think that one is the backup being made
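(Context: the PtHeartbeat alerts above measure lag differently from SHOW SLAVE STATUS - pt-heartbeat periodically writes a wall-clock timestamp on the primary, and lag is the gap between the replica's copy of that row and the current time. A minimal sketch of the same check, assuming the default heartbeat.heartbeat table and UTC timestamps; connection details are placeholders.)

    import datetime
    import pymysql

    # Connect to the replica being checked (placeholder credentials).
    conn = pymysql.connect(host="db1160.eqiad.wmnet", user="repl_check", password="...")
    with conn.cursor() as cur:
        # The replica's copy of the heartbeat row trails real time by the
        # replication delay, independent of Seconds_Behind_Master.
        cur.execute("SELECT ts FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1")
        (ts,) = cur.fetchone()
        lag = datetime.datetime.utcnow() - datetime.datetime.fromisoformat(ts)
        print(f"pt-heartbeat lag: {lag.total_seconds():.1f}s")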