[01:09:28] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 8.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:18] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[14:27:40] Emperor: not sure if known but ms-be2067 status seems to be a bit borked... looks like a broken disk but the MegaRAID and MD RAID checks are both ok. Other things are failing though.
[14:28:20] megaRAID seems increasingly useless at noticing failing disks :(
[14:30:47] * Emperor slightly drowning under too many things to fix at once
[14:31:30] (and the fix that might make swift-drive-audit useable again isn't getting rolled out locally any time soon, as it's not high enough priority compared to all the other fixes :( )
[14:33:35] Looks like ms-be2067 has at least two broken disks
[14:55:33] T331030 opened
[14:55:34] T331030: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030