[00:42:48] FIRING: [15x] MysqlReplicationLagPtHeartbeat: MySQL instance db1160:9104 has too large replication lag (11m 7s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[00:47:48] FIRING: [19x] MysqlReplicationLagPtHeartbeat: MySQL instance db1160:9104 has too large replication lag (15m 55s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[01:02:48] FIRING: [19x] MysqlReplicationLagPtHeartbeat: MySQL instance db1160:9104 has too large replication lag (29m 55s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[01:07:48] RESOLVED: [19x] MysqlReplicationLagPtHeartbeat: MySQL instance db1160:9104 has too large replication lag (30m 7s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[08:43:29] I have some follow-up questions from tonight's outage for when you're around
[09:51:08] arnaudb: do you think we should do any follow-up checks/actions?
[09:52:11] oh sorry, I missed your message volans. I haven't had the time to get to it yet, but I'll just do a pass on logs. I've already checked monitoring/repl etc. this morning and everything was OK
[09:53:46] I checked journalctl on the master: threadpool was full and the event ops.wmf_master_wikiuser_sleep logged its errors, in addition to (I think expected) semi-sync ON/OFF messages
[09:56:13] that may mean that there was a huge transaction to handle
[13:56:34] https://youtu.be/1mvVd-DRo7Y "so even wikipedia will be faster"
[13:56:42] let's not give an ETA right away :D