[01:09:08] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 6.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:09:14] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 5.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:09:24] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 18.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:10:42] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:10:48] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:10:58] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[11:25:09] Emperor: I was thinking more about T327253. Assuming the data is not where the metadata says it is (e.g. we could try to download the objects again with something other than rclone, and also compare with codfw), maybe the right way is to extract all possible leftover metadata (container, path and hash, if surviving), purge the entries from swift, and keep the list for other people to fix at a later time at T289996
[11:25:17] T327253: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253
[11:25:17] T289996: Media storage metadata inconsistent with Swift or corrupted in general - https://phabricator.wikimedia.org/T289996
[11:26:53] the only worry would be to make sure it is not happening now, e.g. that it is not caused by a swift bug (e.g. by checking the most recent case in the list)
[11:34:26] jynus: I did check a couple of the examples with the swift cli, and I'm confident that rclone is correct that these objects exist in the container listing but aren't GET/HEAD-able
[11:35:02] does the listing only have names? or does it have hashes, sizes or something else?
[11:35:21] basically the idea is to start purging while making sure all later efforts have some information to work with
[11:35:25] size, md5, creation time (I think, I could check)
[11:35:49] that nicely punts the "dig through the bowels of mw for details about these things" question :)
[11:36:13] So I would try to export those if possible, then purge (carefully)
[11:36:47] one thing we could do is wait for me to finish the backup update this quarter and do the purge afterwards
[11:37:50] and then send the list to the multimedia team on the ticket above ;-D
[11:45:19] I believe the frequent m1 lagging comes from heavy writes from bacula
[11:45:41] that should be mitigated when bacula is migrated to the new dbs
[11:49:38] That's a reasonable plan.
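(Editor's note: a minimal sketch of the export-then-purge idea discussed above, not the actual procedure used. It walks a container listing, records the metadata Swift still holds (name, size, md5/hash, creation time) for every object that is listed but not HEAD-able, writes that out for later follow-up on T289996, and only then removes the stale listing entry. The auth endpoint, credentials and container name are placeholders.)

```python
import csv

from swiftclient.client import Connection
from swiftclient.exceptions import ClientException

# Placeholder connection details, not real values.
conn = Connection(
    authurl="https://swift.example.org/auth/v1.0",
    user="account:user",
    key="secret",
)

container = "example-container"  # placeholder

# Full listing of the container as Swift reports it.
_, listing = conn.get_container(container, full_listing=True)

with open("leftover-metadata.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["container", "name", "bytes", "hash", "last_modified"])
    for obj in listing:
        try:
            conn.head_object(container, obj["name"])
        except ClientException as exc:
            if exc.http_status != 404:
                raise  # only a 404 means "listed but not extant"
            # Export the surviving metadata first, so the list can go to T289996 ...
            writer.writerow([container, obj["name"], obj["bytes"],
                             obj["hash"], obj["last_modified"]])
            # ... then purge the stale entry. A listed-but-missing object may
            # itself 404 on DELETE, so treat that as best-effort cleanup.
            try:
                conn.delete_object(container, obj["name"])
            except ClientException:
                pass
```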
[11:50:11] I doubt the multimedia team will complain about the extra load
[11:51:25] /o\
[14:08:32] (MysqlReplicationLag) firing: MySQL instance db1198:9104 has too large replication lag (21h 34m 32s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1198&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[14:08:38] ^ me
[14:08:42] the host crashed yesterday
[14:08:45] (again)
[15:23:32] (MysqlReplicationLag) resolved: MySQL instance db1198:9104 has too large replication lag (20m 59s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1198&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
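(Editor's note: a rough sketch of the kind of check behind the lag alerts in this log: ask a replica how far behind it is and compare against the warning/critical thresholds quoted in the sustained-lag alert text, "(C)2 ge (W)1 ge 0". The host and credentials are placeholders and the production checks are Icinga/Prometheus based; this just illustrates the measurement.)

```python
import pymysql
from pymysql.cursors import DictCursor


def replica_lag_seconds(host: str, user: str, password: str):
    """Return Seconds_Behind_Master for a MariaDB replica, or None if not replicating."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=DictCursor)
    try:
        with conn.cursor() as cur:
            # MariaDB; newer versions also accept SHOW REPLICA STATUS.
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            if not row:
                return None  # replication not configured
            return row["Seconds_Behind_Master"]  # None if the SQL thread is stopped
    finally:
        conn.close()


# Placeholder host and credentials.
lag = replica_lag_seconds("db-replica.example", "monitor", "secret")
if lag is None:
    print("CRITICAL: replication not running")
elif lag >= 2:
    print(f"CRITICAL: {lag} ge 2")
elif lag >= 1:
    print(f"WARNING: {lag} ge 1")
else:
    print(f"OK: replica lag {lag}s")
```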