[01:10:01] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 15 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:10:35] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 13 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:49] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:12:23] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[13:11:07] Mar 10 00:00:11 ms-fe2013 proxy-server: ERROR Insufficient Storage 10.192.16.160:6029/sdv1 (txn: tx9074dc58dc46436086063-00640a730b) (client_ip: 2601:646:8500:fc30:1870:ae88:78a0:a41e)
[13:11:19] Mar 10 00:00:13 ms-fe2013 proxy-server: ERROR Insufficient Storage 10.192.16.160:6033/sdz1 (txn: tx227589674ffc4d488564e-00640a730d) (client_ip: 98.49.254.105)
[13:11:41] looks like ms-be2067 is saying ENOSPC for these two failed devices (cf. T331030)
[13:11:41] T331030: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030
[13:16:39] though oddly in the first case the corresponding request seems to have been a GET which we said 304 to
[13:17:43] likewise the second; both were thumb requests, though, so it was presumably the attempt to write the thumb that was failing
[13:18:15] Given it's now been over a week since I opened the ticket, I'm inclined to fail those drives out of the rings, although it'll cause a bunch of extra load.
[13:18:20] godog: ^-- seem reasonable?
[13:20:22] [historically we've waited for drive swaps, but they seem to take a long time, and I worry it's contributing to some of the errors we're seeing]
[13:31:42] (contrariwise, e.g. swift-dispersion-report correctly reports that the devices are unmounted)
[13:44:24] also noted ms-be2067's oldest completion was 24 days ago, restarted the object-replicator :-/
[16:54:48] (the Insufficient Storage errors predate the rise in 502s at 13:00 on the 19th in both codfw and eqiad, so I think this was a red herring)
[17:05:28] ...and in any case the frontend proxy-server.log shows 500s, whereas the ATS graph notes a rise in 502s, which seem not to correspond with any swift log entries
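
For reference, "failing those drives out of the rings" would amount to removing the two device entries and rebalancing with swift-ring-builder. The sketch below is only illustrative: the object.builder filename is an assumption (the actual builder names and the step of pushing the rebuilt .ring.gz files out to the swift hosts are not shown); the ip:port/device search values are taken from the proxy-server errors above.

#!/usr/bin/env python3
"""Hedged sketch: drop the two failed ms-be2067 devices from the Swift
object ring and rebalance. Builder filename and distribution of the
resulting ring files are assumptions, not the actual WMF procedure."""
import subprocess

BUILDER = "object.builder"  # assumed builder file; adjust for the real ring layout

# Device search values as reported in the Insufficient Storage errors.
FAILED_DEVICES = [
    "10.192.16.160:6029/sdv1",
    "10.192.16.160:6033/sdz1",
]

def ring_builder(*args):
    """Run a swift-ring-builder subcommand, echoing it for the operator."""
    cmd = ["swift-ring-builder", BUILDER, *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for dev in FAILED_DEVICES:
    # 'remove' marks the device for removal; its partitions get reassigned
    # on the next rebalance, which is the extra load mentioned above.
    ring_builder("remove", dev)

# Rebalance the ring; the new .ring.gz then still has to be deployed to
# every swift frontend and backend host (mechanism not shown here).
ring_builder("rebalance")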