[01:10:01] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 15 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:10:35] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 13 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:49] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:12:23] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[13:11:07] Mar 10 00:00:11 ms-fe2013 proxy-server: ERROR Insufficient Storage 10.192.16.160:6029/sdv1 (txn: tx9074dc58dc46436086063-00640a730b) (client_ip: 2601:646:8500:fc30:1870:ae88:78a0:a41e)
[13:11:19] Mar 10 00:00:13 ms-fe2013 proxy-server: ERROR Insufficient Storage 10.192.16.160:6033/sdz1 (txn: tx227589674ffc4d488564e-00640a730d) (client_ip: 98.49.254.105)
[13:11:41] looks like ms-be2067 is saying ENOSPC for these two failed devices (cf. T331030)
[13:11:41] T331030: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030
[13:16:39] though oddly in the first case the corresponding request seems to have been a GET which we said 304 to
[13:17:43] likewise the second; both were thumb requests, though, so it was presumably the attempt to write the thumb that was failing
[13:18:15] Given it's now been over a week since I opened the ticket, I'm inclined to fail those drives out of the rings, although it'll cause a bunch of extra load.
[13:18:20] godog: ^-- seem reasonable?
[13:20:22] [historically we've waited for drive swaps, but they seem to take a long time, and I worry it's contributing to some of the errors we're seeing]
[13:31:42] (contrariwise, e.g. swift-dispersion-report correctly reports that the devices are unmounted)
[13:44:24] also noted ms-be2067's oldest completion was 24 days ago, restarted the object-replicator :-/
[16:54:48] (the Insufficient Storage errors predate the rise in 502s at 13:00 on the 19th in both codfw and eqiad, so I think this was a red herring)
[17:05:28] ...and in any case the frontend proxy-server.log shows 500s, whereas the ATS graph notes a rise in 502s, which seem not to correspond with any swift log entries
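
For reference, "failing those drives out of the rings" would amount to removing the two device entries and rebalancing with swift-ring-builder. The sketch below is only illustrative: the object.builder filename is an assumption (the actual builder names and the step of pushing the rebuilt .ring.gz files out to the swift hosts are not shown); the ip:port/device search values are taken from the proxy-server errors above.

#!/usr/bin/env python3
"""Hedged sketch: drop the two failed ms-be2067 devices from the Swift
object ring and rebalance. Builder filename and distribution of the
resulting ring files are assumptions, not the actual WMF procedure."""
import subprocess

BUILDER = "object.builder"  # assumed builder file; adjust for the real ring layout

# Device search values as reported in the Insufficient Storage errors.
FAILED_DEVICES = [
    "10.192.16.160:6029/sdv1",
    "10.192.16.160:6033/sdz1",
]

def ring_builder(*args):
    """Run a swift-ring-builder subcommand, echoing it for the operator."""
    cmd = ["swift-ring-builder", BUILDER, *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for dev in FAILED_DEVICES:
    # 'remove' marks the device for removal; its partitions get reassigned
    # on the next rebalance, which is the extra load mentioned above.
    ring_builder("remove", dev)

# Rebalance the ring; the new .ring.gz then still has to be deployed to
# every swift frontend and backend host (mechanism not shown here).
ring_builder("rebalance")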