[07:52:50] apparently the failed disk in cloudcephmon1004 was replaced ages ago but the sw raid rebuild was not triggered, so I've now done so
[09:47:37] quarry broke again this weekend. T395237 seems to be the cause
[09:47:38] T395237: quarry is leaking tmp files - https://phabricator.wikimedia.org/T395237
[09:50:04] also T395238
[09:50:05] T395238: quarry deployment fails with S3 403 - https://phabricator.wikimedia.org/T395238
[10:04:18] https://github.com/toolforge/quarry/pull/82
[11:13:51] there are some alerts about cloudsw1-b1-codfw.mgmt
[11:14:27] yes, I'm rebooting things connected to it
[11:15:06] ok, the alert looked like the switch itself was down, but I guess it's only a connection that is down?
[11:15:50] yeah. the alerts are supposed to get a lot better soon when they move from icinga to prometheus
[11:16:44] nice
[16:42:09] * dhinus offline