[07:43:55] ms-be2045 has been solid over the weekend (and I ran stress against it for 66 hours), so I think I'll put it back into the ring later
[07:46:18] SGTM
[08:25:23] Emperor: I'll be kicking off the last rebalance for new eqiad hw
[08:29:05] 👍
[08:50:00] godog: I went looking at swift-recon -r in eqiad and codfw (apropos whether it says something useful about unfinished rebalancing); in eqiad oldest/newest completions are ~20m apart, but in codfw (which I don't think is rebalancing), they're 11 days apart. cf https://phabricator.wikimedia.org/P17451 I think I'm misunderstanding something here...
[08:59:48] Emperor: yeah 11 days apart shouldn't be the case in normal circumstances, looks like ms-be2036 is the culprit, I'm tempted to bounce object-replicator there
[09:01:24] I'll do so
[09:03:43] (ah, that's the "Oldest completion" host)
[09:11:21] that's correct yeah
[09:16:04] it's catching up now
[10:47:48] mvernon@ms-be2045:~$ sudo apt-get remove stress # if only it was always this easy to de-stress ;-)
[10:50:38] db1119 root cause doesn't seem clear, but the immediate cause seems to be a prometheus connection pileup: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=40&orgId=1&var-job=All&var-server=db1119&var-port=9104&from=1633768746685&to=1633773318550 (which max_user_connections prevented from getting worse)
[10:56:30] sadly the metrics that could give us some insight are precisely what was lost :-(