[07:43:55] <Emperor>	 ms-be2045 has been solid over the weekend (and I ran stress against it for 66 hours), so I think I'll put it back into the ring later
[07:46:18] <godog>	 SGTM
[08:25:23] <godog>	 Emperor: I'll be kicking off the last rebalance for new eqiad hw
[08:29:05] <Emperor>	 👍
[08:50:00] <Emperor>	 godog: I went looking at swift-recon -r in eqiad and codfw (apropos whether it says something useful about unfinished rebalancing); in eqiad oldest/newest completions are ~20m apart, but in codfw (which I don't think is rebalancing), they're 11 days apart. cf https://phabricator.wikimedia.org/P17451 I think I'm misunderstanding something here...
[08:59:48] <godog>	 Emperor: yeah 11 days apart shouldn't be the case in normal circumstances, looks like ms-be2036 is the culprit, I'm tempted to bounce object-replicator there
[09:01:24] <godog>	 I'll do so
[09:03:43] <Emperor>	 (ah, that's the "Oldest completion" host)
[09:11:21] <godog>	 that's correct yeah
[09:16:04] <Emperor>	 it's catching up now
[10:47:48] <Emperor>	 mvernon@ms-be2045:~$ sudo apt-get remove stress # if only it was always this easy to de-stress ;-)
[10:50:38] <jynus>	 db1119 root cause doesn't seem clear, but immediate cause for db1119 seems to be prometheus connection pileup: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=40&orgId=1&var-job=All&var-server=db1119&var-port=9104&from=1633768746685&to=1633773318550 (which max_user_connections prevented from getting worse)
[10:56:30] <jynus>	 sadly the metrics that could give us some insight are precisely what was lost :-(