[04:45:51] Going to keep warming up eqiad during the day
[04:46:02] In case you see weird graphs in eqiad
[04:46:57] * majavah unplugs the air conditioning
[04:48:11] XDD
[07:48:29] * Emperor still a bit traumatised from the time the Sanger DC aircon failed :-/
[09:39:46] ok first swift small rebalance kicked off in eqiad, this is basically the commands in 5d82ce6966 + 'make' in swift-ring and then 'make deploy'
[09:40:24] then puppet deploys the new rings on the next run
[09:41:08] this is swift-ring.git ?
[09:41:29] that's correct yes
[09:43:05] I started with a small object weight to validate things are working as expected, shouldn't take too long to converge and then we can use larger weights
[09:43:24] pretty much similar to what happened in codfw
[09:44:19] where the hosts are not yet at full weight but that's okay, I wanted to at least get the hosts some weight
[10:02:40] I will measure the backup speed under the new config, once it has been active for enough time to average it
[10:04:55] I see the big slowdown in requests at 9:30, will measure it at the 10:30 mark or so
[10:26:05] yeah expected, so basically the cluster now is rebalancing and the object dispersion is going to go back up to 99.5 here
[10:26:08] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&from=now-3h&to=now-1m&var-DC=eqiad&var-prometheus=eqiad%20prometheus%2Fops&refresh=1m
[10:26:15] 99.5 because we have two broken disks ATM
[10:26:20] usually at 100%
[10:27:07] that "dispersion" covers, I think, something like 6 or 7 percent of the object space, a good indicator of whether objects are where they are supposed to be
[10:28:26] the complementary indicator being 'swift-recon -r' on e.g. ms-fe1005, which shows oldest and newest completed replication; they are going to drift apart and then come close again once rebalancing has finished
[10:29:25] the lowest level I am working with is the proxy API, so re: backups that shouldn't affect me
[10:29:42] if I get errors, those are recorded, and normally I retry them once again later
[10:30:29] but most errors I am getting so far are due to the mw model, not infrastructure
[12:54:35] hi, re: T290591 did I understand correctly that kormat's patch would band-aid the immediate problem for now, but it'll still be useful to imagine how things would work in alertmanager?
[12:54:35] T290591: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591
[13:38:06] godog: alertmanager is easy, unless i'm missing something
[13:38:24] the fundamental issue we've run into is that an icinga 'service' is stringly-typed
[13:38:39] and downtimes can only select services that are currently known to icinga
[13:38:46] am doesn't have either of those issues
[13:39:03] you can use regexes, and if a silence doesn't match anything now, it will still be evaluated and take effect if something shows up later
[13:41:54] kormat: indeed, thanks for clarifying the core of the issue
[13:42:32] yeah that you can't downtime services that don't exist I understand, but it bugs me to no end
[13:42:49] godog: btw, is our team unusual in having alerts be critical/non-critical depending on whether a given DC is active or not?
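A rough sketch of the swift-ring-builder operations behind the rebalance described earlier in the log. The real change goes through the swift-ring.git Makefile ('make' / 'make deploy', per commit 5d82ce6966); the builder file name, device ID, and weight below are illustrative assumptions, not the actual values used.

    # Give a (hypothetical) device d123 some initial weight in the object ring;
    # starting with a small weight, as described above, limits how much data
    # moves in the first rebalance pass.
    swift-ring-builder object.builder set_weight d123 100

    # Recompute partition placement for the new weights and write the ring out;
    # puppet then distributes the resulting ring files on its next run.
    swift-ring-builder object.builder rebalance

    # Watch convergence from a frontend host (e.g. ms-fe1005): oldest/newest
    # completed replication drift apart during the rebalance and converge
    # again once it has finished.
    swift-recon -r

    # Dispersion report: samples a fraction of the partition space and reports
    # what percentage of replicas are where the ring expects them (the kind of
    # figure behind the ~99.5% dispersion discussed above).
    swift-dispersion-report

Weights are then raised in a few further steps rather than all at once, so each rebalance only moves a bounded amount of data.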
[13:43:37] good question, I'd say yes off the top of my head
[13:43:44] huh, good to know :)
[13:44:04] maybe that explains why volans was so horrified ;)
[13:44:51] lolz
[13:45:09] lol
[13:45:27] but yeah, paging services afaict usually are active all the time irrespective of which site is active
[13:46:16] some overloading of 'active' there but you get the idea
[13:55:58] this mostly goes away in a multi-DC world, but we trade slightly simpler alerting for vastly more painful maintenance :(
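To make the regex-silence point above concrete, a minimal amtool sketch, assuming a hypothetical alert name, instance label, and Alertmanager URL. Because matchers can be regular expressions and silences are evaluated against alerts as they arrive, the silence also covers matching alerts that only start firing later, unlike an icinga downtime, which can only select services icinga already knows about.

    # Pre-create a 4h silence for read-only alerts on a hypothetical set of
    # hosts; it takes effect for any alert that matches the regexes, including
    # alerts that do not exist yet when the silence is created.
    amtool silence add \
      --alertmanager.url=http://alertmanager.example.org:9093 \
      --author="sre" \
      --comment="db maintenance, T290591" \
      --duration="4h" \
      'alertname=~"MariaDBReadOnly.*"' \
      'instance=~"db1[0-9]{3}.*"'

The same matchers can be entered through the Alertmanager web UI; the regex matching is what removes the "service must already exist" limitation.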