[04:45:51] Going to keep warming up eqiad during the day
[04:46:02] In case you see weird graphs in eqiad
[04:46:57] * majavah unplugs the air conditioning
[04:48:11] XDD
[07:48:29] * Emperor still a bit traumatised from the time the Sanger DC aircon failed :-/
[09:39:46] ok first swift small rebalance kicked off in eqiad, this is basically the commands in 5d82ce6966 + 'make' in swift-ring and then 'make deploy'
[09:40:24] then puppet deploys the new rings on the next run
[09:41:08] this is swift-ring.git ?
[09:41:29] that's correct yes
[09:43:05] I started with a small object weight to validate things are working as expected, shouldn't take too long to converge and then we can use larger weights
[09:43:24] pretty much similar to what happened in codfw
[09:44:19] where the hosts are not yet at full weight but that's okay, I wanted to at least get the hosts some weight
[10:02:40] I will measure the backup speed under the new config, once it has been active for enough time to average it
[10:04:55] I see the big slowdown in requests at 9:30, will measure it at the 10:30 mark or so
[10:26:05] yeah expected, so basically the cluster now is rebalancing and the object dispersion is going to go back up to 99.5 here
[10:26:08] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&from=now-3h&to=now-1m&var-DC=eqiad&var-prometheus=eqiad%20prometheus%2Fops&refresh=1m
[10:26:15] 99.5 because we have two broken disks ATM
[10:26:20] usually at 100%
[10:27:07] that "dispersion" covers, I think, something like 6 or 7 percent of the object space, a good indicator of whether objects are where they are supposed to be
[10:28:26] the complementary indicator being 'swift-recon -r' on e.g. ms-fe1005, which shows oldest and newest completed replication; they are going to drift apart and then come close again once rebalancing has finished
[10:29:25] the lowest level I am working with is the proxy API, so re: backups that shouldn't affect me
[10:29:42] if I get errors, those are recorded, and normally I retry them once again later
[10:30:29] but most errors I am getting so far are due to the mw model, not infrastructure
[12:54:35] hi, re: T290591 did I understand correctly that kormat's patch would band-aid the immediate problem for now, but it'll still be useful to imagine how things would work in alertmanager?
[12:54:35] T290591: Database read_only alert has a changing service description - https://phabricator.wikimedia.org/T290591
[13:38:06] godog: alertmanager is easy, unless i'm missing something
[13:38:24] the fundamental issue we've run into is that an icinga 'service' is stringly-typed
[13:38:39] and downtimes can only select services that are currently known to icinga
[13:38:46] am doesn't have either of those issues
[13:39:03] you can use regexes, and if a silence doesn't match anything now, it will still be evaluated and take effect if something shows up later
[13:41:54] kormat: indeed, thanks for clarifying the core of the issue
[13:42:32] yeah that you can't downtime services that don't exist I understand, but it bugs me to no end
[13:42:49] godog: btw, is our team unusual in having alerts be critical/non-critical depending on whether a given DC is active or not?
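A rough sketch of the swift-ring-builder operations behind the rebalance described earlier in the log. The real change goes through the swift-ring.git Makefile ('make' / 'make deploy', per commit 5d82ce6966); the builder file name, device ID, and weight below are illustrative assumptions, not the actual values used.

    # Give a (hypothetical) device d123 some initial weight in the object ring;
    # starting with a small weight, as described above, limits how much data
    # moves in the first rebalance pass.
    swift-ring-builder object.builder set_weight d123 100

    # Recompute partition placement for the new weights and write the ring out;
    # puppet then distributes the resulting ring files on its next run.
    swift-ring-builder object.builder rebalance

    # Watch convergence from a frontend host (e.g. ms-fe1005): oldest/newest
    # completed replication drift apart during the rebalance and converge
    # again once it has finished.
    swift-recon -r

    # Dispersion report: samples a fraction of the partition space and reports
    # what percentage of replicas are where the ring expects them (the kind of
    # figure behind the ~99.5% dispersion discussed above).
    swift-dispersion-report

Weights are then raised in a few further steps rather than all at once, so each rebalance only moves a bounded amount of data.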
[13:43:37] good question, I'd say yes off the top of my head
[13:43:44] huh, good to know :)
[13:44:04] maybe that explains why volans was so horrified ;)
[13:44:51] lolz
[13:45:09] lol
[13:45:27] but yeah, paging services afaict usually are active all the time irrespective of which site is active
[13:46:16] some overloading of 'active' there but you get the idea
[13:55:58] this mostly goes away in a multi-DC world, but we trade slightly simpler alerting for vastly more painful maintenance :(
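To make the regex-silence point above concrete, a minimal amtool sketch, assuming a hypothetical alert name, instance label, and Alertmanager URL. Because matchers can be regular expressions and silences are evaluated against alerts as they arrive, the silence also covers matching alerts that only start firing later, unlike an icinga downtime, which can only select services icinga already knows about.

    # Pre-create a 4h silence for read-only alerts on a hypothetical set of
    # hosts; it takes effect for any alert that matches the regexes, including
    # alerts that do not exist yet when the silence is created.
    amtool silence add \
      --alertmanager.url=http://alertmanager.example.org:9093 \
      --author="sre" \
      --comment="db maintenance, T290591" \
      --duration="4h" \
      'alertname=~"MariaDBReadOnly.*"' \
      'instance=~"db1[0-9]{3}.*"'

The same matchers can be entered through the Alertmanager web UI; the regex matching is what removes the "service must already exist" limitation.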