[01:10:05] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 18.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:10:13] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 8.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:55] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:12:05] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[06:24:07] I am going to start disconnecting eqiad -> codfw replication
[10:47:12] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[10:48:58] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[13:23:47] Emperor: re: switches upgrade tomorrow (https://phabricator.wikimedia.org/T329073), would you mind taking care of thanos-fe1001?
[13:35:07] godog: sure; do you want to move the thanos-web service now, so I just have to depool/pool?
[13:39:15] Emperor: sure, I'll move it to thanos-fe1002 now; then tomorrow, on re-pool of 1001, 1002 will need a thanos-web depool too
[13:40:31] {{done}}
[13:40:54] thanks; is there tooling for shifting that service around, or is it confctl runes?
[13:43:06] the latter, documented here under "thanos.wikimedia.org": https://wikitech.wikimedia.org/wiki/Thanos#Pool_/_depool_a_site
[14:20:33] o/
[15:11:16] Emperor: should I move forward with these front-ends then?
[15:12:22] urandom: please do :)
[15:18:10] * urandom does
[15:42:22] Emperor: "Run puppet on the remaining frontend hosts to make sure firewall rules are updated" <-- is that "in the same DC", or "all front-ends"?
[15:42:59] not that running puppet where it wasn't necessary would be a problem, but I'm curious
[15:43:22] running puppet is the cause of most problems though :-P
[15:43:57] it's the cause-of and solution-to all the world's problems
[15:44:11] indeed :D
[15:44:20] it's the software equivalent of beer
[15:55:49] urandom: I do _all_, but all in the same DC should be sufficient
[15:56:02] Emperor: OK, I did all
[15:56:43] I think the link to pooling is wrong/dead; maybe https://wikitech.wikimedia.org/wiki/Conftool#Add_a_server_node_to_a_service is the correct one?
[15:58:18] sigh.
[15:59:27] I feel like I've been here before: the nodes (though defined under `conftool-data/node`) won't show at https://config-master.wikimedia.org/pybal/codfw/swift until they've been given a weight and pooled
[15:59:33] urandom: yes, that looks about right - set weight and pool. You need both the swift-fe and nginx services, and IIRC the usual weight is 40
[15:59:53] (so typically 4 confctl invocations per host)
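[ed. note] A minimal shell sketch of the "4 confctl invocations per host" described above. The selector style is modelled on the confctl command quoted later in the log (16:05:35); the exact tag names and the set/ action syntax are assumptions, so check https://wikitech.wikimedia.org/wiki/Conftool before running anything.
    # Run from a cumin/conftool host. ms-fe2014 is one of the two new
    # codfw frontends being pooled in this log.
    # swift-fe service: set weight, then pool (assumed set/ syntax):
    sudo confctl select name=ms-fe2014.codfw.wmnet,service=swift-fe set/weight=40
    sudo confctl select name=ms-fe2014.codfw.wmnet,service=swift-fe set/pooled=yes
    # nginx service: the same two steps, for 4 invocations per host in total:
    sudo confctl select name=ms-fe2014.codfw.wmnet,service=nginx set/weight=40
    sudo confctl select name=ms-fe2014.codfw.wmnet,service=nginx set/pooled=yes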
[16:03:49] Emperor: ok, based on https://config-master.wikimedia.org/pybal/codfw/swift, this doesn't seem to have worked
[16:04:22] doh
[16:04:29] nevermind... I guess there is a delay
[16:05:18] I see ms-fe2014 still weight 0 / inactive, and ms-fe2013 weight 40 / pooled yes
[16:05:35] on a cumin node: sudo confctl select cluster=swift,dc=codfw get
[16:05:37] yeah, I started w/ 2013
[16:05:46] might be quicker/easier
[16:06:32] Emperor: I think we had a race between me saying and you checking; I had started with one host first :)
[16:06:47] ah
[16:07:04] you can do multiple hosts at the same time if needed ;)
[16:07:37] volans: I had two hosts to add, and only wanted to mess up half as bad if I was doing it Wrong™
[16:07:45] * urandom maths
[16:08:02] :D
[16:10:15] Emperor: it's definitely doing Stuff; what else would you look at to be sure things were good?
[16:16:51] I think if the kitten test worked OK and the dashboards look happy, I'm happy
[16:17:35] (and they do indeed look OK)
[16:21:02] heh, the "kitten test"
[16:36:58] it's important :)
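[ed. note] For verification, the log combines the confctl state dump (16:05:35) with the generated pybal page on config-master. A sketch of that check; the output formatting shown in the comments is illustrative, not taken from the log:
    # From a cumin node: dump conftool state for all codfw swift nodes
    # (this command is quoted verbatim in the log at 16:05:35).
    sudo confctl select cluster=swift,dc=codfw get
    # Illustrative output, one JSON object per node/service, e.g.:
    #   {"ms-fe2014.codfw.wmnet": {"weight": 40, "pooled": "yes"}, "tags": "dc=codfw,cluster=swift,service=swift-fe"}
    # After a short propagation delay (as seen in the log around 16:04),
    # the pooled nodes should also appear at:
    #   https://config-master.wikimedia.org/pybal/codfw/swift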