[01:10:05] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 18.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:10:13] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 8.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:55] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:12:05] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[06:24:07] I am going to start disconnecting eqiad -> codfw replication
[10:47:12] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[10:48:58] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[13:23:47] Emperor: re: switches upgrade tomorrow (https://phabricator.wikimedia.org/T329073), would you mind taking care of thanos-fe1001?
[13:35:07] godog: sure; do you want to move the thanos-web service now, so I just have to depool/pool?
[13:39:15] Emperor: sure, I'll move it to thanos-fe1002 now; then tomorrow, on re-pool of 1001, 1002 will need a thanos-web depool too
[13:40:31] {{done}}
[13:40:54] thanks; is there tooling for shifting that service around, or is it confctl runes?
[13:43:06] the latter, documented here under "thanos.wikimedia.org": https://wikitech.wikimedia.org/wiki/Thanos#Pool_/_depool_a_site
[14:20:33] o/
[15:11:16] Emperor: should I move forward with these front-ends then?
[15:12:22] urandom: please do :)
[15:18:10] * urandom does
[15:42:22] Emperor: "Run puppet on the remaining frontend hosts to make sure firewall rules are updated" <-- is that "in the same DC", or "all front-ends"?
[15:42:59] not that running puppet where it wasn't necessary would be a problem, but I'm curious
[15:43:22] running puppet is the cause of most problems though :-P
[15:43:57] it's the cause-of and solution-to all the world's problems
[15:44:11] indeed :D
[15:44:20] it's the software equivalent of beer
[15:55:49] urandom: I do _all_, but all in the same DC should be sufficient
[15:56:02] Emperor: OK, I did all
[15:56:43] I think the link to pooling is wrong/dead; maybe https://wikitech.wikimedia.org/wiki/Conftool#Add_a_server_node_to_a_service is the correct one?
[15:58:18] sigh.
[15:59:27] I feel like I've been here before: the nodes (though defined under `conftool-data/node`) won't show at https://config-master.wikimedia.org/pybal/codfw/swift until they've been given a weight and pooled
[15:59:33] urandom: yes, that looks about right - set weight and pool. You need both the swift-fe and nginx services, and IIRC the usual weight is 40
[15:59:53] (so typically 4 confctl invocations per host)
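[ed. note] A minimal shell sketch of the "4 confctl invocations per host" described above. The selector style is modelled on the confctl command quoted later in the log (16:05:35); the exact tag names and the set/ action syntax are assumptions, so check https://wikitech.wikimedia.org/wiki/Conftool before running anything.
    # Run from a cumin/conftool host. ms-fe2014 is one of the two new
    # codfw frontends being pooled in this log.
    # swift-fe service: set weight, then pool (assumed set/ syntax):
    sudo confctl select name=ms-fe2014.codfw.wmnet,service=swift-fe set/weight=40
    sudo confctl select name=ms-fe2014.codfw.wmnet,service=swift-fe set/pooled=yes
    # nginx service: the same two steps, for 4 invocations per host in total:
    sudo confctl select name=ms-fe2014.codfw.wmnet,service=nginx set/weight=40
    sudo confctl select name=ms-fe2014.codfw.wmnet,service=nginx set/pooled=yes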
[16:03:49] Emperor: ok, based on https://config-master.wikimedia.org/pybal/codfw/swift, this doesn't seem to have worked
[16:04:22] doh
[16:04:29] nevermind... I guess there is a delay
[16:05:18] I see ms-fe2014 still weight 0 / inactive, and ms-fe2013 weight 40 / pooled yes
[16:05:35] on a cumin node: sudo confctl select cluster=swift,dc=codfw get
[16:05:37] yeah, I started w/ 2013
[16:05:46] might be quicker/easier
[16:06:32] Emperor: I think we had a race between me saying and you checking; I had started with one host first :)
[16:06:47] ah
[16:07:04] you can do multiple hosts at the same time if needed ;)
[16:07:37] volans: I had two hosts to add, and only wanted to mess up half as bad if I was doing it Wrong™
[16:07:45] * urandom maths
[16:08:02] :D
[16:10:15] Emperor: it's definitely doing Stuff; what else would you look at to be sure things were good?
[16:16:51] I think if the kitten test worked OK and the dashboards look happy, I'm happy
[16:17:35] (and they do indeed look OK)
[16:21:02] heh, the "kitten test"
[16:36:58] it's important :)
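[ed. note] For verification, the log combines the confctl state dump (16:05:35) with the generated pybal page on config-master. A sketch of that check; the output formatting shown in the comments is illustrative, not taken from the log:
    # From a cumin node: dump conftool state for all codfw swift nodes
    # (this command is quoted verbatim in the log at 16:05:35).
    sudo confctl select cluster=swift,dc=codfw get
    # Illustrative output, one JSON object per node/service, e.g.:
    #   {"ms-fe2014.codfw.wmnet": {"weight": 40, "pooled": "yes"}, "tags": "dc=codfw,cluster=swift,service=swift-fe"}
    # After a short propagation delay (as seen in the log around 16:04),
    # the pooled nodes should also appear at:
    #   https://config-master.wikimedia.org/pybal/codfw/swift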