[08:27:24] I am switching pc2 back
[10:16:34] Would someone like to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/993664 please? Taking the drained codfw nodes out of the rings
[10:16:57] (if you want to double-check they're fully drained I left the rune in the commit message :) )
[10:25:33] done
[10:25:39] Thanks :)
[11:46:27] hi, can I temporarily take over some of the db test hosts?
[11:47:37] e.g. db1133?
[11:55:28] yeah, no problem from my side
[12:02:49] (PuppetFailure) firing: Puppet has failed on ms-be2049:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:08:12] I will not touch it except to test a recovery, and then I will delete the data
[12:29:46] oh, that's a being-drained node with a dead disk
[12:31:03] (silenced)
[13:29:13] I finished working with db1133
[13:47:18] there are dump grants from eqiad on codfw, for dumping
[13:47:48] is it possible that cloning may have caused grants to move around?
[13:48:24] I will ask Amir later to drop a few
[14:52:35] topranks: I am not sure we are going to be able to be ready for all those codfw dates
[14:52:43] It implies multiple switchovers within days
[14:53:25] marostegui: ok, well we can work with you to try and come up with an alternate schedule
[14:53:52] topranks: Yeah, I am also going to be out the whole week of the 12th. I will probably leave all those hosts to arnaudb and Amir1, but still, it is quite tight
[14:54:10] is the main issue the time to prepare? or is it that doing a change on successive days is unrealistic given the time it takes to do the draining etc.?
[14:54:19] Probably both
[14:54:34] But I still need to process all this
[14:54:42] I have a meeting later with arnaudb and I will discuss it
[14:55:01] marostegui: ok thanks
[14:55:28] I guess the main question in my head is whether we should push back the kick-off date, or stagger the moves over a longer period of time.
[14:55:33] Sounds like perhaps we should do both
[14:56:10] topranks: Probably leaving a bit more space between dates would help, but as I said, I am still reading all the tasks and seeing which hosts/roles are affected
[14:57:38] I'm actually off all this week myself, which doesn't help. If you can flag on-task that you won't be ready, we can make a call next week on whether to proceed with partial rack moves (skipping db hosts etc.) or do a full re-schedule
[15:46:19] topranks: https://phabricator.wikimedia.org/T355863 Thu is the 15th, Tue is the 13th, which one is the right one?
[15:50:29] marostegui: good spot! Tue the 13th is correct and matches what I have elsewhere; I corrected the date on the task
[15:50:47] ok
[15:50:49] thanks!
[15:51:01] should also say we can definitely accommodate other schedules here. There is no problem with some hosts moved, some not.
[15:52:02] so it's totally possible for us to do a “first pass” of things that are easy to move / can take a hit, then go back and do the more delicate ones. I will send an update mail later to say that; we don't want to put any team under crazy pressure!
[15:59:14] topranks: thanks, we'd probably need some of those "this can be moved, this can't yet"
[15:59:26] We'll probably know better next week
[16:05:16] moritzm arnaudb https://phabricator.wikimedia.org/P55649 is this fixed once the rollout is done? I am checking db2157 (s5) and es2026 (es2) and they are both failing. I am checking them because they are on Icinga
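Picking up the earlier question about whether cloning might have copied eqiad dump grants onto codfw hosts: below is a minimal sketch of how one could look for suspicious grants, assuming pymysql access to the replica, placeholder credentials, and that eqiad clients are identifiable by a 10.64. host prefix; all of these are assumptions to adjust, not a description of the actual tooling.

```python
#!/usr/bin/env python3
# Sketch only: list MySQL/MariaDB accounts on a codfw replica whose Host part
# matches an assumed eqiad client prefix, to spot grants that may have been
# carried over during cloning. Host name, credentials and prefix are placeholders.
import pymysql

EQIAD_HOST_PREFIX = "10.64."  # assumed eqiad client range; verify before relying on it

conn = pymysql.connect(
    host="db2157.codfw.wmnet",  # example host from the conversation
    user="checkuser",           # placeholder credentials
    password="********",
    database="mysql",
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT User, Host FROM mysql.user ORDER BY User, Host")
        for user, host in cur.fetchall():
            if host.startswith(EQIAD_HOST_PREFIX):
                # Candidate for review/removal with DROP USER or REVOKE.
                print(f"suspect grant: '{user}'@'{host}'")
finally:
    conn.close()
```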
[16:07:46] One thing to consider is that we are doing a lot of schema changes that require switchovers, and if we do the network maint this early, we will have to switch them over again in a couple of weeks
[16:08:39] yep
[16:08:41] I am aware
[16:08:46] But what can I do :)
[16:09:17] We can convince topranks to delay it a bit :P
[16:09:29] Some of them will need to be switched over anyway, as a few days after the main master is done, the candidate master will need to go under network maintenance too
[16:09:47] fun
[16:10:19] I've already synced with arnaudb, but you two will need to work together during the week of the 12th as I am out and unreachable
[16:10:31] I'll do as many switches as I can before, but...
[16:10:36] Especially x2, which is complex
[16:10:42] I'll leave that done
[16:11:42] Amir1: it's no problem to delay at all. There is no particular rush on our side other than we're already over time with the project (but that's on us)
[16:12:25] Definitely seems like a more team/server-role-based schedule would have been better than the per-rack way we suggested
[16:12:40] absolutely yeah
[16:12:57] If you are already doing master changes for schema updates and the like, perhaps we can work around that and come up with a schedule that works better?
[16:13:18] marostegui: in a meeting now, will check tomorrow morning
[16:13:23] moritzm: thanks
[16:13:36] topranks: yeah, maybe. It is also hard to tell from our side when we are switching them (it can be many weeks)
[16:15:18] I think it is better just to say: this host is ready, this one is not (which is only true for masters, as replicas will not have any complications, we just depool them)
[16:16:13] depooling too many replicas in a section can bring down everything, but I hope we at least distributed them a bit better
[16:16:27] yeah, we should be okay with that
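On the last point, that depooling too many replicas in a section can bring everything down: a small illustrative guard one could run before depooling. The section names, replica lists and threshold are made up for the example; this is not dbctl or any existing tool.

```python
# Illustrative only: refuse a depool if it would leave a section with fewer
# pooled replicas than a chosen minimum. Sections, replica lists and the
# threshold below are invented for the example.
MIN_POOLED_PER_SECTION = 3

pooled = {
    "s5": ["db2157", "db2111", "db2123", "db2137"],  # hypothetical snapshot
    "es2": ["es2026", "es2027", "es2028"],
}

def can_depool(section: str, replica: str) -> bool:
    """True if depooling `replica` still leaves enough pooled replicas in `section`."""
    remaining = [r for r in pooled.get(section, []) if r != replica]
    return len(remaining) >= MIN_POOLED_PER_SECTION

if __name__ == "__main__":
    for section, replica in [("s5", "db2157"), ("es2", "es2026")]:
        verdict = "ok to depool" if can_depool(section, replica) else "refuse: too few left"
        print(f"{section}/{replica}: {verdict}")
```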