[15:29:51] marostegui: so does T350152 require that everything in rows A & B be renumbered?
[15:29:52] T350152: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152
[15:30:53] urandom: that's my understanding, but topranks probably knows better
[15:31:17] You mean changing IPs right?
[15:31:23] If so, that's my understanding
[15:32:48] yeah
[15:33:36] urandom: yeah unfortunately
[15:34:01] we are going to move the servers initially to the new network devices and leave them on the existing vlans, i.e. no renumbering
[15:34:09] so we can decom the old gear in good time
[15:34:41] but ultimately we will need to renumber all the servers to move from a row to a rack availability model
[15:34:58] isn't that going to be really disruptive?
[15:35:35] it is
[15:35:58] in the end we'll need to do that for all the rows given we're moving from a row-level to a rack-level network
[15:36:14] but it's not that disruptive with a bit of automation
[15:36:23] fqdns will stay the same
[15:36:35] yeah - that task is one of the bits we are working on to make it easier
[15:37:22] volans: what do you mean fqdns will be the same? the records will be migrated, won't they?
[15:37:39] The hostnames won't change, but what they point to will
[15:37:43] right
[15:37:58] so if something is configured to talk to "server1234.codfw.wmnet" that will work the same
[15:38:00] yeah I meant that hostnames and FQDNs will not change :D
[15:39:29] I'm in the market for reviews of https://gerrit.wikimedia.org/r/c/operations/puppet/+/989529 and https://gerrit.wikimedia.org/r/c/labs/private/+/989531 to make a new swift user if someone would like to 👀 please?
[15:39:31] ultimately it gets us to a more stable, scalable and simpler network
[15:39:31] we'll have to test Cassandra; back in the day, it used the IP address as the canonical way of identifying a node
[15:39:52] swift rings are IP-based, too.
[15:40:29] urandom: I actually need to catch up with you on the requirements for Cassandra instances in general - relating to T346428
[15:40:29] that was supposed to have been fixed, but there were a *lot* of things that made assumptions there, and I'm not at all confident it's been tested well (since it's just not something people would change often)
[15:40:29] T346428: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428
[15:41:32] urandom, Emperor: thanks for the info on those
[15:41:43] and the swift zones (that ensure replication &c don't end up in the same rack) likewise
[15:41:46] we did expect that some things would require additional work aside from the raw IP changes, so we need to step cautiously
[15:41:50] of course, I can imagine that depending on the timing of automation, and the length of ttls, this could result in a partition even if it otherwise "works"
[15:41:58] a partition, even if it's transient
[15:43:03] maybe it would help if I create phab tasks for each of these node types (Swift / Cassandra), and we can discuss the mechanics / what might be involved?
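To make the TTL concern above concrete, here is a small sketch of a pre-flight check, assuming standard dig behaviour; server1234.codfw.wmnet is the example name from the discussion, and the actual record values depend on the zone:

    # The second field of dig's answer line is the remaining TTL in seconds;
    # that is roughly how long clients may keep resolving (and connecting to)
    # the old IP after the record is updated.
    dig +noall +answer server1234.codfw.wmnet A
    # Lowering the TTL ahead of the change shrinks the window in which some
    # clients see the old address while others see the new one -- the
    # transient partition described above.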
[15:43:36] yeah, we probably need a ticket
[15:43:56] topranks: +1
[15:44:00] from my point of total ignorance about those systems I think the best way would be to downtime the host, depool, renumber/reconfigure, then bring it back up
[15:44:08] ok I'll do that shortly
[15:44:39] topranks: I suspect for swift nodes they'll need to be drained (which takes O(weeks)), removed from the rings, re-added at the new IP, and then re-loaded (likewise taking O(weeks))
[15:45:07] yeah that's not surprising - we will need to work out a schedule
[15:45:15] as a disinterested onlooker, I would be confident that it will Just Work™, but as someone who will have to pick up the pieces if it does not, I'm terrified and want to liberally test first :)
[15:45:34] yes absolutely
[15:45:37] If we were doing that anyway, we might want to also fix the disk layout while we were at it
[15:45:56] * Emperor passes topranks some quite hairy yaks
[15:46:01] If it can't be done, the process would be the same as Emperor notes for Swift (minus the O(weeks) part)
[15:46:02] certainly might be a good opportunity to bundle any other disruptive things you'd like to change
[15:46:51] first step for us is to move the physical servers to the new top-of-rack switches
[15:47:11] after which we need to consider every type of node and what's involved in renumbering
[15:49:49] swift-ring-builder has set_info, which _might_ let us change IPs on the fly...
[15:50:11] (but is a bit fear-inducing)
[15:51:02] Emperor: should we plan to migrate the ring to use some other identifier maybe? could we keep the old IP as identifier?
[15:51:26] FWIW I don't think the change can be done without some disruption to network connectivity
[15:51:38] it seems strange to me that there isn't a way to re-map a host from one identifier to another without the full drain+reload
[15:52:53] topranks: for databases it's going to be fun... as we also need to change dbctl, which has IPs
[15:53:03] :(
[15:53:06] So yeah +1 to have a ticket
[15:53:33] I'll make a master ticket and we can add child ones for each type of node
[15:53:34] volans: the use of IP as identifier in the rings is a swift thing, not a WMF-swift thing
[15:53:55] So far I have cassandra, swift and databases; we're already working with service ops on kubernetes, so I'll link that too
[15:57:00] topranks: for DBs also add the grants, which are per-IP ;)
[15:57:07] or per-"subnet"
[15:57:34] volans: good shout, will do
[15:57:45] marostegui: for dbctl I don't think it will be a problem, just an additional step
[15:58:10] depool, renumber, update dbctl (which updates mediawiki), repool with the new IP
[15:58:35] with renumber being whatever the usual steps for renumbering will be
[15:58:58] volans: Yes, we'll need to switch masters and all that before
[15:59:08] To promote hosts which are already migrated
[15:59:38] sure
[16:16:54] also server-ids used to depend on ips
[16:16:59] which affects binlogs
[16:17:13] not sure if it still does
[16:24:03] That's at least fixed with the puppet run
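A rough sketch of the depool / renumber / repool sequence volans outlines above, assuming dbctl's instance depool/pool and config commit subcommands; db1234 is a hypothetical host, and the renumbering itself (Netbox, DNS, switch config) is elided:

    # Take the replica out of rotation and push the change to MediaWiki:
    dbctl instance db1234 depool
    dbctl config commit -m "Depool db1234 ahead of renumbering"

    # ... renumber the host here ("whatever the usual steps for renumbering will be") ...

    # Update the instance's stored address (an edit subcommand is assumed
    # here as the way to change it), then repool:
    dbctl instance db1234 edit
    dbctl instance db1234 pool
    dbctl config commit -m "Repool db1234 on its new IP"

As noted in the discussion, masters would first have to be switched over to already-migrated hosts, and the per-IP grants updated alongside.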