[15:29:51] marostegui: so does T350152 require that everything in rows A & B be renumbered?
[15:29:52] T350152: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152
[15:30:53] urandom: that's my understanding, but topranks probably knows better
[15:31:17] You mean changing IPs right?
[15:31:23] If so, that's my understanding
[15:32:48] yeah
[15:33:36] urandom: yeah unfortunately
[15:34:01] we are going to move the servers initially to the new network devices and leave them on the existing vlans, i.e. no renumbering
[15:34:09] so we can decom the old gear in good time
[15:34:41] but ultimately we will need to renumber all the servers to move from a row to a rack availability model
[15:34:58] isn't that going to be really disruptive?
[15:35:35] it is
[15:35:58] in the end we'll need to do that for all the rows given we're moving from a row-level to a rack-level network
[15:36:14] but it's not that disruptive with a bit of automation
[15:36:23] fqdns will stay the same
[15:36:35] yeah - that task is one of the bits we are working on to make it easier
[15:37:22] volans: what do you mean fqdns will be the same? the records will be migrated, won't they?
[15:37:39] The hostnames won't change, but what they point to will
[15:37:43] right
[15:37:58] so if something is configured to talk to "server1234.codfw.wmnet" that will work the same
[15:38:00] yeah I meant that hostnames and FQDNs will not change :D
[15:39:29] I'm in the market for reviews of https://gerrit.wikimedia.org/r/c/operations/puppet/+/989529 and https://gerrit.wikimedia.org/r/c/labs/private/+/989531 to make a new swift user if someone would like to 👀 please?
[15:39:31] ultimately it gets us to a more stable, scalable and simpler network
[15:39:31] we'll have to test Cassandra; back in the day, it used the IP address as the canonical way of identifying a node
[15:39:52] swift rings are IP-based, too.
[15:40:29] urandom: I actually need to catch up with you on the requirements for Cassandra instances in general - relating to T346428
[15:40:29] that was supposed to have been fixed, but there were a *lot* of things that made assumptions there, and I'm not at all confident it's been tested well (since it's just not something people would change often)
[15:40:29] T346428: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428
[15:41:32] urandom, Emperor: thanks for the info on those
[15:41:43] and the swift zones (that ensure replication &c don't end up in the same rack) likewise
[15:41:46] we did expect that some things would require additional work aside from the raw IP changes, so we need to step cautiously
[15:41:50] of course, I can imagine that depending on the timing of automation, and the length of ttls, this could result in a partition even if it otherwise "works"
[15:41:58] a partition, even if it's transient
[15:43:03] maybe it would help if I create phab tasks for each of these node types (Swift / Cassandra), and we can discuss the mechanics / what might be involved?
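To make the TTL concern above concrete, here is a small sketch of a pre-flight check, assuming standard dig behaviour; server1234.codfw.wmnet is the example name from the discussion, and the actual record values depend on the zone:

    # The second field of dig's answer line is the remaining TTL in seconds;
    # that is roughly how long clients may keep resolving (and connecting to)
    # the old IP after the record is updated.
    dig +noall +answer server1234.codfw.wmnet A
    # Lowering the TTL ahead of the change shrinks the window in which some
    # clients see the old address while others see the new one -- the
    # transient partition described above.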
[15:43:36] yeah, we probably need a ticket
[15:43:56] topranks: +1
[15:44:00] from my point of total ignorance about those systems I think the best way would be to downtime the host, depool, renumber/reconfigure, then bring it back up
[15:44:08] ok I'll do that shortly
[15:44:39] topranks: I suspect for swift nodes they'll need to be drained (which takes O(weeks)), removed from the rings, re-added at the new IP, and then re-loaded (likewise taking O(weeks))
[15:45:07] yeah that's not surprising - we will need to work out a schedule
[15:45:15] as a disinterested onlooker, I would be confident that it will Just Work™, but as someone who will have to pick up the pieces if it does not, I'm terrified and want to liberally test first :)
[15:45:34] yes absolutely
[15:45:37] If we were doing that anyway, we might want to also fix the disk layout while we were at it
[15:45:56] * Emperor passes topranks some quite hairy yaks
[15:46:01] If it can't be done, the process would be the same as Emperor notes for Swift (minus the O(weeks) part)
[15:46:02] certainly might be a good opportunity to bundle any other disruptive things you'd like to change
[15:46:51] first step for us is to move the physical servers to the new top-of-rack switches
[15:47:11] after which we need to consider every type of node and what's involved in renumbering
[15:49:49] swift-ring-builder has set_info, which _might_ let us change IPs on the fly...
[15:50:11] (but is a bit fear-inducing)
[15:51:02] Emperor: should we plan to migrate the ring to use some other identifier maybe? could we keep the old IP as identifier?
[15:51:26] FWIW I don't think the change can be done without some disruption to network connectivity
[15:51:38] it seems strange to me that there isn't a way to re-map a host from one identifier to another without the full drain+reload
[15:52:53] topranks: for databases it's going to be fun... as we also need to change dbctl, which has IPs
[15:53:03] :(
[15:53:06] So yeah +1 to have a ticket
[15:53:33] I'll make a master ticket and we can add child ones for each type of node
[15:53:34] volans: the use of IP as identifier in the rings is a swift thing, not a WMF-swift thing
[15:53:55] So far I have cassandra, swift and databases; we're already working with service ops on kubernetes, so I'll link that too
[15:57:00] topranks: for DBs also add the grants, which are per-IP ;)
[15:57:07] or per-"subnet"
[15:57:34] volans: good shout, will do
[15:57:45] marostegui: for dbctl I don't think it will be a problem, just an additional step
[15:58:10] depool, renumber, update dbctl (which updates mediawiki), repool with the new IP
[15:58:35] with renumber being whatever the usual steps for renumbering will be
[15:58:58] volans: Yes, we'll need to switch masters and all that before
[15:59:08] To promote hosts which are already migrated
[15:59:38] sure
[16:16:54] also server-ids used to depend on ips
[16:16:59] which affects binlogs
[16:17:13] not sure if it still does
[16:24:03] That's at least fixed with the puppet run
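A rough sketch of the depool / renumber / repool sequence volans outlines above, assuming dbctl's instance depool/pool and config commit subcommands; db1234 is a hypothetical host, and the renumbering itself (Netbox, DNS, switch config) is elided:

    # Take the replica out of rotation and push the change to MediaWiki:
    dbctl instance db1234 depool
    dbctl config commit -m "Depool db1234 ahead of renumbering"

    # ... renumber the host here ("whatever the usual steps for renumbering will be") ...

    # Update the instance's stored address (an edit subcommand is assumed
    # here as the way to change it), then repool:
    dbctl instance db1234 edit
    dbctl instance db1234 pool
    dbctl config commit -m "Repool db1234 on its new IP"

As noted in the discussion, masters would first have to be switched over to already-migrated hosts, and the per-IP grants updated alongside.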