
Decision Request - How to do the Cloud VPS VXLAN/IPv6 migration
Closed, ResolvedPublic

Description

Problem

Per T364725: Migrate Cloud VPS instances to VXLAN based networks, we need to migrate virtual machines from the old VLAN-based subnet to the new VXLAN-based subnet (which includes IPv6).

There are, however, different ways in which this can be done, depending on a number of factors, such as:

  • how much effort we want to put into it
  • how fast we want the migration to happen
  • what level of disruption is acceptable for our users
  • how confident we are that everything will just "work" (e.g., Toolforge migrating to IPv6; there be dragons)

Constraints and risks

Migrating a virtual machine to the new network requires downtime, either:

  • a reboot with a new neutron port, or
  • a complete VM rebuild

Also, since the migration assigns a new IP address, it involves DNS changes.

Options

Option 1

Based on VM migration. Triggered by the WMCS team with no project admin intervention.

Write a script that takes a VM and 'moves' it to the new network setup.

This is T377346: openstack: develop a script to migrate a VM instance from the old network setting (vlan) to the new (vxlan, IPv6).
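For illustration, a minimal sketch of what such a script could look like, assuming openstacksdk; the cloud name, network name, and exact sequence of steps are placeholders, not the actual T377346 implementation:

```python
import openstack

# Hypothetical clouds.yaml entry and network name; illustrative only.
conn = openstack.connect(cloud="cloudvps")
NEW_NETWORK = "vxlan-dualstack"

def migrate_server(name: str) -> None:
    server = conn.compute.find_server(name, ignore_missing=False)
    new_net = conn.network.find_network(NEW_NETWORK, ignore_missing=False)

    # Create a port on the new VXLAN network and attach it to the VM.
    port = conn.network.create_port(network_id=new_net.id)
    conn.compute.create_server_interface(server, port_id=port.id)

    # Detach the old VLAN interfaces. This is where the artificial
    # downtime starts: the VM loses its old IPv4 address.
    for iface in conn.compute.server_interfaces(server):
        if iface.port_id != port.id:
            conn.compute.delete_server_interface(iface, server=server)

    # Reboot so the guest picks up the new addressing (DHCP/RA).
    conn.compute.reboot_server(server, reboot_type="SOFT")
    # DNS records, floating IPs, proxies, security groups, and any
    # in-guest configuration would still need separate handling.
```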

Pros:

  • can be effective in completing the migration somewhat "fast"

Cons:

  • crafting the script can be a costly task (in terms of engineering time)
  • it may involve introducing artificial downtime for user VMs
  • it may involve modifying the VM filesystem, which sounds scary
  • it is less "clean" compared to option 2
  • risk of introducing IPv6 without control for systems that may break if not ready

Option 2

Based on VM rebuilds. Triggered by project admins in a self-service fashion.

If a VM needs to move to the new network setup, it needs to be rebuilt. This is executed by users as self-service, via the normal workflows (e.g., horizon, tofu).

We could start with 3 network definitions in neutron, available via horizon (a sketch of the definitions follows the list):

  • VLAN/legacy
  • VXLAN/IPv4-only
  • VXLAN/IPv6-dualstack
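
For illustration, a hedged sketch of what the dual-stack definition could look like with openstacksdk; the names, CIDRs, provider attributes and IPv6 modes are placeholder assumptions, not the real Cloud VPS values:

```python
import openstack

conn = openstack.connect(cloud="cloudvps")  # hypothetical clouds.yaml entry

# Placeholder names/CIDRs; provider attributes typically require admin.
net = conn.network.create_network(
    name="vxlan-dualstack",
    provider_network_type="vxlan",
)
conn.network.create_subnet(
    network_id=net.id, name="vxlan-v4",
    ip_version=4, cidr="172.16.0.0/21",
)
conn.network.create_subnet(
    network_id=net.id, name="vxlan-v6",
    ip_version=6, cidr="2001:db8::/64",  # documentation prefix as placeholder
    ipv6_ra_mode="dhcpv6-stateless",
    ipv6_address_mode="dhcpv6-stateless",
)
```

The VXLAN/IPv4-only definition would be the same minus the IPv6 subnet.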

Then we could have a migration timeline similar to this:

  • 2024-12-01: announcement about the transition. 3 network options available in horizon
  • 2025-02-01: (2 months later) option to create VMs in VLAN/legacy is disabled in horizon. Just VXLAN/IPv4-only or VXLAN/IPv6-dualstack remain available in horizon.
  • [ .. from this point on the migration is progressing organically .. ]
  • 2025-12-01: (1 year later) we evaluate how the migration is progressing, and maybe automate some of it with a script if we need to accelerate it.
  • 2026-12-01: (2 years later) we expect no VMs in the legacy VLAN to exist. If some exist, we will evaluate what to do.
  • 20XX-XX-XX: (at some point TBD) we may want to disable the VXLAN/IPv4-only VM creation option, or keep it only for special cases upon request.

Pros:

  • no additional engineering time required from WMCS to invest in migration scripts and such
  • no artificial downtime. A project admin explicitly creates a new virtual machine via horizon. Clean.
  • the introduction of the new IPv6 is fully under the control of the project admin
  • a shiny new IPv6 address may be a good incentive for users to do the migration soon.

Cons:

  • not automated; requires project admin intervention, i.e., actions from the community
  • will delay completion of the network migration

Option 3

Mixed approach. Focus on the self-service VM rebuild approach, but create a script to handle the more complex cases.

Event Timeline

Restricted Application removed a subscriber: taavi. · View Herald Transcript · Oct 17 2024, 2:51 PM
aborrero triaged this task as Medium priority. · Oct 17 2024, 2:51 PM
aborrero moved this task from Unsorted to Network on the Cloud-VPS board.
aborrero moved this task from Backlog to Blocked/waiting on the User-aborrero board.
aborrero moved this task from Inbox to Discussion on the Cloud Services Proposals board.

As the days pass, and I keep reflecting on this ticket, I think I clearly see option 2 as the only sustainable way forward.

Changing the IP address of a VM without rebuild has so many ramifications that I don't see how we could realistically make it work.

I'm referring to changes to DNS, floating IPs, proxies, security groups, monitoring, whatever configurations users may have inside the VM, etc.

It also violates the "cattle not pets" principle.

(Automatically) renumbering VMs is already scary. Giving them v6 addresses is even scarier.

Two questions:

  • Can we add a host to the new VXLAN subnet without giving it a v6 address?
  • How plausible is it to make a self-service migration thing, letting users choose whether to re-create or to migrate?

How did we handle renumbering when we did the nova network to neutron migration? I have a vague memory of @Andrew figuring out some deep magic that involved direct manipulation of openstack's database. Does that sound familiar to anyone else?

(Automatically) renumbering VMs is already scary. Giving them v6 addresses is even scarier.

Two questions:

  • Can we add a host to the new VXLAN subnet without giving it a v6 address?

I guess this is in the realm of possible things to do. But because we had previously agreed that the migration should involve the two things, I have not tested this, and I doubt the current setup allows it.

We would need to create a new openstack network so the user has choices in the horizon VM creation panel:

  • use old VLAN network
  • use new VXLAN network (IPv4 only)
  • use new VXLAN network (IPv4/IPv6 dual stack)

Having this option may not be a bad idea for Toolforge for example.
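
If we did want a single network rather than separate neutron definitions, Neutron ports can in principle be pinned to specific subnets; an untested, hedged sketch with placeholder names (note that with SLAAC-style subnets Neutron may still auto-assign an IPv6 address, which matches the doubt above):

```python
import openstack

conn = openstack.connect(cloud="cloudvps")  # hypothetical entry

# Untested sketch: request an address only from the IPv4 subnet of a
# dual-stack network. Names are placeholders. Caveat: on subnets with
# ipv6_address_mode=slaac/dhcpv6-stateless, Neutron auto-assigns v6
# addresses to ports regardless, so this may not actually opt out.
net = conn.network.find_network("vxlan-dualstack", ignore_missing=False)
v4_subnet = conn.network.find_subnet("vxlan-v4", ignore_missing=False)
port = conn.network.create_port(
    network_id=net.id,
    fixed_ips=[{"subnet_id": v4_subnet.id}],  # no IPv6 subnet entry
)
```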

  • How plausible is it to make a self-service migration thing, letting users choose whether to re-create or to migrate?

How would you trigger it? Via horizon?

Making the migration logic self-service will make it even more complex to implement than it already is, because of permissions and such.

How did we handle renumbering when we did the nova network to neutron migration? I have a vague memory of @Andrew figuring out some deep magic that involved direct manipulation of openstack's database. Does that sound familiar to anyone else?

We used scripts:

Thanks for digging those scripts up @aborrero. If I am understanding the current options proposed here I guess this would have been the equivalent of doing your option 1, vm migration initiated by the WMCS team.

The main difference I can see this time is that the migration will be introducing an IPv6 stack and address to each instance. This may introduce new security concerns for the instance depending on how security groups and firewalls are configured. That at least is my understanding of the "risk of introducing IPv6 without control for systems that may break if not ready" con currently identified.

For option 2, the "will delay completion of the network migration" con seems potentially long lived. We fairly recently finished a Debian OS forced migration which will have created a number of new instances that folks might reasonably expect not to need to replace for two years under normal circumstances.

Thanks for digging those scripts up @aborrero. If I am understanding the current options proposed here I guess this would have been the equivalent of doing your option 1, vm migration initiated by the WMCS team.

The main difference I can see this time is that the migration will be introducing an IPv6 stack and address to each instance. This may introduce new security concerns for the instance depending on how security groups and firewalls are configured. That at least is my understanding of the "risk of introducing IPv6 without control for systems that may break if not ready" con currently identified.

For option 2, the "will delay completion of the network migration" con seems potentially long lived. We fairly recently finished a Debian OS forced migration which will have created a number of new instances that folks might reasonably expect not to need to replace for two years under normal circumstances.

I believe all your assessments are correct.

Regarding migration time, 2 years is something I personally can live with, especially because it frees resources on our side (engineering time), so we can focus on other stuff while the transition happens "on its own".

So, taking into account the previous 3 horizon choices from https://phabricator.wikimedia.org/T377467#10251090 this could be a timeline:

  • 2024-12-01: announcement about the transition. 3 options available in horizon: VLAN/legacy, VXLAN/IPv4-only, VXLAN/IPv6-dualstack
  • 2025-02-01: (2 months later) option to create VMs in VLAN/legacy is disabled in horizon. Just VXLAN/IPv4-only or VXLAN/IPv6-dualstack remain available in horizon.
  • [ .. from this point on the migration is progressing organically .. ]
  • 2026-12-01: (2 years later) we expect no VMs in the legacy VLAN to exist. If some exist, we will evaluate what to do.
  • 20XX-XX-XX: (at some point TBD) we may want to disable the VXLAN/IPv4-only VM creation option, or keep it only for special cases upon request.

Because VLAN/legacy and VXLAN/IPv4-only are pretty much compatible at all levels, no special changes should be needed in the migration beyond a VM rebuild. In particular, a VM in VLAN/legacy should be able to operate with one in VXLAN/IPv4-only without changes anywhere (or very few changes, for example if the subnet CIDR was hardcoded somewhere).

As a side benefit, the 2-year timeline allows us to work on Toolforge Kubernetes IPv6 support; there could be dragons.

bd808 wrote:

The main difference I can see this time is that the migration will be introducing an IPv6 stack and address to each instance. This may introduce new security concerns for the instance depending on how security groups and firewalls are configured. That at least is my understanding of the "risk of introducing IPv6 without control for systems that may break if not ready"

Would it be possible to (centrally / scripted) create a new security group for all projects that does nothing but drop all IPv6 traffic, then assign the new v6 IPs to instances and then tell users they should decide themselves when to edit or disable this group?

Since there is no IPv6 now, we should be able to assume that dropping all IPv6 traffic can't break anything existing?

Would it be possible to (centrally / scripted) create a new security group for all projects that does nothing but drop all IPv6 traffic, then assign the new v6 IPs to instances and then tell users they should decide themselves when to edit or disable this group?

Since there is no IPv6 now, we should be able to assume that dropping all IPv6 traffic can't break anything existing?

Current IPv6 firewalling semantics are here: T374714: openstack: clarify IPv6 firewalling in particular https://phabricator.wikimedia.org/T374714#10233969 which may, at least partially, match your suggestion.

Would it be possible to (centrally / scripted) create a new security group for all projects that does nothing but drop all IPv6 traffic, then assign the new v6 IPs to instances and then tell users they should decide themselves when to edit or disable this group?

Since there is no IPv6 now, we should be able to assume that dropping all IPv6 traffic can't break anything existing?

I think a problem there might be that most operating systems, if they have a global v6 address, will try to use that by default before falling back to IPv4. For web browsers that fallback happens very quickly (thanks to the Happy Eyeballs RFC), but for other things the delay may not be at all desirable.

So giving hosts IPv6 addresses, but blocking their v6 access upstream, may not work out so cleanly.
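
As a small illustration of that default preference (not Cloud VPS specific): getaddrinfo() returns candidate addresses in the order the OS would try them, per the RFC 6724 rules, so on a host with a global IPv6 address the AAAA results typically sort first:

```python
import socket

# On a dual-stack host, IPv6 results usually come back first; plain
# clients without Happy Eyeballs will try them in this order.
for family, _, _, _, sockaddr in socket.getaddrinfo(
        "wikipedia.org", 443, proto=socket.IPPROTO_TCP):
    print("IPv6" if family == socket.AF_INET6 else "IPv4", sockaddr[0])
```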

In my view, a default security group that allows outbound traffic but drops any connections initiated from outside is an acceptable approach. Instances already have the same outbound access over IPv4, so I don't think it shifts the security posture much.
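
For what it's worth, a sketch of that posture with openstacksdk (names are placeholders). Since Neutron security groups are allow-lists, "outbound yes, inbound no" is expressed by the absence of ingress rules; a freshly created group already carries allow-all egress rules for both IPv4 and IPv6:

```python
import openstack

conn = openstack.connect(cloud="cloudvps")  # hypothetical entry

# A new (non-default) security group starts with allow-all egress for
# IPv4 and IPv6 and no ingress rules, i.e. exactly the posture above.
sg = conn.network.create_security_group(
    name="v6-egress-only",  # placeholder name
    description="Allow outbound, drop inbound connections",
)
for rule in conn.network.security_group_rules(security_group_id=sg.id):
    print(rule.direction, rule.ether_type)  # expect only 'egress' rules
```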

I don't have a strong opinion, mostly questions xd

From what I understand, trying to summarize what I see in the task (and my pre-conceptions):

how much effort we want to put into it

This will depend I guess on the benefits behind it.

  • What do we get out of having people running on dual stack?
  • What benefits do people get out of being on IPv6? As in, how would you sell the idea of moving to IPv6 to someone that has two VMs with two versions of a web service + DB? (my guess at the most common user profile xd)
how fast we want the migration to happen

We have tickets that were planned for <1 year that took >10 years, so I'd still be somewhat generous when estimating the total time it will take.
No hurry; the pain here will come from trying to maintain the hybrid solution until we get people migrated.

  • In that sense, any estimate of how much effort that is? It seems it would not be too much?
what level of disruption is acceptable for our users

As little as possible (especially after they just went through an OS upgrade round). I guess that we could pair it with the next fleet-wide OS upgrades?

how confident we are that everything will just "work" (e.g., Toolforge migrating to IPv6; there be dragons)

Little, as it's expected that some things will not work as expected / will require changes that we don't currently know about.

So I see some cases depending on the answers above:

  • High maintenance cost/high benefit for us, high benefit for users -> best case, Option 2 with a dedicated effort on our side to help users migrate, pushing towards Option 3 even
  • High maintenance cost/high benefit for us, low benefit for users -> Option 1, probably 3 with the few users that want to do the migration
  • High maintenance cost/low benefit for us, high benefit for users -> Option 2
  • High maintenance cost/low benefit for us, low benefit for users -> Option 3
  • Low maintenance cost/high benefit for us, high benefit for users -> Option 2
  • Low maintenance cost/high benefit for us, low benefit for users -> Option 3
  • Low maintenance cost/low benefit for us, high benefit for users -> Option 2
  • Low maintenance cost/low benefit for us, low benefit for users -> Option 2, probably option 3 eventually (as there would be no reason for users to migrate)

That seems to simplify as:

  • If it has low benefit/appeal for users, Option 3
  • Otherwise Option 2

Thanks for chiming in, some replies inline.

how much effort we want to put into it

This will depend I guess on the benefits behind it.

  • What do we get out of having people running on dual stack?
  • What benefits do people get out of being on IPv6? As in, how would you sell the idea of moving to IPv6 to someone that has two VMs with two versions of a web service + DB? (my guess at the most common user profile xd)

There is definitely a list of benefits of using native IPv6, from the point of view of 'internet citizens' and/or generic network users, that I won't replicate here.

For Cloud VPS in particular, I see at least the following:

  • we advance in the old, long-running goal of better identifying cloud users interacting with the wikis. There are no NATs involved, so there is better granularity or visibility on the production side regarding the "origin" of a given connection. This will be especially relevant when we start doing IPv6 inside Toolforge kubernetes.
  • the above point affects things like rate limits and other stuff that we have seen in the past.
  • we can rethink how we do floating IPs. If you want to expose a service to the internet, you can just use your IPv6.
  • we can rethink how we do things like bastions, proxies and other shared services that may or may not be the same if IPv6 is involved. This could potentially mean a simplification of some of the things we do.

I think at least some Cloud VPS users will see, or be directly interested in [some of] these benefits as well.

how fast we want the migration to happen

We have tickets that were planned for <1 year that took >10 years, so I'd still be somewhat generous when estimating the total time it will take.
No hurry; the pain here will come from trying to maintain the hybrid solution until we get people migrated.

  • In that sense, any estimate of how much effort that is? It seems it would not be too much?

I agree. In https://phabricator.wikimedia.org/T377467#10253404 I shared a potential 2-year timeline. I'm comfortable with that timeframe, and I'm hoping that you all are as well.

Regarding the effort for the hybrid solution: there is nothing especially complex to craft in order to support it.
I don't think it will be a big deal overall, because it is mostly about neutron definitions.

That seems to simplify as:

  • If it has low benefit/appeal for users, Option 3
  • Otherwise Option 2

I think we could go with a plan/timeline similar to https://phabricator.wikimedia.org/T377467#10253404 and then re-evaluate in the middle of the migration period, to see if we need to "push forward" the migration a bit.

2025-02-01: (2 months later) option to create VMs in VLAN/legacy is disabled in horizon. Just VXLAN/IPv4-only or VXLAN/IPv6-dualstack remain available in horizon.

Why do we want to continue to offer the option of creating VMs without v6 connectivity?

2025-02-01: (2 months later) option to create VMs in VLAN/legacy is disabled in horizon. Just VXLAN/IPv4-only or VXLAN/IPv6-dualstack remain available in horizon.

Why do we want to continue to offer the option of creating VMs without v6 connectivity?

I anticipate some projects may benefit from not having an IPv6 address at all until they are ready for it.

An example could be Toolforge. There could be others.

We could, of course, set a timeline for when VXLAN/IPv4-only support will be removed. But I don't really care about that one, because at that point the VMs are already on the VXLAN subnet, which is also an important point here.

how much effort we want to put into it

This will depend I guess on the benefits behind it.

  • What do we get out of having people running on dual stack?
  • What benefits do people get out of being on IPv6? As in, how would you sell the idea of moving to IPv6 to someone that has two VMs with two versions of a web service + DB? (my guess at the most common user profile xd)

There is definitely a list of benefits of using native IPv6, from the point of view of 'internet citizens' and/or generic network users, that I won't replicate here.

For Cloud VPS in particular, I see at least the following:

  • we advance in the old, long-running goal of better identifying cloud users interacting with the wikis. There are no NATs involved, so there is better granularity or visibility on the production side regarding the "origin" of a given connection. This will be especially relevant when we start doing IPv6 inside Toolforge kubernetes.
  • the above point affects things like rate limits and other stuff that we have seen in the past.
  • we can rethink how we do floating IPs. If you want to expose a service to the internet, you can just use your IPv6.
  • we can rethink how we do things like bastions, proxies and other shared services that may or may not be the same if IPv6 is involved. This could potentially mean a simplification of some of the things we do.

I think at least some Cloud VPS users will see, or be directly interested in [some of] these benefits as well.

I think we could go with a plan/timeline similar to https://phabricator.wikimedia.org/T377467#10253404 and then re-evaluate in the middle of the migration period, to see if we need to "push forward" the migration a bit.

I don't see any appealing benefit for most users among any of the benefits there, so that approach sounds good to me.

As someone using cloud VPS with puppet, the one major advantage for users like me would be that puppet roles could "just work" like they do in production.

So we don't have to keep introducing code to do things differently or skip things (aka the frowned-upon "if $realm" checks) just because they are in cloud and there is no IPv6. It's quite common that things initially don't work because of this difference between prod and a test environment, and everything that is "different in cloud anyway" adds a little bit of friction or possible frustration.

  • we can rethink how we do floating IPs. If you want to expose a service to the internet, you can just use your IPv6.

I think this one could also be seen as a con from the point of view of end-user privacy protections if it becomes trivial to bypass the shared HTTPS reverse proxy. That proxy hides the visiting user's IPv4 from the upstream HTTP service. There are a small number of projects that we have allowed to see the visitor's true IPv4, but that is also an auditable set of projects.

I am generally excited to see IPv6 moving forward and I do think there are good reasons to have it, but most of them feel like good reasons from the point of view of providing better granularity in tracking which project/instance traffic originated from.

There are a small number of projects that we have allowed to see the visitor's true IPv4, but that is also an auditable set of projects.

For a couple of them, we would want their true IPv6 if that is where the traffic originates from.

  • we can rethink how we do floating IPs. If you want to expose a service to the internet, you can just use your IPv6.

I think this one could also be seen as a con from the point of view of end-user privacy protections if it becomes trivial to bypass the shared HTTPS reverse proxy. That proxy hides the visiting user's IPv4 from the upstream HTTP service. There are a small number of projects that we have allowed to see the visitor's true IPv4, but that is also an auditable set of projects.

The simplest solution that comes to mind, if we wanted to force standard HTTP (tcp/80) and HTTPS (tcp/443) ingress IPv6 traffic to only be allowed via the shared nova proxy, is to set policies at the network edge.
We would need to add IPv6 support to the shared nova proxy, but I don't think that would be a big deal.

This would not prevent someone from running an HTTP server on, for example, IPv6 tcp/8080, but our TOU still apply anyway.

I think there is some agreement on option 2 being the best course of action.

I'll leave the ticket open a few more days in case there are any last-minute comments.