CloudVPS: IPv6 early PoC
Open, Medium, Public

Description

I would like to conduct an early IPv6 PoC in codfw for CloudVPS.

The PoC would consist of:

The first thing I need is an IPv6 range allocation for OpenStack @ codfw1dev (cc @ayounsi)

Event Timeline

bd808 triaged this task as Medium priority. Feb 25 2020, 5:05 PM
bd808 moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

The steps I have in mind here are:
1/ Set up v6 on the transport network (most likely 2620:0:860:fe0a::/64)
2/ Assign a /48 for cloud codfw, see T187929 (Here we won't go into its subnetting)
3/ Advertise that /48 to the world, either from codfw only or from both eqiad and codfw (TBD, the former is preferred)
3a/ Shrink the esams advertisement from 2a02:ec80::/32 to 2a02:ec80:500::/48
4/ Set up a basic firewall filter (cloud-in6)
My suggestion is to initially allow only traffic to the internet and maybe some selected "private" hosts needed for the testing. This is to minimize the workload of creating and managing that filter during and after the PoC.

5/ Statically route the whole /48 to the neutron gateway VIP (go live)
This would stay as is for now, for simplicity, but will most likely change down the road (for example, if the support network gets moved to that prefix but on a different network)
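
To make the prefix arithmetic behind step 3a concrete, here is a minimal Python sketch using only the standard `ipaddress` module. The two prefixes are the ones quoted above; the helper name `overlaps_esams` is hypothetical and only there for illustration. It is a sanity check on the numbers, not a description of how the routers are configured.

```python
import ipaddress

# Prefixes quoted in step 3a.
current_esams = ipaddress.ip_network("2a02:ec80::/32")
shrunk_esams = ipaddress.ip_network("2a02:ec80:500::/48")

# The narrower announcement has to sit inside the aggregate it replaces.
is_inside = shrunk_esams.subnet_of(current_esams)

# Number of /48 blocks contained in the /32.
total_48s = 2 ** (shrunk_esams.prefixlen - current_esams.prefixlen)


def overlaps_esams(candidate: str) -> bool:
    """Hypothetical helper: would this prefix collide with the shrunk esams /48?"""
    return ipaddress.ip_network(candidate).overlaps(shrunk_esams)


print(is_inside, total_48s)
```

Any /48 picked for cloud codfw in step 2 could be passed through `overlaps_esams` before it is announced.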

I assume that if the PoC is not successful, we would roll back the configuration and the assignments would stay reserved for future use.
If successful, we would need to start managing the firewall filter (like we currently do with v4). Does that mean the v4 exceptions will go away and be replaced by the v6 ones, or will both stay live?
Is there anything else that will change in that respect if we keep maintaining v6 in the long run?

I mostly agree with everything; some inline comments follow.

The steps I have in mind here are:
1/ Set up v6 on the transport network (most likely 2620:0:860:fe0a::/64)

Is that range definitive? If not, could we pick one that would likely be used in the future too? Per my comment T187929#5984401, I suggest we use 2a02:ec80:1:0::/64.

2/ Assign a /48 for cloud codfw, see T187929 (Here we won't go into its subnetting)

OK. Similarly, I suggest we use 2a02:ec80:1:1::/64 (see T187929#5984401)
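
As a quick check on how the two suggested /64s relate, the sketch below uses Python's standard `ipaddress` module. The covering /48 is not stated anywhere in this thread; it is derived here purely by arithmetic from the two ranges and should be treated as an assumption until the addressing plan in T187929 is settled.

```python
import ipaddress
from itertools import islice

# The two /64s suggested in this comment.
transport = ipaddress.ip_network("2a02:ec80:1:0::/64")
cloud = ipaddress.ip_network("2a02:ec80:1:1::/64")

# Assumption: the /48 that would contain both, computed rather than quoted.
covering = transport.supernet(new_prefix=48)
same_parent = covering == cloud.supernet(new_prefix=48)

# The two ranges are adjacent and do not overlap.
disjoint = not transport.overlaps(cloud)

# They are the first two /64s inside that /48.
first_two = list(islice(covering.subnets(new_prefix=64), 2))

print(covering, same_parent, disjoint)
```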

3/ Advertise that /48 to the world, either from codfw only or from both eqiad and codfw (TBD, the former is preferred)
4/ Set up a basic firewall filter (cloud-in6)
My suggestion is to initially allow only traffic to the internet and maybe some selected "private" hosts needed for the testing. This is to minimize the workload of creating and managing that filter during and after the PoC.

I agree. I suggest we do this firewalling incrementally. Ideally we would implement some stateful firewalling and only allow connections initiated from within the cloud, but I'm aware this is not possible using the prod core routers as upstream.
For the initial PoC, we only need 3 addresses allowed:

  • 2620:0:860:3:208:80:153:76 (cloudservices2002-dev.wikimedia.org)
  • 2620:0:860:2:208:80:153:59 (cloudcontrol2001-dev.wikimedia.org)
  • 2620:0:860:3:208:80:153:75 (cloudcontrol2003-dev.wikimedia.org)

i.e., connections between the cloud and these physical hosts.
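
The intent of that allowlist can be modelled in a few lines of Python. This is only a sketch of the matching logic, not router filter syntax: the addresses and hostnames are the three listed above, while `ALLOWED_HOSTS` and `is_allowed` are hypothetical names introduced for illustration.

```python
import ipaddress

# The three physical hosts listed above, keyed by parsed IPv6 address.
ALLOWED_HOSTS = {
    ipaddress.ip_address("2620:0:860:3:208:80:153:76"): "cloudservices2002-dev.wikimedia.org",
    ipaddress.ip_address("2620:0:860:2:208:80:153:59"): "cloudcontrol2001-dev.wikimedia.org",
    ipaddress.ip_address("2620:0:860:3:208:80:153:75"): "cloudcontrol2003-dev.wikimedia.org",
}


def is_allowed(peer: str) -> bool:
    """Hypothetical check: is this peer one of the allowed private hosts?

    Parsing the string first means different textual spellings of the
    same address (for example with leading zeros) compare equal.
    """
    return ipaddress.ip_address(peer) in ALLOWED_HOSTS


print(is_allowed("2620:0:860:3:208:80:153:76"))
```

Comparing parsed addresses rather than strings avoids false mismatches between spellings such as `208` and `0208` in a group.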

5/ Statically route the whole /48 to the neutron gateway VIP (go live)
This would stay as is for now, for simplicity, but will most likely change down the road (for example, if the support network gets moved to that prefix but on a different network)

I assume that if the PoC is not successful, we would roll back the configuration and the assignments would stay reserved for future use.

OK, that's why I think it may be important to agree on the addressing plan first: T187929: Cloud IPv6 subnets

If successful, we would need to start managing the firewall filter (like we currently do with v4). Does that mean the v4 exceptions will go away and be replaced by the v6 ones, or will both stay live?

Initially both will stay live. I'm not sure what our next step would be from the network point of view:

  • we get rid of our NAT dmz_cidr mechanism --> the v4 firewall needs adjustment in the prod core routers.
  • we introduce a pair of cloud upstream routers/firewalls between neutron and the prod core routers --> we can drop all the firewalling from the prod core routers.

Is there anything else that will change in that respect if we keep maintaining v6 in the long run?

Other than what I have already commented, I don't have any other concerns right now. Things may change in the near future once we really start playing with the PoC and learn what this looks like for real.