
Cloud IPv6 subnets
Open, Stalled, Medium, Public

Description

Follow up from T184209

Looking at codfw (but it's similar in eqiad)

We currently use the following IPv6 for the labs cloud ranges:
ae2.2122 - labs-support1-b-codfw - 2620:0:860:122::/64
ae2.2118 - labs-hosts1-b-codfw - 2620:0:860:118::/64
ae2.2120 - labs-instance-transport1-b-codfw - 2620:0:860:120::/64

The reason was probably including part of the vlan ID in the IP.
But this falls into the larger subnet 2620:0:860:100::/56 - codfw private

It's not an issue right now, especially as cloud doesn't use much IPv6, but might be an issue in the future.
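The overlap can be checked with a minimal sketch using Python's standard ipaddress module. The prefixes are the ones listed above; the variable names are only illustrative.

```python
import ipaddress

# codfw private aggregate and the three cloud /64s quoted above
codfw_private = ipaddress.ip_network("2620:0:860:100::/56")
cloud_subnets = {
    "labs-support1-b-codfw": ipaddress.ip_network("2620:0:860:122::/64"),
    "labs-hosts1-b-codfw": ipaddress.ip_network("2620:0:860:118::/64"),
    "labs-instance-transport1-b-codfw": ipaddress.ip_network("2620:0:860:120::/64"),
}

# subnet_of() is True when a network lies entirely inside another one
overlaps = {name: net.subnet_of(codfw_private) for name, net in cloud_subnets.items()}
for name, inside in overlaps.items():
    print(f"{name}: inside codfw private = {inside}")
```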

I see 2 options:
1/ use a different /56
For example:

2620:0:860:200::/56  - labs-codfw
2620:0:861:200::/56  - labs-eqiad

2/ use dedicated /48s

2a02:ec80:0::/44 - labs (16 * /48) (can be shrunk to a /45)
    2a02:ec80:0::/48 - labs eqiad
        XXXX
    2a02:ec80:1::/48 - labs codfw
        2a02:ec80:1:2122::/64 - 2122 - labs-support1-b-codfw  (84A)
        2a02:ec80:1:2118::/64 - 2118 - labs-hosts1-b-codfw  (846)
        2a02:ec80:1:2120::/64 - 2120 - labs-instance-transport1-b-codfw  (848)

Having the vlanID in decimal in the IP makes it easier to understand, but we can also use the hex value (2122->84A) so it's more accurate.
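To illustrate the two encodings, here is a rough sketch that builds the /64 for a VLAN under a /48. The helper name vlan_subnet and its as_hex flag are hypothetical, invented for this example only.

```python
import ipaddress

def vlan_subnet(site_prefix: str, vlan_id: int, as_hex: bool) -> ipaddress.IPv6Network:
    """Return the /64 under a /48 whose fourth hextet encodes the VLAN ID.

    as_hex=False writes the decimal digits verbatim (2122 -> :2122:),
    as_hex=True writes the real hexadecimal value (2122 -> :84a:).
    """
    site = ipaddress.ip_network(site_prefix)
    if site.prefixlen != 48:
        raise ValueError("expected a /48 site prefix")
    hextet = vlan_id if as_hex else int(str(vlan_id), 16)
    if not 0 <= hextet <= 0xFFFF:
        raise ValueError("VLAN ID does not fit in one hextet")
    # The fourth hextet occupies bits 64..79, counted from the low end.
    base = int(site.network_address) | (hextet << 64)
    return ipaddress.IPv6Network((base, 64))

print(vlan_subnet("2a02:ec80:1::/48", 2122, as_hex=False))  # decimal digits
print(vlan_subnet("2a02:ec80:1::/48", 2122, as_hex=True))   # hex value
```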

1/ is more of a short term solution while 2/ will require more work (advertise new /48s to the world) but is the most sustainable option.

Event Timeline

ayounsi triaged this task as Medium priority. Feb 21 2018, 7:21 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

I like option 2) the most. Are those ranges actual data?

Regarding coding the vlan id: I don't think we should do it. We might eventually move away from the prod VLAN thing, or have addresses where the VLAN part is meaningless (think of virtual networking inside the cloud itself, like VMs or virtual routers).

With all this in mind, I suggest we use this addressing plan:

2a02:ec80:0::/44 - cloud (16 * /48)
    2a02:ec80:0::/48 - cloud eqiad1 (16 * /64)
        2a02:ec80:0:0::/64 - cloud-physical-eqiad1 -- includes physical transport networks (may have more than one), physical servers, virtual IP addresses for physical servers, and whatever we may need that correspond to physical hardware
            2a02:ec80:0:0:0::/80 - cloud-upstream1-eqiad1 -- physical connectivity between our external physical router and the prod core routers (example of thing that might happen sooner than later)
            2a02:ec80:0:0:1::/80 - cloud-transport1-eqiad1 -- physical connectivity between neutron and our external physical routers
            2a02:ec80:0:0:2::/80 - cloud-hosts1-eqiad1 -- physical connectivity for servers and supporting services, a subnet connected to our external physical router
        2a02:ec80:0:1::/64 - cloud-virtual-eqiad1  -- everything from neutron virtual routers to VMs, including virtual addresses inside openstack and other virtual services.
    2a02:ec80:1::/48 - cloud codfw1dev
        2a02:ec80:1:0::/64 - cloud-physical-codfw1dev -- (see eqiad1 equivalent)
        2a02:ec80:1:1::/64 - cloud-virtual-codfw1dev  -- (see eqiad1 equivalent)

I agree that option 2 is the way to go.

The complication is how to subnet them properly for both the short term (T245495 PoC) and the longer term, while keeping in mind v6 subnetting conventions (e.g. nothing smaller than a /64). I couldn't find much documentation on subnetting recommendations in my brief research.

For example we take eqiad's:

2a02:ec80:0::/48
    2a02:ec80:0::/49
        2a02:ec80::/56 - infrastructure and support networks (gives 256*/64)
        2a02:ec80:0:100::/56 - virtual networks (gives 256*/64)
            2a02:ec80:0:100::/64 - e.g. VMs flat network (similar to the 172.16.0.0/21 network)
    2a02:ec80:0:8000::/49 - reserved for future use

Which is very similar to your proposal, but with different mask lengths.

For now the first one would not be used (afaik) but will be if we move to a model where the whole cloud infra is behind its dedicated gear.
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Network_refresh#intermediate_router/firewall

Re-assigning to @faidon for approval as we're talking about long-term design and a lot of IPs (see also T245495 for context)

I agree on option 2 above that it makes sense to assign a /48 for cloud services at each site. Some people these days are assigning a /64 per-VM so we should provide space to cater for potential future cases such as that.

Ideally, we'd be able to assign the new cloud /48s from aggregates already announced at each site. But that's not possible.

If we need to announce new space it occurs to me that, rather than adding more /48s to the v6 table, we could instead use some of the RIPE space and allocate (for example) a new /40 for each of our sites. We have 2,048 x /40 available in our RIPE allocation, so assigning one per site leaves plenty for future POPs. We'd then have 256 x /48s for use internally at each site, the first going to cloud services. We may never need another /48 at any of them, but if we do it makes it simpler. And if we don't it doesn't matter, v6 space is designed to be wasted.
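The counts quoted here follow directly from the prefix lengths. A minimal sketch of the arithmetic, assuming a /29 allocation and /40 site blocks as described:

```python
# Prefix lengths taken from the comment above
RIPE_PREFIXLEN = 29   # the RIPE allocation
SITE_PREFIXLEN = 40   # proposed per-site aggregate
CLOUD_PREFIXLEN = 48  # per-site cloud block

# Each extra prefix bit doubles the number of subnets
site_blocks_available = 2 ** (SITE_PREFIXLEN - RIPE_PREFIXLEN)
slash_48s_per_site = 2 ** (CLOUD_PREFIXLEN - SITE_PREFIXLEN)

print(f"/40s in a /29: {site_blocks_available}")
print(f"/48s in a /40: {slash_48s_per_site}")
```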

Also, the way we've divided our current v6 allocations is based on geography. I think it makes sense to stick with that, rather than have some subnets at the first level under our RIR allocation done on a geographical basis, and some on a category/service basis (i.e. /44 for cloud services).

@aborrero I agree with Arzhel that no networks smaller than a /64 should be used. Your maths is slightly off though: there are 65,536 /64s in a /48, so plenty of space. I'd also not divide the /48 directly into /64s; instead, add a layer of hierarchy under the /48. Maybe sparsely allocate (i.e. leaving gaps for future growth) some of the 16 x /52s, each earmarked for a different use case (e.g. vlan-attached networks, infrastructure, VMs/containers etc.). I'd avoid using /49s etc.; we have enough space to avoid it, so segment at nibble boundaries.
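A hedged sketch of what that could look like, using Python's ipaddress module on the eqiad /48 proposed earlier. The choice of every fourth /52 is an assumption made purely for illustration, not an agreed plan; the use-case labels are the examples named above.

```python
import ipaddress

site = ipaddress.ip_network("2a02:ec80::/48")  # example /48 from the thread

# Number of /64s in a /48: one doubling per extra prefix bit
num_64s = 2 ** (64 - site.prefixlen)

# One layer of hierarchy on a nibble boundary: 16 x /52
slash_52s = list(site.subnets(new_prefix=52))

# Sparse allocation (assumed stride): use every fourth /52,
# leaving three unused /52s after each one for future growth.
use_cases = ["vlan-attached networks", "infrastructure", "VMs/containers"]
plan = dict(zip(use_cases, slash_52s[::4]))

print(f"/64s per /48: {num_64s}, /52s per /48: {len(slash_52s)}")
for use_case, prefix in plan.items():
    print(f"{prefix} - {use_case}")
```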

FWIW a good guide on this is Tom Coffeen's "IPv6 Address Planning": https://www.oreilly.com/library/view/ipv6-address-planning/9781491908211/

sparse_slash_52.png (667×893 px, 133 KB)

random_example_plan.png (868×893 px, 410 KB)

Ok, so the plan would be to have:

  • 2a02:ec80:0::/48 - cloud eqiad1
  • 2a02:ec80:1::/48 - cloud codfw1dev

Please confirm and request approvals as required.

My own preference would be to allocate larger ranges to each site as mentioned above, and allocate the cloud prefixes from within those geographic aggregates. Doesn't have to be that way of course, I guess we can see what the consensus is.

Sticking with my previous example:

2a02:ec80:1000::/40    eqiad
2a02:ec80:2000::/40    codfw
2a02:ec80:3000::/40    esams
2a02:ec80:4000::/40    ulsfo
2a02:ec80:5000::/40    eqsin

(there are 64 /40s in the first half of our RIPE /29 if allocated like this, with a gap of 15 /40s between each for future expansion, and the upper half of the /29 untouched).
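The figures in that parenthetical can be reproduced with a short sketch. It assumes the first half of the /29 is 2a02:ec80::/30, which is inferred from the prefixes used in this thread rather than stated outright.

```python
import ipaddress

# Assumed first half of the RIPE /29 (inferred from the examples above)
first_half = ipaddress.ip_network("2a02:ec80::/30")

# Every /40 in that half, in address order
all_40s = list(first_half.subnets(new_prefix=40))

# Stepping the third hextet by 0x1000 means taking every 16th /40,
# which leaves 15 unused /40s after each allocated one.
STRIDE = 16
sparse_slots = all_40s[::STRIDE]

print(f"/40s in the first half: {len(all_40s)}")
print(f"sparse slots available: {len(sparse_slots)}")
print("first few slots:", ", ".join(str(n) for n in sparse_slots[:4]))
```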

So then possibly:

2a02:ec80:1001::/48    cloud eqiad1
2a02:ec80:2001::/48    cloud codfw1dev

Leave it to us to discuss and we will get back to you.

Having discussed with @ayounsi we were thinking it may be better to assign aggregates less sparsely, as follows:

2a02:ec80:100::/40	eqiad
2a02:ec80:200::/40	codfw
2a02:ec80:300::/40	esams
2a02:ec80:400::/40	ulsfo
2a02:ec80:500::/40	eqsin

This would allow us to move beyond "site 9" and have 2a02:ec80:1000::/40, 2a02:ec80:1100::/40 and 2a02:ec80:1200::/40 for sites 10, 11 and 12 respectively (and up to '99'). It does remove the gaps between blocks / room to expand, but given the size of each /40 this is probably never going to be an issue.

In that circumstance we could allocate these for cloud:

2a02:ec80:101::/48    cloud eqiad1
2a02:ec80:201::/48    cloud codfw1dev

But again leave it with us to discuss more widely in the team, just documenting here for visibility.
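As a sketch of the denser numbering, the site number's decimal digits can be written into the top byte of the third hextet, with the cloud /48 taken as the second /48 of each aggregate. The function names and the site-number mapping (1 = eqiad, 2 = codfw) are read off the lists above and are otherwise hypothetical.

```python
import ipaddress

BASE = int(ipaddress.ip_address("2a02:ec80::"))

def site_aggregate(site_number: int) -> ipaddress.IPv6Network:
    """Return the /40 for a site, e.g. 1 -> 2a02:ec80:100::/40."""
    if not 1 <= site_number <= 99:
        raise ValueError("scheme covers sites 1 to 99")
    # Reuse the decimal digits as hex digits: site 10 -> 0x10
    top_byte = int(str(site_number), 16)
    # The top byte of the third hextet sits at bits 88..95
    return ipaddress.IPv6Network((BASE | (top_byte << 88), 40))

def cloud_prefix(site_number: int) -> ipaddress.IPv6Network:
    """Return the second /48 of the site aggregate (index 1)."""
    return list(site_aggregate(site_number).subnets(new_prefix=48))[1]

for number, name in [(1, "eqiad"), (2, "codfw"), (10, "site 10"), (12, "site 12")]:
    print(f"{site_aggregate(number)}  {name}")
print(f"{cloud_prefix(1)}  cloud eqiad1")
print(f"{cloud_prefix(2)}  cloud codfw1dev")
```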

@faidon wondering if you are ok with the plan set out above? Any comments / feedback welcome.

Prioritization-wise, is there a reason why we're going for an IPv6 allocation while our IPv4 segmentation is still in flux or in progress? I fear that we're adding more features/problems to the mix without having set and implemented clear boundaries first, and making an already complex situation more complex (e.g. more filters to maintain) so I'd like to hear more about those trade offs and perhaps wait.

Sizing-wise it all makes sense to me - thanks for all the details @cmooney!

What I'm not entirely sure about (but not negative per se) is this space being within each region's geographic aggregate. We've had different routing rules for cloud VPS in the past (cold potato, announcing the space to e.g. our peers in Amsterdam and carrying traffic in our own transport), and we've also been very deliberate in trying to keep the IP space isolated compared to our production traffic (i.e. by moving VMs out of the 208.80.152.0/22 space). Directionally, I'd like for us to be treating our public cloud like a) a customer b) any other public cloud - so perhaps it's worth thinking about one large supernet for customers, under which cloud could get a large allocation, with specific assignments for their regions (which may be hosted in our data centers, or even outside of our production data centers in the future). Happy to discuss those trade offs further though :)

Thanks @faidon for the comments. In terms of why it is being discussed, I'm trying to advance tasks outstanding for WMCS (as discussed by myself and @joanna_borun), and the IPv6 stuff seemed like something that could be progressed. Providing new blocks for cloud inevitably involves using space from our RIPE /29 allocation, hence the discussion on how to subnet it.

Understood that there is a question regarding IPv4 also, more than happy to park this for now until we have decided how to segment the IPv4. That said any eventual decision in terms of v4 should not really impact the v6 decision. We shouldn't compromise how we segment the v6 space by trying to align it to the plan for v4, which will inevitably be constrained by scarcity. Filter rules should be as close as possible though, so maybe it is best to wait.

Keeping the "customer" ranges separate from production - and the ability to implement different policy / announce them to the internet separately - makes perfect sense. That's perfectly valid, and if there is precedent let's stick with it. Keeping the goal to announce large aggregates, minimizing our total number of prefixes in the dfz, we could stick with a similar scheme but announce a /40 for customers from each site (and a separate /40 for prod if ever needed). For example allocate the first half of our /29 like this:

2a02:ec80::/32          WMF Production [Reserved for future expansion]
2a02:ec81::/32          TBD
2a02:ec82::/32          WMF Hosted Customers
2a02:ec83::/32          TBD

We could sub-allocate /40s for "customers" on a geographic basis, same scheme as before:

2a02:ec82:100::/40      eqiad customers
2a02:ec82:200::/40      codfw customers
2a02:ec82:300::/40      esams customers
2a02:ec82:400::/40      ulsfo customers
2a02:ec82:500::/40      eqsin customers

And assign specific ranges to WMCS from those as needed, again similar to before:

2a02:ec82:100::/48      cloud eqiad1
2a02:ec82:200::/48      cloud codfw1dev

If cloud eventually run their own networks we could give them addresses from the remaining space, and not touch the existing customer aggregates we announce. There's loads of room for that. I'd probably advise we go to ARIN or RIPE and get them their own AS for that too, but not something to worry about today.
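The nesting of that proposal can be sanity-checked with a small sketch. The prefixes are copied from the lists above; the dictionary layout is just one illustrative way to hold them.

```python
import ipaddress

production = ipaddress.ip_network("2a02:ec80::/32")
customers = ipaddress.ip_network("2a02:ec82::/32")

# Per-site customer aggregates and the cloud /48 carved from each
site_aggregates = {
    "eqiad": ipaddress.ip_network("2a02:ec82:100::/40"),
    "codfw": ipaddress.ip_network("2a02:ec82:200::/40"),
}
cloud_blocks = {
    "eqiad": ipaddress.ip_network("2a02:ec82:100::/48"),
    "codfw": ipaddress.ip_network("2a02:ec82:200::/48"),
}

# Each cloud /48 should sit inside its site /40, and each /40 inside
# the customer /32, which in turn must not overlap production.
nesting_ok = all(
    cloud_blocks[site].subnet_of(site_aggregates[site])
    and site_aggregates[site].subnet_of(customers)
    for site in site_aggregates
)
isolated_from_prod = not customers.overlaps(production)

print(f"nesting consistent: {nesting_ok}")
print(f"customer space separate from production: {isolated_from_prod}")
```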

@cmooney That looks cleaner indeed.

@faidon We're moving away from 172.16/12 IPs being able to reach the Wikis, which means VM traffic needs to be NATed and loses useful troubleshooting information ("which VM did edit X?").
There are several good options to solve that (see sub tasks of T209011), one of them being using IPv6, as each VM would have a distinct IP.
That's why we shouldn't wait to solve the IPv4 segmentation before doing work on IPv6.

faidon changed the task status from Open to Stalled. Aug 27 2021, 7:03 PM

There are some ongoing conversations with the WMCS team regarding the placement of their infrastructure in our network/infrastructure, and I think it would be good to resolve that first, before moving forward on implementing this. Setting this to Stalled - hope that makes sense!