
Switch BGP (EVPN) topology between rows/spines at core sites
Closed, Resolved, Public

Description

True to form this task has way too much background info! The actual work required should be fairly straightforward, but I want to lay it out and make sure everyone is happy with the direction.

With the new switches being prepped in codfw for rows C and D, we have to tackle the question of how we interconnect them.

Physical Topology

In brief there are a few options (as discussed in the design doc):

Keep using Core Routers as 'super spines'

This seems wasteful given the cost of CR ports, and the availability of 100G links directly on our Spines.

Add new dedicated 'super spines'

A point may come where we need to do this to provide sufficient bandwidth east-west across the datacentre as usage grows, or if our physical footprint expands so we have more rows and more Spines. But at current usage levels it's not needed, and we have not planned for it or purchased any equipment.

Connect every Leaf to every Spine

In a pinch we could do this in codfw, as not all rows have 8 racks. It would create a simple 2-tier Clos network where Leaf switches in server racks have 4x100G uplinks, one to each Spine. It involves a whole lot of new cabling, and uses up most of the ports on our Spines, restricting us from adding more upstream connections to the core routers. It's not possible at all in Eqiad, as we have more than 31 racks, so there are not enough ports on each Spine to connect every Leaf.

Use peer links between Spines

This option keeps the Leaf <-> Spine links the same as if we had a 'super spine' layer, with two Spines aggregating the traffic for 2 rows, or up to 16 racks. Instead of adding another tier to the Clos, however, we directly connect the Spines in a full mesh.

With the current number of rows in Eqiad we're planning for 6 Spines in total, so creating a mesh like this allows for up to 3x100G between each pair of Spines (3x100G to each of the 4 Spines in remote rows, i.e. 1.2T), while leaving 4x100G for core router uplinks (or potentially transport circuits).


In the first instance we will deploy 1x100G from each Spine to each Spine in remote rows. That falls short of the ideal Spine/Leaf design with no over-subscription between layers (excluding the Leaf -> Server one). But it provides 400G total between every pair of rows (each of the two local Spines has a 100G link to each of the two remote Spines), which is more than double what is in place at the moment on the VC -> core router uplinks. We also have room to grow by increasing the number of Spine<->Spine links, and if we outgrow it physically we can logically move to a 'super spine' setup without a redesign.

Logical Topology

The question is then what way we want to configure the devices in terms of routing.

Current setup

To recap the existing setup, for instance in codfw rows A/B, we have the following (roughly sketched in config form after the list):

  • 2 Spine switches, 14 Leaf switches
  • Every Leaf has 2 uplinks, one to each Spine
  • All running OSPF in a single area 0, distributing 'underlay' IPv4 link prefixes and loopbacks
  • IBGP peering from each Leaf to both directly connected Spines
  • Spines acting as route-reflectors, propagating routes learnt from one Leaf to the others
  • Spines have no direct physical connection, but do have a multi-hop IBGP session
    • Each Spine uses a different route-reflector cluster-id, so each accepts the routes the other reflects to it
    • This covers the edge-case failure where one Leaf gets disconnected from Spine1, and another gets disconnected from Spine2
    • In that case both Leaf devices will still learn all routes, and connectivity will work although we will have some valley-routing
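
For illustration, this is roughly what the above looks like on one of the Spines in Junos terms. The group names, interfaces and loopback addresses below are made up for the example rather than taken from the real config, and the EVPN/VXLAN specifics (route-distinguishers, vrf-targets etc.) are omitted:

```
# Rough sketch only - illustrative names/addresses, not the generated config
protocols {
    ospf {
        area 0.0.0.0 {
            interface lo0.0 {
                passive;                 # advertise the loopback into area 0
            }
            interface et-0/0/0.0;        # Leaf-facing p2p link (one per Leaf)
        }
    }
    bgp {
        group EVPN-LEAF {                # IBGP to directly connected Leafs
            type internal;
            local-address 10.0.0.1;      # this Spine's loopback
            family evpn {
                signaling;
            }
            cluster 10.0.0.1;            # cluster-id unique to this Spine
            neighbor 10.0.0.11;          # Leaf loopbacks (RR clients)
            neighbor 10.0.0.12;
        }
        group EVPN-SPINE {               # multi-hop IBGP to the other Spine
            type internal;
            local-address 10.0.0.1;
            family evpn {
                signaling;
            }
            neighbor 10.0.0.2;           # other Spine's loopback, reached via OSPF
        }
    }
}
```

The cluster statement on the Leaf-facing group is what makes the Leafs RR clients. Because the second Spine reflects with a different cluster-id, neither Spine sees its own cluster-id in the CLUSTER_LIST of the routes it receives from the other, so those routes are accepted rather than dropped.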

Proposed model

In terms of how we connect the remote rows I would propose the following:

  • Extend the OSPF area 0 to include all the new devices
    • The total number of devices in a DC is well below the point where we'd need to consider breaking it up
  • Keep the current pattern whereby each Leaf switch only has IBGP peerings to its directly connected Spines
  • Create IBGP peerings between the directly-connected Spines
  • Continue using a unique route-reflector cluster-id on each Spine, ensuring the Spines accept the Leaf routes they reflect to each other

Another way to think about this is that we have a pair of Spines aggregating traffic from two rows of Leafs, acting as route-reflectors for the Leafs in those rows. The Spines learn routes for remote racks from the other Spines, which they also peer with, and announce those downstream to their direct route-reflector clients (Leafs).
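
Purely as an illustration of the proposal (again with made-up names and addresses), the BGP side on one Spine serving rows A/B would look something like this - the RR-client group restricted to its own Leafs, plus a non-client IBGP group forming the full mesh to the other Spines:

```
# Rough sketch only - illustrative names/addresses
protocols {
    bgp {
        group EVPN-LEAF {                # RR clients: only the Leafs in rows A/B
            type internal;
            local-address 10.0.0.1;
            family evpn {
                signaling;
            }
            cluster 10.0.0.1;            # still one unique cluster-id per Spine
            neighbor 10.0.0.11;          # rows A/B Leaf loopbacks only
            neighbor 10.0.0.12;
        }
        group EVPN-SPINE {               # IBGP full mesh to every other Spine
            type internal;
            local-address 10.0.0.1;
            family evpn {
                signaling;
            }
            neighbor 10.0.0.2;           # the other rows A/B Spine
            neighbor 10.0.0.3;           # rows C/D Spines
            neighbor 10.0.0.4;
        }
    }
}
```

Leaf routes reflected by one Spine carry only that Spine's cluster-id, so every other Spine accepts them over the mesh and re-advertises them down to its own clients.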

There would be other ways to approach it. EBGP between Spines, separate IGP domains, unicast peering between Spines rather than EVPN etc. But for me the above approach seems simpler and more flexible.

The automation changes from the current setup - where at a given site we just have a list of route-reflectors and clients - are minimal. We just need to tweak things so that each Spine only configures the Leaf switches it is directly connected to as RR clients, rather than a single site-wide list of RR clients. I'd propose we make some simple grouping in the YAML for this.

Event Timeline

cmooney created this task.

Change #1032505 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Set AS number for BGP EVPN devices globally at site level

https://gerrit.wikimedia.org/r/1032505

cmooney raised the priority of this task from Low to Medium. May 16 2024, 3:15 PM

Change #1032505 merged by jenkins-bot:

[operations/homer/public@master] Set AS number for BGP EVPN devices globally at site level

https://gerrit.wikimedia.org/r/1032505

Change #1034889 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Change EVPN BGP YAML to group into clusters and add codfw switches

https://gerrit.wikimedia.org/r/1034889

Change #1034889 merged by jenkins-bot:

[operations/homer/public@master] Change EVPN BGP YAML to group into clusters and add codfw switches

https://gerrit.wikimedia.org/r/1034889

Folk seem happy enough with this approach so I'll close this; the automation has been added and is working.