
Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw
Open, Low, Public

Description

This doesn't exactly fit with the parent task but it is related work.

Now that we have completed the move of all hosts in codfw rows A and B to the new top-of-rack switches (lsws), we should consider changing the BGP peering for hosts on the public1-a-codfw and public1-b-codfw vlans so they peer with their top-of-rack devices rather than the core routers.

The only devices we have matching that description are our DNS hosts and the DOH VMs. I'll use the DNS hosts as examples in this task, but the same applies to both.

Optimisation

Ultimately this makes sense to ensure we are using the optimal path across our datacentre to reach BGP-announced anycast IPs. For example, a host in row B making a DNS request that goes to dns2004 currently takes the path:

hostX -> LEAF SW -> SPINE SW -> CR -> SPINE SW -> LEAF SW -> dns2004

Whereas if dns2004 had a BGP peering with its local LEAF SW, the switching layer would know where the VIP needed to be routed, and the packets would not need to go to the core routers and back:

hostX -> LEAF SW -> SPINE SW -> LEAF SW -> dns2004

In fact, for hosts connected to the same leaf switch as dns2004, traffic would simply go up to that switch and back down to dns2004.

Making the change would also mean the core routers would no longer need a 'leg' in these vlans, as they currently do; that leg is only needed because the hosts set their IP on that vlan as the BGP next-hop. Removing it would also allow us to make the SPINE -> CR links proper routed L3 links, instead of a trunk port with BGP peering over an Xlink vlan & irb interface.
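As a rough sketch of that last point (interface names, vlan-id and addressing below are placeholders, not our actual config), a SPINE -> CR port today looks something like:

set interfaces et-0/0/48 unit 0 family ethernet-switching interface-mode trunk
set interfaces et-0/0/48 unit 0 family ethernet-switching vlan members xlink
set vlans xlink vlan-id 400
set vlans xlink l3-interface irb.400
set interfaces irb unit 400 family inet address 192.0.2.1/31

With the CR leg gone, the same port could become a plain routed interface, with the BGP session moving to the interface address:

delete interfaces et-0/0/48 unit 0 family ethernet-switching
set interfaces et-0/0/48 unit 0 family inet address 192.0.2.1/31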

Anycast

This does bring up an issue regarding anycast, and load-sharing between boxes.

With the changes described further down we can ensure, at the CR level, that routes learnt directly from hosts in non-upgraded rows will have equal preference to those in upgraded rows. That means requests from the internet will be equally load-balanced across all devices by the CRs (same as now).

The change would affect internal requests, however. Currently all internal DNS queries make it to the CRs, which load-balance them. If the switches learn the BGP routes directly they will instead route packets from a given host to what they see as the closest destination.

With the current hybrid setup (half the rows on L3 switches, half on the old L2 setup), that means internal requests would no longer be equally spread. dns2004, in row B, would end up getting all the queries from hosts in rows A/B, with requests from hosts in C/D still being equally split across dns2004/dns2005/dns2006. There is no loss of redundancy: should dns2004 fail, the switches in A/B will use the longer path via the CRs to get to dns2005/dns2006.

So we need to decide if this imbalance for local queries is going to be an issue. Eventually, with rows C/D upgraded to the new network model, things improve, although the closest host will still get used. @Traffic team, have you any advice on whether this is going to be a problem or not?

Mechanics of change

The current Bird Anycast role doesn't seem to support what we need. In esams and drmrs we do have the DNS servers peering with their top-of-rack switch, but the control of this is done at the site level. Specifically, if profile::bird::neighbors_list is defined in Hiera for a given site the IPs in that list will be used; if not, the role will use the default gateway IP.

For codfw rows A/B we want the latter, but we need to keep the neighbors_list with CR IPs for the hosts in rows C/D. So we would need a toggle that allows us to ignore the presence of profile::bird::neighbors_list for specific hosts.
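To make that concrete, a rough sketch of the Hiera data is below. The file paths and IPs are illustrative only, and profile::bird::ignore_neighbors_list is a hypothetical key that does not exist in the role today:

# hieradata/codfw.yaml (illustrative path) - site-level list of CR peer IPs, as today
profile::bird::neighbors_list:
  - 198.51.100.1   # placeholder CR IP, not the real address
  - 198.51.100.2   # placeholder CR IP, not the real address

# hieradata/hosts/dns2004.yaml (illustrative path) - hypothetical per-host toggle so a row A/B
# host ignores the site-level list and falls back to its default gateway (the ToR switch)
profile::bird::ignore_neighbors_list: true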

Equal Cost on CRs

From the CRs the AS-path length will be equal for routes learnt directly from C/D hosts and those propagated through the Spine switches in rows A/B. This is because we already have the below statement on the CRs (added to deal with some quirks of our confederation setup across the WAN and to ensure the local hosts are used in esams/drmrs):

set policy-options policy-statement anycast_import term anycast4 then as-path-expand last-as count 1

The specific ASNs in the paths will be different, however, and thus the CRs won't load-share by default. We need to add the following statement to the Anycast4 group so the CRs will consider the above routes equal despite the AS paths differing (although being of the same length). The same is also needed in the 'Switch' group on the CRs, but that's already there:

set protocols bgp group Anycast4 multipath multiple-as
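For reference, the equivalent statement already in place for the 'Switch' group would presumably be along these lines:

set protocols bgp group Switch multipath multiple-as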

There is also the problem of our route optimisation to avoid valley routing in the case of a SPINE->LEAF link failure (see T332781). This means a BGP MED is set on routes learnt from the SPINE layer (equivalent to the OSPF cost within the switching fabric). Even with the ASNs considered equal this would be used to tie-break, so we'd need to equalize the MEDs also. This could be done by adding a term in our policy at the Spine layer before the default one:

set policy-options as-path anycast ".* 64605$"
set policy-options policy-statement core_evpn_out term overlay_routes from protocol bgp
set policy-options policy-statement core_evpn_out term overlay_routes from protocol evpn
set policy-options policy-statement core_evpn_out term overlay_routes from as-path anycast
set policy-options policy-statement core_evpn_out term overlay_routes then accept
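One thing the set commands above don't capture: a newly configured term is appended at the end of the policy, so to have it evaluated before the default term we'd also need to reorder it with something like the below, where default_term is just a stand-in for whatever our existing catch-all term in core_evpn_out is actually called:

insert policy-options policy-statement core_evpn_out term overlay_routes before term default_term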

Is it worth it

Tbh, after going through this I've mostly just created the task to document the situation. Considering the routing changes, the added complexity in routing policies and puppet, as well as potentially reintroducing valley routing in the case of a LEAF->SPINE link failure, I think it's probably not worth it.

Happy to hear what others think; it's entirely possible and does bring some improvements. But overall I suspect things are probably best left as they are for now, and we can change the BGP peering on the public vlans to the leaf switches once the whole DC has been upgraded. Optimizing the routing while we have the hybrid setup seems overly complex for the result, to me.

So unless anyone objects I think I'll set this to declined.

Event Timeline

cmooney created this task.

So we need to decide if this imbalance for local queries is going to be an issue.

I think load is the main thing to look at. I briefly thought about cold caches but if I understand correctly, all servers will keep receiving some traffic.

For codfw rows A/B we want the latter, but we need to keep the neighbors_list with CR IPs for the hosts in rows C/D. So we would need a toggle that allows us to ignore the presence of profile::bird::neighbors_list for specific hosts.

We can define per-host Hiera keys, and empty lists as well, so this would need to be tested, but I don't think we need to implement a new feature.

Is it worth it

Good question :) If there was no change needed in the router config I'd say sure. But now I'm also less sure. As it's temporary until all the rows are upgraded, it seems to make sense to wait, or at least to see how we tackle future public hosts.

We can define per-host Hiera keys, and empty lists as well, so this would need to be tested, but I don't think we need to implement a new feature.

That's an option actually: remove the key from the site level and add it at the host level for the servers in the non-migrated rows.
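For illustration (file path and IPs here are placeholders, not the real values), something like:

# hieradata/hosts/dns2005.yaml - row C/D host keeps peering with the CRs
profile::bird::neighbors_list:
  - 198.51.100.1   # placeholder CR IP
  - 198.51.100.2   # placeholder CR IP

with nothing defined at the site level, so the row A/B hosts fall back to the default gateway.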

Is it worth it

Good question :) If there was no change needed in the router config I'd say sure. But now I'm also less sure. As it's temporary until all the rows are upgraded, it seems to make sense to wait, or at least to see how we tackle future public hosts.

Yeah, I'm happy to wait; I don't think it's a major deal either way. Given codfw rows C/D are due to be refreshed this calendar year we can hopefully tidy things up then.