This doesn't exactly fit with the parent task but it is related work.
Now that we have completed the move of all hosts in codfw rows A and B to the new top-of-rack switches (lsws), we should consider changing the BGP peering for hosts on the public1-a-codfw and public1-b-codfw vlans so they peer with their top-of-rack devices rather than the core routers.
The only devices we have that match are our DNS hosts and the DOH VMs. I'll use the DNS hosts as examples in this task, but the same applies to both.
Optimisation
Ultimately this makes sense to ensure we use the optimal path across our datacentre to reach BGP-announced anycast IPs. For example, a host in row B making a DNS request that lands on dns2004 will take the path:
hostX -> LEAF SW -> SPINE SW -> CR -> SPINE SW -> LEAF SW -> dns2004
Whereas if dns2004 had a BGP peering with its local LEAF SW, the switching layer would know where the VIP needed to be routed, and the packets would not need to go up to the core routers and back:
hostX -> LEAF SW -> SPINE SW -> LEAF SW -> dns2004
In fact, for hosts connected to the same leaf switch as dns2004, traffic would go up to that switch and straight back down.
Making the change would also mean the core routers would no longer need a 'leg' in these vlans, as they currently do; that leg is required because the hosts set their own vlan IP as the BGP next-hop. Removing it would also allow us to make the SPINE -> CR links proper routed L3 links, instead of trunk ports with BGP peering over an Xlink vlan and irb interface.
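For illustration, the change at the SPINE -> CR boundary would look something like the below sketch (interface, vlan, and unit names are hypothetical, and the addresses are placeholders from the documentation range):

```
# Today (sketch): trunk port carrying an Xlink vlan, BGP peering over an irb
set interfaces et-0/0/48 unit 0 family ethernet-switching interface-mode trunk
set interfaces et-0/0/48 unit 0 family ethernet-switching vlan members xlink-cr
set vlans xlink-cr vlan-id 400
set vlans xlink-cr l3-interface irb.400
set interfaces irb unit 400 family inet address 192.0.2.0/31

# After (sketch): a plain routed point-to-point L3 link
delete interfaces et-0/0/48 unit 0 family ethernet-switching
set interfaces et-0/0/48 unit 0 family inet address 192.0.2.0/31
```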
Anycast
This does bring up an issue regarding anycast, and load-sharing between boxes.
With the changes described further down we can ensure, at the CR level, that routes learnt directly from hosts in non-upgraded rows will have equal preference to those in upgraded rows. That means requests from the internet will be equally load-balanced across all devices by the CRs (same as now).
The change would affect internal requests, however. Currently all internal dns queries make it to the CRs, which load-balance them. If the switches learn the BGP routes directly they will instead route packets from a given host to what they see as the closest destination.
With the current hybrid setup (half the rows on L3 switches, half on the old L2 setup), that means internal requests would no longer be equally spread. dns2004, in row B, would end up getting all the queries from hosts in rows A/B, with requests from hosts in C/D still being equally split across dns2004/dns2005/dns2006. There is no loss of redundancy: should dns2004 fail, the switches in A/B will use the longer path via the CRs to reach dns2005/dns2006.
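As a rough worked example of that imbalance, assuming internal queries originate uniformly across the four rows (a simplification for illustration only):

```python
# Back-of-envelope model of internal query distribution in the hybrid setup.
# Assumes equal query volume from each of rows A-D (illustrative only).
share = {"dns2004": 0.0, "dns2005": 0.0, "dns2006": 0.0}
for row in ["A", "B", "C", "D"]:
    if row in ("A", "B"):
        # Upgraded rows: the switch fabric picks the closest host, dns2004 (row B)
        share["dns2004"] += 1 / 4
    else:
        # Rows C/D: queries still reach the CRs, which split them three ways
        for host in share:
            share[host] += 1 / 4 / 3
print(share)  # dns2004 gets ~67%, dns2005/dns2006 ~17% each
```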
So we need to decide if this imbalance for local queries is going to be an issue. Eventually, with rows C/D upgraded to the new network model, things improve, although the closest host will still be preferred. @Traffic team, do you have any advice on whether this is going to be a problem?
Mechanics of change
The current Bird Anycast role doesn't seem to support what we need. In esams and drmrs we do have the DNS servers peering with their top-of-rack switch, but this is controlled at the site level. Specifically, if profile::bird::neighbors_list is defined in hiera at a given site, the IPs in it will be used; if not, the role will use the default gateway IP.
For codfw rows A/B we want the latter, but we need to keep the neighbors_list with CR IPs for the hosts in rows C/D. So we would need a toggle that allows us to ignore the presence of profile::bird::neighbors_list for specific hosts.
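As a sketch of what that could look like in hiera (the toggle key name is hypothetical, and the IPs are placeholders from the documentation range):

```
# Site-level hiera for codfw (existing key, kept for rows C/D hosts):
profile::bird::neighbors_list:
  - 192.0.2.1    # cr1-codfw vlan IP (placeholder)
  - 192.0.2.2    # cr2-codfw vlan IP (placeholder)

# Hypothetical per-host override for rows A/B, making the role fall
# back to the default-gateway (i.e. top-of-rack) peering:
profile::bird::ignore_neighbors_list: true
```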
Equal Cost on CRs
From the CR routers the AS-path length will be equal for routes learnt directly from C/D hosts and those propagated through the Spine switches in rows A/B. This is because on the CRs we already have the following statement (to deal with some quirks of our confederation setup across the WAN and ensure the local hosts are used in esams/drmrs):
set policy-options policy-statement anycast_import term anycast4 then as-path-expand last-as count 1
The specific ASNs in the paths will be different, however, so the CRs won't load-share by default. We need to add the following statement to the Anycast4 group so it will consider the above routes equal despite the AS paths being different (although of the same length). The same is also needed in the 'Switch' group on the CRs, but that's already there:
set protocols bgp group Anycast4 multipath multiple-as
There is also the problem of our route optimisation to avoid non-valley-free routing in the case of a SPINE->LEAF link failure (see T332781). Because of it, a BGP MED (derived from the OSPF cost within the switching fabric) is present on routes learnt from the SPINE layer. Even with the ASNs considered equal, this would be used as a tie-break, so we'd need to equalise the MEDs too. This could be done by adding a term to our policy at the Spine layer before the default one:
set policy-options as-path anycast ".* 64605$"
set policy-options policy-statement core_evpn_out term overlay_routes from protocol bgp
set policy-options policy-statement core_evpn_out term overlay_routes from protocol evpn
set policy-options policy-statement core_evpn_out term overlay_routes from as-path anycast
set policy-options policy-statement core_evpn_out term overlay_routes then accept
Is it worth it
To be honest, having gone through this, I've mostly created the task to document the situation. Considering the routing changes, the added complexity in policies and puppet, and the potential to reintroduce valley routing in the case of a LEAF->SPINE failure, I think it's probably not worth it.
Happy to hear what others think; it's entirely possible and does bring some improvements. But overall I suspect things are best left as they are for now, and we can change the BGP peering on the public vlans to the leaf switches once the whole DC has been upgraded. Optimising the routing while we have the hybrid setup seems overly complex for the benefit.
So unless anyone objects I'll set this to declined I think.