Plan
The goal is to clear all BGP EVPN sessions on ssw1-d1-eqiad to force new VXLAN tunnels to be established. This works around the problem we've had with some switches not correctly relaying DHCP packets.
By draining the spine switch first, we can do this work without disrupting communication to any servers. This means we will not need to organise maintenance windows for each switch with the relevant SRE teams.
Date/Time
Planning to do this work at 10am UTC on Tuesday March 17th.
Process
- Set cr2 to be VRRP master for all VLANs
  - This ensures row a/b hosts send traffic to cr2, which will then route it via ssw1-d8
  - It also ensures that only ssw1-d8 learns the VRRP gateway MAC for the row c/d VLANs, so leaf switches will not see a route to it from ssw1-d1
- Disable VRRP on the row-wide VLAN sub-interfaces of cr1-eqiad et-1/0/5 - P89818
  - This is needed because we don't want to create a VRRP "split brain" scenario (P89795)
- Disable the EVPN IBGP peering between ssw1-d1 and ssw1-d8:
  - ssw1-d1: set / network-instance default protocols bgp neighbor 10.64.128.18 admin-state disable
  - This ensures that ssw1-d8 does not reflect routes from ssw1-d1 to the leaf switches
  - As a result, clearing the ssw1-d1 BGP session to a leaf will remove all routes on that leaf that use ssw1-d1 as next-hop
- Increase the OSPF cost on the far side of all transport links terminating on cr1
  - This ensures traffic from other sites to the row c/d VLANs arrives on cr2 instead, and leaves via ssw1-d8
- Adjust the ssw1-d1 BGP config so it neither accepts routes from, nor announces routes to, cr1 or the row e/f spines
  - By changing the import/export policies to 'NONE' - P89816
- Adjust the cr1 BGP export policy towards row e/f and cloudsw so it no longer exports directly connected routes
  - cr1-eqiad: delete policy-options policy-statement Switch_out term direct
  - This ensures no L3 switches use cr1 to reach the row c/d VLANs; instead they use their cr2 uplink
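Why the OSPF cost goes on the far side of the cr1 transport links: OSPF applies an interface's cost to traffic leaving through that interface. The cost that makes a remote router prefer cr2 therefore has to be set on the remote end of its link to cr1. The minimal sketch below runs Dijkstra's algorithm over a toy topology to show the path flipping from cr1 to cr2. It is only a model. The node name `remote-cr`, the single `row-c/d` node, and all costs are hypothetical and do not reflect our real metrics.

```python
import heapq


def shortest_path(graph, src, dst):
    """Dijkstra over a directed graph {node: {neighbour: cost}}.

    Each cost sits on the outgoing side of a link, as OSPF interface costs do.
    Returns (total_cost, [src, ..., dst]), or (inf, []) if dst is unreachable.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == dst:
            break  # the first time dst is popped, its distance is final
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if dst not in dist:
        return float("inf"), []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]


def topology(far_side_cost_to_cr1):
    # Hypothetical toy topology: "remote-cr" stands in for a router at another
    # site, and "row-c/d" for the row c/d VLANs reachable from both CRs.
    # Only the remote router's outgoing cost towards cr1 is varied.
    return {
        "remote-cr": {"cr1-eqiad": far_side_cost_to_cr1, "cr2-eqiad": 12},
        "cr1-eqiad": {"remote-cr": 10, "cr2-eqiad": 1, "row-c/d": 1},
        "cr2-eqiad": {"remote-cr": 12, "cr1-eqiad": 1, "row-c/d": 1},
    }


before = shortest_path(topology(10), "remote-cr", "row-c/d")
after = shortest_path(topology(1000), "remote-cr", "row-c/d")
print(before)  # (11, ['remote-cr', 'cr1-eqiad', 'row-c/d'])
print(after)   # (13, ['remote-cr', 'cr2-eqiad', 'row-c/d'])
```

Raising the cost on cr1's own side of the link (`cr1-eqiad -> remote-cr`) would leave `before` unchanged, because that cost only applies to traffic leaving cr1.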
Result
At this point the graphs should show traffic on the cr1 -> ssw1-d1 link reduced to zero, because:
- Traffic from rows a/b will use cr2 as its gateway (due to VRRP), and cr2 will use its link to ssw1-d8
- Traffic from rows e/f will use cr2 to get to rows c/d, as we have stopped exporting "direct" routes from cr1 in BGP
- Traffic from remote sites will route to cr2 over the WAN, due to the increased OSPF cost towards cr1
- Traffic to row c/d per-rack VLANs will route to cr2, as cr1 no longer receives those prefixes in BGP due to the ssw1-d1 policy change
- Traffic for the CR IP gateways will route to ssw1-d8 from every leaf, as that is where the VRRP MAC is learnt
- Traffic from row c/d per-rack VLANs to external IP destinations will route from the leaf switches to ssw1-d8, as ssw1-d1 is no longer accepting those ranges in BGP
- Traffic between the row c/d row-wide VLANs will use cr2 as its gateway, which will hairpin it back down through its link to ssw1-d8
Every leaf will still have a VXLAN tunnel to ssw1-d1, but this should only be due to the unicast MAC addresses learnt on that spine from the CRs. We should check that these are the only routes with the spine as next-hop:
# NOTE: spacing in the grep pattern may differ depending on SRL column widths; we want to match the IP in the 'next-hop' column
show network-instance default protocols bgp routes evpn route-type 2 summary | grep "| 0 | 10.64.128.17 "
show network-instance default protocols bgp routes evpn route-type 5 summary | grep "| 0 | 10.64.128.17 "
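The grep pattern above depends on SRL column widths. A less fragile check is to split each table row on '|' and compare only the next-hop cell. This is a minimal Python sketch under stated assumptions: the summary output is assumed to be an ASCII table with '|'-separated cells and a header cell named "Next-Hop". The function name `rows_with_next_hop` and the sample table are hypothetical, so check the column name against real output before relying on it.

```python
def rows_with_next_hop(table_text, next_hop, column="next-hop"):
    """Return data rows of an SRL-style ASCII table whose `column` cell equals next_hop.

    Assumption: the first '|'-delimited row containing a cell equal to `column`
    (case-insensitive) is the header. Border lines ('+---+') and any text
    outside the table are skipped.
    """
    target = column.lower()
    header_idx = None
    matches = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # borders, blank lines, summary footer
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header_idx is None:
            lowered = [c.lower() for c in cells]
            if target in lowered:
                header_idx = lowered.index(target)
            continue
        if header_idx < len(cells) and cells[header_idx] == next_hop:
            matches.append(cells)
    return matches


# Made-up sample in the shape we assume the summary output takes; not real data.
sample = """
+--------+--------+-------------------+---------+---------------+
| Status | Tag-ID | MAC-address       | Path-id | Next-Hop      |
+========+========+===================+=========+===============+
| u*>    | 0      | aa:bb:cc:00:00:01 | 0       | 10.64.128.17  |
| u*>    | 0      | aa:bb:cc:00:00:02 | 0       | 10.64.128.18  |
+--------+--------+-------------------+---------+---------------+
"""

print([row[2] for row in rows_with_next_hop(sample, "10.64.128.17")])
# prints ['aa:bb:cc:00:00:01']
```

In practice we would save each command's output to a file and pass its contents in. The route-type 2 matches should then contain only the CR MAC addresses, and the route-type 5 matches should be empty.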