
Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment
Closed, Resolved (Public)

Description

Plan

The goal is to clear all BGP EVPN sessions on ssw1-d1-eqiad to force new vxlan tunnel establishment, which works around the problem we've had with some switches not correctly relaying DHCP packets.

By draining the spine switch we can do this work without disrupting comms to any servers, so we will not need to organise a maintenance window for each switch with the relevant SRE teams.

Date/Time

Planning to do this work at 10:00 UTC on Tuesday, March 17th.

Process
  • Set cr2 to be VRRP master for all vlans
    • This ensures row a/b hosts send traffic to cr2, which will then route it to ssw1-d8
    • It also ensures that only ssw1-d8 learns the VRRP GW MAC for the row c/d vlans, so leaf switches will not see a route to it from ssw1-d1
  • Disable VRRP for row-wide vlan sub-interfaces of cr1-eqiad et-1/0/5 - P89818
    • This is needed as we don't want to create a VRRP "split brain" scenario (P89795)
  • Disable the EVPN IBGP peering between ssw1-d1 and ssw1-d8:
    • ssw1-d1: set / network-instance default protocols bgp neighbor 10.64.128.18 admin-state disable
    • This ensures that ssw1-d8 does not reflect routes from ssw1-d1 to the leafs
    • As a result, clearing the BGP session between a leaf and ssw1-d1 will remove all routes on that leaf that use ssw1-d1 as next-hop
  • Increase the OSPF cost on the far-side of all transport links terminating on cr1
    • This ensures traffic from other sites to row c/d vlans instead arrives on cr2, and takes the path out via ssw1-d8
  • Adjust the ssw1-d1 BGP config to not accept or announce any routes to cr1 or other row e/f spines
    • By changing the import/export policies to 'NONE' - P89816
  • Adjust the cr1 BGP policy for row e/f and cloudsw to not export directly connected routes
    • cr1-eqiad: delete policy-options policy-statement Switch_out term direct
    • This ensures no L3 switches will use cr1 to get to row c/d vlans, instead they will use cr2 uplink

Result

At this point we should be able to observe the graphs and see traffic reduced to zero on the cr1 -> ssw1-d1 link, for the following reasons:

  • Traffic from rows a/b will use cr2 as gateway, due to VRRP, and cr2 will use its link to ssw1-d8
  • Traffic from rows e/f will use cr2 to get to rows c/d, as we stopped exporting "direct" routes from cr1 in BGP
  • Traffic from remote sites will route to cr2 over the WAN
  • Traffic to c/d per-rack vlans will route to cr2, as cr1 no longer receives them in BGP due to the policy change
  • Traffic for CR IP gateways will route to ssw1-d8 from every leaf, as that is where the VRRP MAC is learnt
  • Traffic to external IP destinations from c/d per-rack vlans will route to ssw1-d8 from the leafs, as those ranges are not being accepted by ssw1-d1 in BGP
  • Traffic between c/d row-wide vlans will use cr2 as gateway, which will hairpin it back down through its link to ssw1-d8

We will still have a vxlan tunnel to ssw1-d1 on every leaf, but this should only be due to the unicast MAC addresses learnt on that spine from the CRs. We should check these are the only routes known with the spine next-hop:

# NOTE: spacing for the grep might be different based on SRL column widths, we want to grep for the IP in the 'next-hop' column
show network-instance default protocols bgp routes evpn route-type 2 summary | grep "| 0      | 10.64.128.17 "
show network-instance default protocols bgp routes evpn route-type 5 summary | grep "| 0      | 10.64.128.17 "
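
Since the grep above depends on exact column spacing, a spacing-independent check may be easier to reuse. The following is a minimal sketch, assuming the summary output is a pipe-delimited table and that the next-hop appears as a whole cell in each row. The function name and the sample rows are hypothetical and only illustrate the matching logic; they are not real SR Linux output.

```python
def rows_with_next_hop(table_text: str, next_hop: str) -> list[list[str]]:
    """Return table rows (as lists of stripped cells) that have next_hop as a whole cell."""
    matches = []
    for line in table_text.splitlines():
        if "|" not in line:
            continue  # skip borders, headers without cells, and blank lines
        # Drop the outer pipes, split into cells, and ignore padding.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if next_hop in cells:  # whole-cell match, so 10.64.128.170 does not match
            matches.append(cells)
    return matches


# Hypothetical sample with made-up columns and uneven spacing.
sample = """
+-----+----+----------------+
| Tag | ID | Next-hop       |
| 0   | 1  | 10.64.128.17   |
|   0 |  2 |   10.64.128.18 |
| 0   | 3  | 10.64.128.170  |
"""

for row in rows_with_next_hop(sample, "10.64.128.17"):
    print(row)
```

The saved output of either show command could be passed in as `table_text`; any row returned that is not one of the expected unicast MAC routes learnt from the CRs would need a closer look.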

Event Timeline

cmooney triaged this task as Medium priority.

Mentioned in SAL (#wikimedia-operations) [2026-03-17T09:06:08Z] <topranks> increase VRRP priority on eqiad vlans on CR2 to shift active gateway to cr2-eqiad T420180

Mentioned in SAL (#wikimedia-sre) [2026-03-17T10:05:20Z] <topranks> disabling VRRP for et-1/0/5 sub-interfaces on cr1-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T10:25:34Z] <topranks> disable EVPN IBGP peering between ssw1-d1-eqiad and ssw1-d8-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T10:29:26Z] <topranks> stop announcing directly connected routes to L3 switches from cr1-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T10:42:18Z] <topranks> cease announcing routed networks from ssw1-d1-eqiad to cr1-eqiad in BGP T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T10:58:09Z] <topranks> prepend external BGP announcements from cr1-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T11:24:40Z] <topranks> reduce local-preference for BGP routes learnt from servers on cr1-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T11:39:09Z] <topranks> stop accepting external routes on ssw1-d1-eqiad from cr1-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T11:53:22Z] <topranks> reset BGP session to ssw1-d1-eiqad from lsw1-d1-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T11:54:23Z] <topranks> reset BGP session to ssw1-d1-eiqad from lsw1-d3-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T11:56:22Z] <topranks> reset BGP session to ssw1-d1-eiqad from lsw1-c2-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T11:58:35Z] <topranks> reset BGP session to ssw1-d1-eiqad from lsw1-c3-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T11:59:16Z] <topranks> reset BGP session to ssw1-d1-eiqad from lsw1-c4-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T12:00:03Z] <topranks> reset BGP session to ssw1-d1-eiqad from lsw1-c6-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T12:02:59Z] <topranks> reset BGP session to ssw1-d1-eiqad from lsw1-c7-eqiad T420180

Mentioned in SAL (#wikimedia-operations) [2026-03-17T12:13:01Z] <topranks> restart BGP announcements from ssw1-d1-eqiad following change T420180

Work is all complete. BGP sessions to ssw1-d1-eqiad were reset on the switches below, all of which had tunnels with ID 1 towards it. No packet loss to servers was detected:

lsw1-c2-eqiad
lsw1-c3-eqiad
lsw1-c4-eqiad
lsw1-c6-eqiad
lsw1-c7-eqiad
lsw1-d1-eqiad
lsw1-d3-eqiad
lsw1-d8-eqiad

There were a few steps I had to complete that were not listed in the description:

  1. Needed to set graceful-shutdown sender for our external internet peerings, and also prepend our AS on outbound announcements
    1. This is because traffic for the public1-c-eqiad and public1-d-eqiad vlans was arriving on cr1-eqiad and using its direct link
  2. Needed to set graceful-shutdown sender for host BGP peer groups on cr1-eqiad
    1. This lowers the local-pref for routes learnt from hosts on cr1, so it will instead use the path via cr2 for those destinations
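
To illustrate why both extra steps move traffic away from cr1, here is a heavily simplified sketch of BGP path selection that compares only local-pref and then AS-path length. Real implementations apply more tie-breakers, and the `Path` type, the `best_path` helper and all values below are hypothetical rather than taken from the router configs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    via: str          # which router or link the path uses
    local_pref: int   # higher is preferred
    as_path_len: int  # shorter is preferred when local-pref ties


def best_path(paths: list[Path]) -> Path:
    # Simplified: highest local-pref wins, then shortest AS path.
    return max(paths, key=lambda p: (p.local_pref, -p.as_path_len))


# Step 2: on cr1, the route learnt directly from a host gets a lowered
# local-pref, so the same prefix learnt via cr2 is preferred (made-up values).
host_route = [Path("cr1 direct", local_pref=0, as_path_len=1),
              Path("via cr2", local_pref=100, as_path_len=1)]
print(best_path(host_route).via)

# Step 1: an external peer with equal local-pref on both paths prefers
# the announcement without the prepended AS hops (made-up values).
external_view = [Path("cr1", local_pref=100, as_path_len=4),
                 Path("cr2", local_pref=100, as_path_len=1)]
print(best_path(external_view).via)
```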

I'll add those steps to the task for the ssw1-d8-eqiad drain.

Mentioned in SAL (#wikimedia-operations) [2026-03-17T15:02:40Z] <topranks> reset BGP session to ssw1-d8-eiqad from lsw1-d4-eqiad T420180