
Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans
Closed, Resolved
Public

Description

Creating this task to track the work of bringing the new EVPN switches live in codfw row A/B, and making the Spines IP gateway for the existing subnets.

As discussed in the parent task we need to move the links from the row A/B ASWs as follows:

| Rack | ASW Side (stays same) | Existing Port | New Port |
| --- | --- | --- | --- |
| A8 | asw-a7-codfw et-7/0/52 | cr2-codfw et-1/0/0 | ssw1-a8-codfw et-0/0/29 |
| A8 | asw-b7-codfw et-7/0/52 | cr2-codfw et-1/0/3 | ssw1-a8-codfw et-0/0/30 |
| A1 | asw-a2-codfw et-2/0/52 | cr1-codfw et-1/0/0 | ssw1-a1-codfw et-0/0/29 |
| A1 | asw-b2-codfw et-2/0/51 | cr1-codfw et-1/0/3 | ssw1-a1-codfw et-0/0/30 |

We need to do these one at a time, in co-ordination between netops and dc-ops. At a high level the plan will be:

  1. Move links in rack A8/A1 from CR to SSW
    • This brings up new trunk from existing asw vc to the spine, connecting new switches to existing vlans
    • We bridge the existing vlans through to the CR which continues to act as IP GW initially
  2. Migrate IP GW from CRs to SSW
    • This ensures traffic arriving from the asw to ssw is optimally forwarded
    • Without it the CR VRRP config causes traffic hitting ssw1-a8-codfw for the VRRP MAC to route via a LEAF to ssw1-a1-codfw to get to the active CR
    • We keep the CR sub-interfaces connecting to the legacy vlans, so that BGP next-hops are directly reachable

The basic trick when moving the GW IP is to leave VRRP running on the CRs after the first cable move, but change the virtual IP configured. This keeps the VRRP MAC operational, ensuring that traffic from hosts with the old IP<->MAC binding still cached in their ARP/ND tables will be forwarded by the CRs.
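The reason this works can be sketched in a few lines of Python. It assumes the CRs use the standard VRRP virtual MAC format (00:00:5e:00:01:XX for IPv4 and 00:00:5e:00:02:XX for IPv6, where XX is the VRID), which depends only on the VRID and not on the virtual IP. The VRID and addresses below are hypothetical and are not taken from the codfw config.

```python
def vrrp_virtual_mac(vrid: int, ipv6: bool = False) -> str:
    """Return the standard VRRP virtual router MAC for a given VRID."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be in the range 1..255")
    # Fifth octet selects the address family: 01 for IPv4, 02 for IPv6.
    family_octet = 0x02 if ipv6 else 0x01
    return f"00:00:5e:00:{family_octet:02x}:{vrid:02x}"


# Hypothetical values, for illustration only.
vrid = 1
before = {"vip": "192.0.2.1", "mac": vrrp_virtual_mac(vrid)}
after = {"vip": "192.0.2.254", "mac": vrrp_virtual_mac(vrid)}

# The virtual IP changed but the MAC did not, so a host that still has
# 192.0.2.1 cached against the old MAC keeps sending frames the CR accepts.
print(before["mac"] == after["mac"])
```

Because the MAC is a function of the VRID alone, changing the virtual IP on the VRRP group leaves the MAC that hosts have cached valid on the CRs.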

To support this we need to change the SSW link to CRs to a layer-2 trunk on the SSW side, and BGP peer to the CRs from an IRB interface over an xlink vlan. This will allow us to trunk the existing Vlans to the CRs on the same link, so that the CRs can retain a direct connection to these networks. This ensures the CRs can still route to VIPs announced by end-hosts in BGP, which currently peer with the CR loopbacks. Without a direct leg in the Vlan the peering would work, but the routing would break as the EVPN switches do not know those routes.

Event Timeline

cmooney triaged this task as Medium priority. Sep 22 2023, 3:48 PM
cmooney created this task.

Change 960109 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Temporarily adjust EVPN outbound policy to CRs to block existing nets

https://gerrit.wikimedia.org/r/960109

Change 960109 merged by jenkins-bot:

[operations/homer/public@master] Temporarily adjust EVPN outbound policy to CRs to block existing nets

https://gerrit.wikimedia.org/r/960109

Change 961927 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add automation to define ESI-LAGs on EVPN switches

https://gerrit.wikimedia.org/r/961927

Change 961927 merged by jenkins-bot:

[operations/homer/public@master] Add automation to define ESI-LAGs on EVPN switches

https://gerrit.wikimedia.org/r/961927

@Papaul I've moved the google meet for this to the week after - Oct 17th. There are a few other moving parts in the overall plan I want to fully plan before we go ahead.

Discussed with @Papaul and we will do this work on Thursday at 11.30am CDT / 16:30 UTC. There shouldn't be any interruption to connectivity, but as it's a big / delicate change we need to keep a close eye on everything.

(@Papaul I moved the time forward to 12.00 to not conflict with meetings, ping me if that doesn't work).

@Papaul hoping to tackle these in this order, want to do both row A links first, then the row B links.

| Order | ASW Side (stays same) | Rack | Existing Port | New Port |
| --- | --- | --- | --- | --- |
| 1 | asw-a7-codfw et-7/0/52 | A8 | cr2-codfw et-1/0/0 | ssw1-a8-codfw et-0/0/29 |
| 2 | asw-a2-codfw et-2/0/52 | A1 | cr1-codfw et-1/0/0 | ssw1-a1-codfw et-0/0/29 |
| 3 | asw-b7-codfw et-7/0/52 | A8 | cr2-codfw et-1/0/3 | ssw1-a8-codfw et-0/0/30 |
| 4 | asw-b2-codfw et-2/0/51 | A1 | cr1-codfw et-1/0/3 | ssw1-a1-codfw et-0/0/30 |
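As a rough illustration, the ordered moves above can be held as data and sanity-checked before the window, for example to confirm that no port appears twice. This is only a sketch: the `LinkMove` class and the `validate` and `checklist` helpers are hypothetical names, not part of Homer or any existing tooling, and the port values are copied from the table as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkMove:
    order: int
    asw_port: str   # stays the same
    rack: str
    old_port: str   # existing CR port
    new_port: str   # new SSW port


MOVES = [
    LinkMove(1, "asw-a7-codfw et-7/0/52", "A8", "cr2-codfw et-1/0/0", "ssw1-a8-codfw et-0/0/29"),
    LinkMove(2, "asw-a2-codfw et-2/0/52", "A1", "cr1-codfw et-1/0/0", "ssw1-a1-codfw et-0/0/29"),
    LinkMove(3, "asw-b7-codfw et-7/0/52", "A8", "cr2-codfw et-1/0/3", "ssw1-a8-codfw et-0/0/30"),
    LinkMove(4, "asw-b2-codfw et-2/0/51", "A1", "cr1-codfw et-1/0/3", "ssw1-a1-codfw et-0/0/30"),
]


def validate(moves: list[LinkMove]) -> None:
    """Raise ValueError if the order is not 1..N or any port is listed twice."""
    if sorted(m.order for m in moves) != list(range(1, len(moves) + 1)):
        raise ValueError("order values must be 1..N with no gaps or repeats")
    for field in ("asw_port", "old_port", "new_port"):
        values = [getattr(m, field) for m in moves]
        if len(set(values)) != len(values):
            raise ValueError(f"duplicate value in column {field}")


def checklist(moves: list[LinkMove]) -> list[str]:
    """Return one human-readable line per move, in execution order."""
    validate(moves)
    return [
        f"{m.order}. Rack {m.rack}: move {m.asw_port} from {m.old_port} to {m.new_port}"
        for m in sorted(moves, key=lambda m: m.order)
    ]


for line in checklist(MOVES):
    print(line)
```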

Change 971197 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Move mr1-codfw OSPF interface to et-1/0/0 on CRs after migration

https://gerrit.wikimedia.org/r/971197

Row A Steps Detail: P53131

Row B Steps Detail: P53132

Icinga downtime and Alertmanager silence (ID=0a8384b5-aa0d-44df-bf5c-aa9e191ed730) set by cmooney@cumin1001 for 2:00:00 on 13 host(s) and their services with reason: Move row A/B CR uplinks to SPINE switches

asw-a-codfw,asw-b-codfw,cr[1-2]-codfw,cr[1-2]-codfw IPv6,mr1-codfw,mr1-codfw IPv6,mr1-codfw.oob,re0.cr[1-2]-codfw.mgmt,ripe-atlas-codfw,ripe-atlas-codfw IPv6

Mentioned in SAL (#wikimedia-operations) [2023-11-02T17:06:44Z] <topranks> shutting down uplink from asw-a-codfw et-7/0/52 to cr2-codfw et-1/0/0 (T347191)

Change 971197 merged by jenkins-bot:

[operations/homer/public@master] Move mr1-codfw OSPF interface to et-1/1/5 on CRs after migration

https://gerrit.wikimedia.org/r/971197

Change 971257 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Fix error in interface for MR1 uplinks

https://gerrit.wikimedia.org/r/971257

Change 971257 merged by jenkins-bot:

[operations/homer/public@master] Fix error in interface for MR1 uplinks

https://gerrit.wikimedia.org/r/971257

Mentioned in SAL (#wikimedia-operations) [2023-11-02T17:45:33Z] <topranks> Moving row A outbound traffic from direct CR link to routing via Spinie (T347191)

Mentioned in SAL (#wikimedia-operations) [2023-11-02T17:50:37Z] <topranks> Shutting asw-a-codfw uplink to cr1-codfw down in advance of cable move (T347191)

Mentioned in SAL (#wikimedia-operations) [2023-11-02T18:07:46Z] <topranks> Making cr1-codfw VRRP Master for row A traffic again on ssw1-a1-codfw interface (T347191)

Change 971263 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Change definitions for OSPF stub interfaces codfw CRs

https://gerrit.wikimedia.org/r/971263

Change 971263 merged by jenkins-bot:

[operations/homer/public@master] Change definitions for OSPF stub interfaces codfw CRs

https://gerrit.wikimedia.org/r/971263

Change 971264 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Enable DHCP relay on et-1/1/5 subints following switch move

https://gerrit.wikimedia.org/r/971264

Mentioned in SAL (#wikimedia-operations) [2023-11-02T18:21:03Z] <topranks> Shutting asw-b-codfw uplink to cr2-codfw down in advance of cable move (T347191)

Mentioned in SAL (#wikimedia-operations) [2023-11-02T18:44:29Z] <topranks> Making cr2-codfw VRRP Master for row B traffic over new link from ssw1-a8-codfw (T347191)

Mentioned in SAL (#wikimedia-operations) [2023-11-02T18:46:08Z] <topranks> shutting down uplink from asw-b-codfw et-2/0/51 to cr1-codfw in advance of cable move (T347191)

Change 971271 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Change OSPF stub ints and dhcp relay on CRs for codfw row B

https://gerrit.wikimedia.org/r/971271

Change 971273 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Make cr1-codfw VRRP primary for rows A and B after link move

https://gerrit.wikimedia.org/r/971273

Change 971264 merged by jenkins-bot:

[operations/homer/public@master] Enable DHCP relay on et-1/1/5 subints following switch move

https://gerrit.wikimedia.org/r/971264

Change 971271 merged by jenkins-bot:

[operations/homer/public@master] Change OSPF stub ints and dhcp relay on CRs for codfw row B

https://gerrit.wikimedia.org/r/971271

Change 971273 merged by jenkins-bot:

[operations/homer/public@master] Make cr1-codfw VRRP primary for rows A and B after link move

https://gerrit.wikimedia.org/r/971273

Change 971276 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Remove temp filter stopping codfw ssw's announcing row a/b networks

https://gerrit.wikimedia.org/r/971276

Change 971276 merged by jenkins-bot:

[operations/homer/public@master] Remove temp filter stopping codfw ssw's announcing row a/b networks

https://gerrit.wikimedia.org/r/971276

Mentioned in SAL (#wikimedia-operations) [2023-11-03T14:40:46Z] <topranks> adding irb interface in private1-a-codfw vlan to ssw1-a1-codfw T347191

Mentioned in SAL (#wikimedia-operations) [2023-11-03T14:50:05Z] <topranks> moving cr1-codfw <-> ssw1-a1-codfw EBGP session to private1-b-codfw IPs T347191

Mentioned in SAL (#wikimedia-operations) [2023-11-09T20:12:05Z] <cmooney@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on asw-a-codfw,ssw1-a8-codfw,ssw1-a8-codfw.mgmt with reason: Adjust vlans trunked to asw-a-codfw from ssw1-a8-codfw T347191

Mentioned in SAL (#wikimedia-operations) [2023-11-09T20:12:19Z] <cmooney@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-a-codfw,ssw1-a8-codfw,ssw1-a8-codfw.mgmt with reason: Adjust vlans trunked to asw-a-codfw from ssw1-a8-codfw T347191

Mentioned in SAL (#wikimedia-operations) [2023-11-09T20:15:42Z] <topranks> resetting asw-a-codfw et-2/0/52 to shift traffic away from ssw1-a8-codfw (T347191)

Change 973239 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Homer YAML changes to support move of sandbox-a-codfw vlan to ssw's

https://gerrit.wikimedia.org/r/973239

Change 973239 merged by jenkins-bot:

[operations/homer/public@master] Homer YAML changes to support move of sandbox-a-codfw vlan to ssw's

https://gerrit.wikimedia.org/r/973239

Mentioned in SAL (#wikimedia-operations) [2023-11-16T20:41:20Z] <topranks> adding anycast GW for public1-b-codfw vlan to codfw spine switches (T347191)

Mentioned in SAL (#wikimedia-operations) [2023-11-16T20:54:17Z] <topranks> changing VRRP GW IP for public1-b-codfw on codfw CRs and disabling IPv6 RAs on the CRs (T347191)

Mentioned in SAL (#wikimedia-operations) [2023-11-16T21:42:59Z] <topranks> Removing VRRP config for for public1-b-codfw on codfw CRs (T347191)

Icinga downtime and Alertmanager silence (ID=c937612c-c0eb-4c9e-a245-9810a56c0a33) set by cmooney@cumin1001 for 1:00:00 on 4 host(s) and their services with reason: Move public1-a-codfw vlan GW from codfw CR routers to ssw

cr[1-2]-codfw,cr[1-2]-codfw IPv6

Mentioned in SAL (#wikimedia-operations) [2023-11-16T23:30:26Z] <topranks> Add gateway IP for public1-a-codfw Vlan to ssw in codfw T347191

Mentioned in SAL (#wikimedia-operations) [2023-11-16T23:33:22Z] <topranks> Change VRRP IP for public1-a-codfw vlan on codfw CRs T347191

Change 975101 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Change router advertisement template to set description correctly

https://gerrit.wikimedia.org/r/975101

Change 975101 merged by jenkins-bot:

[operations/homer/public@master] Change router advertisement template to set description correctly

https://gerrit.wikimedia.org/r/975101

Change 975102 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Remove DHCP relay config for codfw row a/b public vlans

https://gerrit.wikimedia.org/r/975102

Change 975102 merged by jenkins-bot:

[operations/homer/public@master] Remove DHCP relay config for codfw row a/b public vlans

https://gerrit.wikimedia.org/r/975102

cmooney updated the task description.

The gateways for public1-a-codfw and public1-b-codfw have been migrated to the new setup.

Problems

Unfortunately the migration caused some momentary blips to traffic on each vlan as it was migrated.

  • Some hosts became unreachable over IPv4 when the VRRP VIP was changed on the CRs
    • Forcing an arp-cache clear on the hosts fixed this
    • It's unclear why forwarding stopped; on most hosts forwarding continued even with the old ARP entry
  • BFD to the Anycast hosts went down, which in turn tore down BGP sessions
    • The quick way to resolve was a "restart all" from birdc on the affected hosts
      • Clearing the session from the CR side did not force it back up
      • Doing so on the server side immediately brought everything back up and working fine
      • There appears to be some odd bird issue here

Lessons

In terms of proceeding, the private vlans have many more hosts, so we need to be careful. We will need to work with the other SRE teams to arrange a window to complete the work. Some lessons learnt for next time include:

  • Do the IPv6 changes first
    • Due to use of RAs the "old" and "new" default route can co-exist (different next-hops) and expire gracefully
    • So enable the new anycast gw for the IPv6 IP and enable RAs on the irb, then disable RAs on the CRs
    • This ensures we always have connectivity to the end hosts over IPv6 at least
  • Depool LVS on the affected vlans in advance
    • This ensures we have no interruption of traffic to VIPs
    • Direct-return from realservers in the affected rows may still take a brief hit
  • Issue "arp -d" for the gateway IP to all servers on the vlan once the VRRP IPs are changed on the CRs
  • The only server using BFD to the CRs on the private vlans is centrallog2002
    • Do a "restart all" in birdc on this server once the VRRP IPs have been changed
  • Do the IP changes manually on the devices rather than using Homer, to reduce time variance between changes
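The lessons above can be turned into an ordered checklist for one vlan. The sketch below is an assumption about how that ordering might look, not the documented procedure: `gateway_migration_steps` is a hypothetical helper, the gateway IP is a placeholder from the documentation range, and only the `arp -d` and birdc "restart all" commands come from the notes above.

```python
def gateway_migration_steps(vlan: str, gateway_ip: str, bfd_hosts: list[str]) -> list[str]:
    """Build an ordered checklist for moving one vlan's gateway from the CRs to the SSWs."""
    steps = [
        # Depool in advance so traffic to VIPs is not interrupted.
        f"Depool LVS on {vlan}",
        # IPv6 first: old and new default routes learned via RA can co-exist.
        f"IPv6: enable the anycast GW and RAs on the ssw irb for {vlan}, then disable RAs on the CRs",
        f"IPv4: add the anycast GW {gateway_ip} on the ssw for {vlan}",
        # Manual changes reduce the time variance between devices.
        f"IPv4: change the VRRP virtual IP for {vlan} on the CRs (manually, not via Homer)",
        f"On every server in {vlan}: arp -d {gateway_ip}",
    ]
    # Hosts running BFD to the CRs need their bird sessions restarted.
    steps += [f"On {host}: birdc restart all" for host in bfd_hosts]
    return steps


# Placeholder gateway IP; the vlan and host names are taken from the task.
for number, step in enumerate(
    gateway_migration_steps("private1-a-codfw", "192.0.2.1", ["centrallog2002"]), start=1
):
    print(f"{number}. {step}")
```

Keeping the steps as a plain list makes it easy to review the order with the other SRE teams before the window.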

Closing this task, everything now completed. For future rows we can base the plan on the steps outlined here:

https://wikitech.wikimedia.org/wiki/Migrate_from_VC_switch_stack_to_EVPN