
Move IP gateways for codfw row C/D vlans to EVPN Anycast GW
Closed, Resolved, Public

Description

As part of the codfw row C/D switch upgrade/migration we need to move the IP gateway for the vlans in those rows from the core routers (w/VRRP) to the new L3 switches (using EVPN Anycast GW).

This can be completed at any stage after T366941 (Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches) is done.

For IPv6 the process is much easier, as we use router advertisements (RAs) to control which lladdr / MAC hosts use as their gateway. The beauty of this is that both the CRs and the switches can send RAs at the same time; hosts will use one or the other, and things still work. Obviously we don't want a long overlap, but it means we can do things gracefully. Once we start sending RAs from the Spines we can disable advertisements on the CRs, and the servers will transition to the Spine GW when the lifetime of the last RA they received from the CRs expires.
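To illustrate the overlap, here is a minimal Python sketch of a host's default-router list during the transition. It is a simplified model, not how any particular host stack is implemented. The link-local addresses, the 600 second router lifetime and the 200 second RA interval are assumptions chosen for the example, not values from our configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Tracks default routers learned from RAs, keyed by link-local address."""
    routers: dict = field(default_factory=dict)  # lladdr -> expiry time (seconds)

    def receive_ra(self, lladdr: str, now: int, lifetime: int) -> None:
        # A router lifetime of 0 means "stop using me as a default router".
        if lifetime == 0:
            self.routers.pop(lladdr, None)
        else:
            self.routers[lladdr] = now + lifetime

    def default_routers(self, now: int) -> set:
        # Only routers whose lifetime has not yet expired are usable.
        return {r for r, expiry in self.routers.items() if expiry > now}


CR = "fe80::c"      # hypothetical link-local address of the CR gateway
SPINE = "fe80::5"   # hypothetical link-local address of the Spine anycast GW
LIFETIME = 600      # assumed router lifetime carried in each RA (seconds)
INTERVAL = 200      # assumed interval between RAs (seconds)


def simulate(cr_stops_at: int, spine_starts_at: int, end: int) -> dict:
    """Return the host's usable default routers at each RA interval."""
    host = Host()
    seen = {}
    for now in range(0, end + 1, INTERVAL):
        if now < cr_stops_at:
            host.receive_ra(CR, now, LIFETIME)
        if now >= spine_starts_at:
            host.receive_ra(SPINE, now, LIFETIME)
        seen[now] = host.default_routers(now)
    return seen


# Spines start sending RAs at t=400, CRs stop at t=800.
timeline = simulate(cr_stops_at=800, spine_starts_at=400, end=1600)
```

In this model the host always has at least one usable default router: it holds both during the overlap and is left with only the Spine entry once the last CR advertisement ages out.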

For IPv4 the situation is trickier, as hosts resolve the MAC for their gateway using ARP and cache it. If there is an overlap where both the new switches and the CRs have the same IP configured, hosts will randomly get one or the other back in an ARP response. This is unlike the v6 situation, where they receive and process the RAs from both the CRs and the Spines.
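The ARP problem can be sketched with a toy model in Python. This is only an illustration: the gateway IP and the Spine MAC are hypothetical, the CR MAC simply follows the VRRP virtual MAC format, and the "whichever reply wins" behaviour is modelled as a random choice rather than real packet timing.

```python
import random

GW_IP = "10.0.0.1"                # hypothetical gateway IP shared by both devices
CR_MAC = "00:00:5e:00:01:01"      # VRRP virtual MAC format (VRID 1), for illustration
SPINE_MAC = "02:00:00:00:00:01"   # hypothetical anycast gateway MAC


def resolve_gateway(responders, rng):
    """Return the MAC a host caches after one ARP request for GW_IP.

    Every device that owns the IP answers; which reply the host ends up
    caching is modelled here as a random pick among the responders.
    """
    return rng.choice(responders)


def arp_caches(num_hosts, responders, seed=0):
    """Simulate the cached gateway MAC on num_hosts independent hosts."""
    rng = random.Random(seed)
    return [resolve_gateway(responders, rng) for _ in range(num_hosts)]


# During an overlap both the CRs and the Spines answer for the same IP.
overlap = arp_caches(200, [CR_MAC, SPINE_MAC])
# Once the IP lives only on the Spines there is a single possible answer.
after_move = arp_caches(200, [SPINE_MAC])
```

With two responders the fleet ends up split between the two MACs with no way to steer which host uses which, which is why an overlap is not an option for v4.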

The solution followed previously was a trick: add routes via cumin (which, thankfully, uses v6 to reach the hosts) to temporarily migrate hosts to a new GW IP that exists only on the Spines, then move the actual GW IP from the CRs:

https://wikitech.wikimedia.org/wiki/Migrate_from_VC_switch_stack_to_EVPN#Migrate_IP_Gateways
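As a rough sketch of why temporary routes can steer hosts away from the old gateway, the Python below models a host's routing decision with longest-prefix match. It assumes the temporary routes are more specific than the default route (two /1 prefixes are used here purely for illustration); the gateway addresses are hypothetical, and the linked wikitech page remains the reference for the actual procedure.

```python
import ipaddress


def next_hop(routes, dst):
    """Longest-prefix match: pick the most specific route that contains dst."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, gw) for net, gw in routes.items() if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]


REAL_GW = "10.0.0.1"     # hypothetical: the vlan's normal gateway IP (on the CRs)
TEMP_GW = "10.0.0.254"   # hypothetical: temporary gateway IP, only on the Spines

# Normal state: a single default route via the real gateway.
default_only = {ipaddress.ip_network("0.0.0.0/0"): REAL_GW}

# Migration state: two more-specific routes (assumed) cover all of IPv4
# and win over the default, so traffic leaves via the temporary gateway.
with_temp = dict(default_only)
with_temp[ipaddress.ip_network("0.0.0.0/1")] = TEMP_GW
with_temp[ipaddress.ip_network("128.0.0.0/1")] = TEMP_GW

# After the real GW IP has moved to the Spines, the temporary routes are
# removed and the host falls back to its unchanged default route.
restored = {net: gw for net, gw in with_temp.items() if gw != TEMP_GW}
```

While the temporary routes are in place the host no longer depends on the ARP entry for the real gateway IP, so that IP can be moved without the host seeing two owners for it. The default route itself is never edited on the host.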

Event Timeline

cmooney triaged this task as Medium priority.

Change #1055283 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Enable BGP between cr1-codfw and ssw1-d1-codfw

https://gerrit.wikimedia.org/r/1055283

Change #1055283 merged by jenkins-bot:

[operations/homer/public@master] Enable BGP between cr1-codfw and ssw1-d1-codfw

https://gerrit.wikimedia.org/r/1055283

Icinga downtime and Alertmanager silence (ID=d6a640fd-d19e-4aa8-930d-6c260b51a4c3) set by cmooney@cumin1002 for 3:00:00 on 4 host(s) and their services with reason: Migrate codfw row c and d IP GWs from CRs to Spines

ssw1-a[1,8]-codfw.mgmt,ssw1-d[1,8]-codfw.mgmt

Mentioned in SAL (#wikimedia-operations) [2024-07-18T19:34:26Z] <topranks> disable BGP between spine switches in rows A and row D prior to enabling IP GW (T369274)

Mentioned in SAL (#wikimedia-operations) [2024-07-18T19:37:14Z] <topranks> add IRB int on public1-c-codfw vlan to ssw1-d1-codfw and ssw1-d8-codfw T369274

Mentioned in SAL (#wikimedia-operations) [2024-07-18T20:04:49Z] <topranks> enabling IPv6 RA generation for public1-c-codfw on ssw1-d1-codfw and ssw1-d8-codfw T369274

Mentioned in SAL (#wikimedia-operations) [2024-07-18T20:49:57Z] <topranks> remove VRRP for public1-c-codfw vlan from cr1-codfw and cr2-codfw T369274

Mentioned in SAL (#wikimedia-operations) [2024-07-18T21:21:58Z] <topranks> enable IPv6 RA generation on ssw1-d1-codfw and ssw1-d8-codfw for public1-d-codfw vlan T369274

Mentioned in SAL (#wikimedia-operations) [2024-07-18T21:39:39Z] <topranks> disable IPv6 RA generation on cr1-codfw and cr2-codfw for public1-d-codfw vlan T369274

Mentioned in SAL (#wikimedia-operations) [2024-07-18T21:58:40Z] <topranks> remove VRRP group on cr1-codfw and cr2-codfw for public1-d-codfw vlan T369274

Mentioned in SAL (#wikimedia-operations) [2024-07-18T22:03:33Z] <topranks> move GW IPs for public1-d-codfw vlan to ssw1-d1-codfw and ssw1-d8-codfw T369274

Mentioned in SAL (#wikimedia-operations) [2024-07-18T22:33:42Z] <topranks> Disable IPv6 RA generation for private1-c-codfw vlan on cr1-codfw and cr2-codfw T369274

Mentioned in SAL (#wikimedia-operations) [2024-07-18T23:17:43Z] <topranks> enable IPv6 RA generation for private1-d-codfw vlan from ssw1-d1-codfw and ssw1-d8-codfw T369274

Mentioned in SAL (#wikimedia-operations) [2024-07-18T23:31:35Z] <topranks> disable IPv6 RA generation for private1-d-codfw vlan on cr1-codfw and cr2-codfw T369274

Mentioned in SAL (#wikimedia-operations) [2024-07-18T23:46:26Z] <topranks> move IP GW for vlan private1-d-codfw to ssw1-d1-codfw and ssw1-d8-codfw T369274

Change #1055309 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Disable IPv6 RAs in codfw for row c and d vlans on CRs

https://gerrit.wikimedia.org/r/1055309

Change #1055309 merged by jenkins-bot:

[operations/homer/public@master] Disable IPv6 RAs in codfw for row c and d vlans on CRs

https://gerrit.wikimedia.org/r/1055309

Mentioned in SAL (#wikimedia-operations) [2024-07-18T23:57:07Z] <topranks> re-enable ssw<->ssw bgp in codfw to move east-west traffic away from CRs T369274

Change #1055310 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Change list of OSPF stub interfaces for codfw CRs to match new handoff

https://gerrit.wikimedia.org/r/1055310

Change #1055310 merged by jenkins-bot:

[operations/homer/public@master] Change list of OSPF stub interfaces for codfw CRs to match new handoff

https://gerrit.wikimedia.org/r/1055310

Migration complete, no issues to report.

Change #1055458 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Support SSW performing DHCP relay for hosts connected to ASW

https://gerrit.wikimedia.org/r/1055458

Change #1055458 merged by jenkins-bot:

[operations/homer/public@master] Support SSW performing DHCP relay for hosts connected to ASW

https://gerrit.wikimedia.org/r/1055458