Migrate IP gateway for public1-a-codfw to spine switches
Closed, Resolved | Public

Description

To progress the wider migration from the old row-wide ASWs in codfw to the new EVPN-based devices, we need to move the IP gateway for existing vlans from CR routers to Spine switches.

Unfortunately, while interruption can be minimised, experience from moving the public vlans has shown that a minor interruption to comms for some servers is to be expected. Hosts will not experience an interruption on IPv6, but it's likely that some will have an interruption to IPv4 traffic of between 10 and 60 seconds.

Codfw is currently our primary site, but we can't wait until we switch back to complete the change. It probably does make sense to depool the site in DNS for front-end connections, however.

UPDATE: In the end we waited until after the DC switchover to complete this, as it was deemed too risky. It was completed without disruption, however; see the wikitech page below describing the approach taken:

https://wikitech.wikimedia.org/wiki/Migrate_from_VC_switch_stack_to_EVPN

Event Timeline

cmooney triaged this task as Medium priority. Nov 17 2023, 2:48 PM
cmooney created this task.
cmooney lowered the priority of this task from Medium to Low. Jan 19 2024, 12:35 PM

Going to delay this for now. We have enough disruptive changes planned not to burden wider SRE with this one in the next few weeks.

We do have some SPINE->LEAF->SPINE traffic right now, which is *not* good; however, it's all on 100G links via empty or almost-empty LEAF devices. As we move servers from asw to lsw this traffic pattern also disappears, as the LEAF will select the correct SPINE (the one connected to the VRRP active CR for the vlan).

At that point the only downside is traffic going LEAF->SPINE->CR->SPINE->LEAF for intra-subnet traffic (similar to how it has always been on ASW with CR as gateway). Moving the gateway to anycast on the LEAF devices would turn this into LEAF->SPINE->LEAF, or a turn-around within a leaf if local. So much better. But once servers are moved we can review whether it's worth it, based on how long moving hosts from row-wide to rack-specific vlans is taking. If we manage to re-ip some of the more "fragile" hosts the gateway move might be easier to execute, given the interruption will only be short.
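
For illustration only, the paths named in the comment above can be compared by counting the links each one crosses. This is a rough sketch, not a measurement: the device names are just the role labels used above, and the `link_count` helper and the path descriptions are hypothetical names introduced here.

```python
# Illustrative sketch: role labels only, not real hostnames or topology data.

def link_count(path: str) -> int:
    """Return the number of inter-device links in a path written as 'A->B->C'."""
    return len(path.split("->")) - 1


# Paths as described in the comment above (descriptions are assumptions).
paths = {
    "gateway on CR": "LEAF->SPINE->CR->SPINE->LEAF",
    "anycast gateway on LEAF, other leaf": "LEAF->SPINE->LEAF",
    "anycast gateway on LEAF, same leaf": "LEAF",
}

for description, path in paths.items():
    print(f"{description}: {path} crosses {link_count(path)} links")
```

Under these assumptions the anycast option halves the number of links crossed between leaves, and traffic that turns around within a single leaf crosses none.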

Mentioned in SAL (#wikimedia-operations) [2024-03-21T21:01:29Z] <topranks> adding routes to codfw row a hosts towards spine switch IPs on private1-a-codfw T351532

Mentioned in SAL (#wikimedia-operations) [2024-03-21T21:06:00Z] <topranks> deleting VRRP GW for 10.192.0.1 / private1-a-codfw from codfw core routers and adding to leaf switches row A T351532

cmooney updated the task description.