
Migrate row E/F network aggregation to dedicated Spine switches
Open, Medium, Public

Authored By: cmooney, Nov 11 2022, 5:27 PM

Description

Due to delayed delivery times, and then an issue with licensing, lsw1-e1-eqiad and lsw1-f1-eqiad are currently acting as the aggregation / Spine devices for Eqiad rows E and F. In simple terms, that means they connect upstream to the CR routers in the other cage, and downstream to the remaining top-of-rack switches in racks E2, E3, F2 and F3.

This setup got us around the immediate issue of capacity, but we now have the Spine devices prepped, and need to move to them to standardize our topology and ensure we have an aggregation layer that can scale up to support the remaining racks in the new cage.

Migration Plan

The current device cabling is as follows:

step0.png (461×643 px, 43 KB)

Step 1 - Bring Spines into fabric

Step 1 is to use the currently free QSFP28 ports on lsw1-e1-eqiad and lsw1-f1-eqiad to connect each of them to the new spine layer, enabling OSPF and BGP EVPN to bring them into the fabric.

Step 1.png (461×643 px, 52 KB)
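
As a rough illustration, the config on the leaf side would be along these lines. This is a sketch only: the addresses and BGP group name are placeholders rather than our production values, though the port numbers match the plan above.

```
# Underlay: new point-to-point L3 link towards the spine, in OSPF area 0.
set interfaces et-0/0/51 description "ssw1-e1-eqiad:et-0/0/80"
set interfaces et-0/0/51 unit 0 family inet address 192.0.2.0/31
set protocols ospf area 0.0.0.0 interface et-0/0/51.0 interface-type p2p

# Overlay: iBGP EVPN session between loopbacks to extend the fabric.
set protocols bgp group EVPN type internal
set protocols bgp group EVPN local-address 192.0.2.10
set protocols bgp group EVPN family evpn signaling
set protocols bgp group EVPN neighbor 192.0.2.11 description ssw1-e1-eqiad
```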

NOTE: This approach requires the purchase of 8 x 100GBASE-CWDM4 optic modules and 4 x LC-LC single-mode fiber patch cables. When the migration is complete we will free up the same number of optics and cables, which can be used to connect some of the next 8 racks when we bring them live.

Step 2 - Migrate CR uplinks to Spines

Step 2 is to move the uplinks to the CR routers from where they land now, on the Leaf devices in racks E1/F1, to the Spine devices in those racks. This should be possible without interruption, provided we move the links one at a time and test everything at each step.
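
For example, after moving each individual uplink, a quick health-check on the devices involved might look like this (illustrative Junos operational commands; the port here is a placeholder):

```
show ospf neighbor        # adjacency to the CR should return to Full
show bgp summary          # BGP/EVPN sessions Established, no flapping
show evpn database        # overlay MAC/IP entries still present
show interfaces et-0/0/51 extensive | match error    # no incrementing errors
```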

Step2.png (461×643 px, 46 KB)

Step 3 - Move remaining rack uplinks to Spines

This is really a multi-step operation. Basically the uplinks from leaf devices in racks E2, E3, F2 and F3 need to be moved from where they currently land (leaf switches in racks E1/F1) to the Spine switches in E1/F1.

Given the network topology/design the uplinks can be disabled one by one: move one to the Spine, then move the other once traffic is flowing via the Spine. OSPF costs can be temporarily adjusted to make this safe and to ensure we validate links before they carry real traffic.
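
A minimal sketch of that cost adjustment, assuming Junos-style config (the interface and metric value are placeholders):

```
# Make the old leaf-to-leaf uplink unattractive so traffic drains to the
# spine path before the cable is touched.
set protocols ospf area 0.0.0.0 interface et-0/0/53.0 metric 10000
commit confirmed 5
# ...verify traffic has shifted to the spine path, then confirm:
commit
```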

step3.png (461×649 px, 58 KB)

As a final task at this step we should move the uplinks from lsw1-e1-eqiad and lsw1-f1-eqiad towards ssw1-e1-eqiad from port 51 to port 54, to keep the numbering consistent. We can also remove the direct link between these two switches (on port et-0/0/52 on either side).

Step 4 - LVS Migration

Lastly we need to move the links from the 4 LVS load-balancers in rows A-D (lvs1017, lvs1018, lvs1019, lvs1020) from where they currently land, on the Leaf devices in E1/F1, to the Spine layer. The plan is to end up with them cabled as follows:

LVS_Direct_Extension_NEW_RACKS.png (744×1 px, 126 KB)

NOTE: In theory this could happen at step 3 instead, which would give slightly more optimal routing during the transition. But I think it is easier to treat this as the final step of the overall migration.

There are two main things to consider for this step:

10G Termination

One element we need to consider is how to terminate the 10GBASE-LR connections from these servers on the QSFP28 ports of the Spine devices.

The best way to proceed, it seems to me, is to use 4X10GE-LR QSFP+ modules on the Spines, which run as 4 individual 10G links, and use breakout cables to connect to the existing WMF-managed patch panels in the same racks.
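
On the QFX platform that means channelizing the relevant QSFP28 port, roughly as below (the port number is a placeholder; the four resulting 10G interfaces typically get a :0-:3 suffix):

```
# Run one QSFP+ breakout optic as 4 x 10G for the LVS handoffs.
set chassis fpc 0 pic 0 port 33 channel-speed 10g
# After commit the interfaces appear as xe-0/0/33:0 through xe-0/0/33:3.
```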

Optic Redundancy

Assuming we do that, the next question is whether we should land two LVS connections on a single module per switch. For instance, the links from lvs1017 and lvs1018, which currently land on separate 10G ports of lsw1-e1, could both terminate on the same QSFP+ optic when moved to ssw1-e1. That obviously saves money and Spine ports, but it may not be a good idea: a single failed QSFP+ module would take down both links. We might be better off using two QSFP+ modules and two ports for redundancy. I expect this point needs to be discussed with the Traffic team.

Event Timeline

cmooney triaged this task as Medium priority. Nov 11 2022, 5:27 PM
cmooney created this task.

Just to update on the LVS connections: after discussing with Brandon, I think it best that the links from all 4 LVS terminate on different optics, to maximize redundancy.

In terms of the hardware to use, after checking with John Clark, it seems we have enough 4X10GE-LR QSFP+ modules and breakout cables in stock already to support that, so we will use those.

We'll connect lvs1017 and lvs1019 to ssw1-e1, and lvs1018 and lvs1020 to ssw1-f1. Given traffic levels this split is optimal should a Spine fail. We'll need to do a cable swap at the Equinix patch panel in the new cage to achieve this, as the current link from lvs1019 goes to F1 and the one from lvs1018 goes to E1.

LVS_Direct_Extension_NEW_RACKS.png (744×1 px, 126 KB)

Mentioned in SAL (#wikimedia-operations) [2023-04-05T22:36:32Z] <topranks> enabling lsw1-e1-eqiad port et-0/0/51 to ssw1-e1-eqiad et-0/0/80 T322937

Change 906540 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add ssw1-e1-eqiad and ssw1-f1-eqiad to homer

https://gerrit.wikimedia.org/r/906540

Mentioned in SAL (#wikimedia-operations) [2023-04-06T14:40:43Z] <cmooney@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Sync data for new ssw1 spine switches in eqiad. - cmooney@cumin1001 - T322937"

Change 906540 merged by jenkins-bot:

[operations/homer/public@master] Add ssw1-e1-eqiad and ssw1-f1-eqiad to homer

https://gerrit.wikimedia.org/r/906540

Mentioned in SAL (#wikimedia-operations) [2023-04-06T14:42:39Z] <cmooney@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Sync data for new ssw1 spine switches in eqiad. - cmooney@cumin1001 - T322937"

Change 906627 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Puppet additions for ssw1-e1-eqiad and ssw1-f1-eqiad

https://gerrit.wikimedia.org/r/906627

Icinga downtime and Alertmanager silence (ID=e7d20917-1f70-4c85-bea4-4fae89694441) set by cmooney@cumin1001 for 0:30:00 on 1 host(s) and their services with reason: test on ssw1-e1-eqiad will take ospf on lsw1-e1-eqiad down.

lsw1-e1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=09fdc8d3-92d3-4c3b-8e46-8c1befa6a846) set by cmooney@cumin1001 for 0:30:00 on 1 host(s) and their services with reason: test on ssw1-e1-eqiad will take ospf on lsw1-f1-eqiad down.

lsw1-f1-eqiad.mgmt

Mentioned in SAL (#wikimedia-operations) [2023-04-06T20:43:44Z] <cmooney@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove info for new ssw as need to set back to planned to make homer happy - cmooney@cumin1001 - T322937"

Mentioned in SAL (#wikimedia-operations) [2023-04-06T20:44:59Z] <cmooney@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove info for new ssw as need to set back to planned to make homer happy - cmooney@cumin1001 - T322937"

Change 912836 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Only process global vlan list in Juniper config on frack switches

https://gerrit.wikimedia.org/r/912836

Change 912836 merged by jenkins-bot:

[operations/homer/public@master] Only process global vlan list in Juniper config on frack switches

https://gerrit.wikimedia.org/r/912836

Change 912838 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add 'default' for tenant.slug in vlans else statement

https://gerrit.wikimedia.org/r/912838

Change 912838 merged by jenkins-bot:

[operations/homer/public@master] Add 'default' for tenant.slug in vlans else statement

https://gerrit.wikimedia.org/r/912838

Change 912846 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Avoid creating EVPN import policy with default accept if no Vlans

https://gerrit.wikimedia.org/r/912846

Change 912846 merged by jenkins-bot:

[operations/homer/public@master] Avoid creating EVPN import policy with default accept if no Vlans

https://gerrit.wikimedia.org/r/912846

Icinga downtime and Alertmanager silence (ID=12105eb2-e5ac-4f19-9896-9ba53e1acd48) set by cmooney@cumin1001 for 0:30:00 on 1 host(s) and their services with reason: test on ssw1-e1-eqiad will take ospf on lsw1-e1-eqiad down.

lsw1-e1-eqiad.mgmt

Routing issue

I hit an issue with the new spines in that the overlay loopback address was not reachable when they were connected to the rest of the network.

After some time with JTAC, it seems the QFX platform cannot perform pure EVPN routing with type-5 routes unless there is a local IRB interface on the box in an "up" state. This is an unusual restriction local to the box: when it receives a VXLAN-encapsulated packet for the L3VNI, it won't route it if there is no "up" irb interface. The vlan associated with this irb does not need to be bound to an L2VNI, or be part of the EVPN database in any way. Juniper's explanation was:

The reason for this is that vxlan Type5 VNI need to be associated to Packet before giving to Kernel.
This vlan can be any random vlan and is used to get the reply from Kernel.

I couldn't find anything in Juniper's documentation on this restriction; it looks like a poor implementation to me, tbh. I guess type-5 was added to the spec after the L2 functionality, and they probably shoe-horned it in expecting there to always be a local L2 component.

New plan

Anyway, the upshot of that is that we can't proceed to step 2 as I had planned. What I will do instead is make plans to tackle step 4 (moving the LVS uplinks) first. This will give us a local trunk port with vlans on it, so we can create an irb interface and get around the restriction.
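
A minimal sketch of the workaround, assuming one of the row vlans is present on the new trunk (the vlan name/ID and address below are placeholders, not production values):

```
# An "up" irb satisfies the QFX requirement for routing EVPN type-5
# traffic. The vlan only needs a live member port; it does not need an
# L2VNI or any other EVPN configuration.
set vlans private1-f1 vlan-id 1100
set vlans private1-f1 l3-interface irb.1100
set interfaces irb unit 1100 family inet address 198.51.100.1/24
```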

I'll do some prep on the LVS move and then discuss with traffic about doing the move.

lvs1020 is currently the "secondary" LVS in eqiad, so I'd propose we start with that one if we can.

It's currently connected in rack F1, so I'd propose we move the b-end of the cable as follows:

Current device    Current port    New device       New port
lsw1-f1-eqiad     xe-0/0/47       ssw1-f1-eqiad    xe-0/0/33

I'll try to arrange the change with dc-ops; it should only require the cable move, and the link should come right back up. We need to configure the port on the new switch in advance, of course, along the lines of the sketch below.
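
Something like this would need to be on ssw1-f1-eqiad ahead of time (a sketch; the vlan names are placeholders for the real row F vlans):

```
# Pre-provision the LVS handoff port so the link comes straight back up
# after the cable swing.
set interfaces xe-0/0/33 description lvs1020
set interfaces xe-0/0/33 unit 0 family ethernet-switching interface-mode trunk
set interfaces xe-0/0/33 unit 0 family ethernet-switching vlan members [ private1-f1 public1-f1 ]
```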

Change 906627 abandoned by Cathal Mooney:

[operations/puppet@production] Puppet additions for ssw1-e1-eqiad and ssw1-f1-eqiad

Reason:

Need to bring these into mgmt one at a time as we enable the links due to JunOS constraint.

https://gerrit.wikimedia.org/r/906627

Change 921261 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Puppet additions to bring ssw1-f1-eqiad under management

https://gerrit.wikimedia.org/r/921261

Mentioned in SAL (#wikimedia-operations) [2023-05-19T13:26:04Z] <topranks> Adding vlan config for row e/f vlans on ssw1-f1-eqiad (T322937)

Icinga downtime and Alertmanager silence (ID=c4ef01af-e7d5-458f-ae46-17500f124165) set by cmooney@cumin1001 for 0:30:00 on 1 host(s) and their services with reason: Move lvs1020 handoff port to row e/f from lsw1-f1 to ssw1-f1

lvs1020.eqiad.wmnet

Change 921261 merged by Cathal Mooney:

[operations/puppet@production] Puppet additions to bring ssw1-f1-eqiad under management

https://gerrit.wikimedia.org/r/921261

The migration went fine today: a very quick move, and everything came up as expected. The EVPN MAC-move BGP signalling worked flawlessly, which was nice to see in action :)

I've added a comment on the cabling task to plan out the moves for the remaining 3 LVS servers, after which we'll tackle the CR uplinks and finally the Leaf <-> Spine links.

Thanks to everyone for their help so far.

Change 922508 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] depool eqiad (emergency patch, do not merge until required)

https://gerrit.wikimedia.org/r/922508

Change 922520 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Puppet additions for ssw1-e1-eqiad

https://gerrit.wikimedia.org/r/922520

Change 922520 merged by Cathal Mooney:

[operations/puppet@production] Puppet additions for ssw1-e1-eqiad

https://gerrit.wikimedia.org/r/922520

Mentioned in SAL (#wikimedia-operations) [2023-05-23T15:46:52Z] <sukhe@deploy1002> Locking from deployment [ALL REPOSITORIES]: LVS maintenance in eqiad, blocking deploys T322937

Mentioned in SAL (#wikimedia-operations) [2023-05-23T15:56:38Z] <topranks> moving lvs1018 connection to rack E1 from lsw1-e1-eqiad to ssw1-e1-eqiad T322937

Mentioned in SAL (#wikimedia-operations) [2023-05-23T16:22:54Z] <sukhe@deploy1002> Unlocked for deployment [ALL REPOSITORIES]: LVS maintenance in eqiad, blocking deploys T322937 (duration: 36m 02s)

Change 922508 abandoned by Ssingh:

[operations/dns@master] depool eqiad (emergency patch, do not merge until required)

Reason:

no longer required

https://gerrit.wikimedia.org/r/922508

Icinga downtime and Alertmanager silence (ID=03f7b2ab-bdea-4c56-ac41-3ec30004db4a) set by cmooney@cumin1001 for 0:30:00 on 2 host(s) and their services with reason: Migrate lsw1-e1-eqiad to cr1-eqiad link to ssw1-e1-eqiad

cr1-eqiad,lsw1-e1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=cf76e0ba-8648-48a0-beed-fe7b60f79656) set by cmooney@cumin1001 for 0:30:00 on 2 host(s) and their services with reason: Migrate lsw1-e1-eqiad to cr2-eqiad link to ssw1-e1-eqiad

cr2-eqiad,lsw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=8f44dd48-0cac-4bfd-907a-512dfa686d40) set by cmooney@cumin1001 for 0:30:00 on 2 host(s) and their services with reason: Migrate lsw1-e1-eqiad to cr1-eqiad link to ssw1-e1-eqiad

lsw1-e[1-2]-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=c43be552-7ced-4f58-99c1-a10b5984bf3a) set by cmooney@cumin1001 for 0:30:00 on 2 host(s) and their services with reason: Migrate lsw1-e2-eqiad uplink from lsw1-f1 to ssw1-f1

lsw1-e2-eqiad.mgmt,lsw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=37545969-c51e-450d-9ef0-5fadfd151520) set by cmooney@cumin1001 for 0:30:00 on 3 host(s) and their services with reason: Migrate lsw1-e3-eqiad uplinks to spine

lsw1-e[1,3]-eqiad.mgmt,lsw1-f1-eqiad.mgmt

Change 923387 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Move row E/F core router uplinks to Spine switches

https://gerrit.wikimedia.org/r/923387

Change 923387 merged by jenkins-bot:

[operations/homer/public@master] Move row E/F core router uplinks to Spine switches

https://gerrit.wikimedia.org/r/923387

Change 923395 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Adjust Eqiad row E/F switch parents in hierdata after cable moves

https://gerrit.wikimedia.org/r/923395

Step 2 (Migrate CR uplinks to Spines) has now been completed.

We are also 50% of the way through steps 3 and 4. Will continue with the remaining links when possible and close out the task.

Change 923395 merged by Cathal Mooney:

[operations/puppet@production] Adjust Eqiad row E/F switch parents in hierdata after cable moves

https://gerrit.wikimedia.org/r/923395

Mentioned in SAL (#wikimedia-operations) [2023-05-31T15:48:33Z] <brett@deploy1002> Locking from deployment [ALL REPOSITORIES]: LVS maintenance in eqiad, blocking deploys T322937

Mentioned in SAL (#wikimedia-operations) [2023-05-31T15:50:57Z] <brett@deploy1002> Unlocked for deployment [ALL REPOSITORIES]: LVS maintenance in eqiad, blocking deploys T322937 (duration: 02m 24s)

Mentioned in SAL (#wikimedia-operations) [2023-05-31T15:51:03Z] <brett@deploy1002> Locking from deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T322937

Mentioned in SAL (#wikimedia-operations) [2023-05-31T16:59:31Z] <brett@deploy1002> Locking from deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T322937

Mentioned in SAL (#wikimedia-operations) [2023-05-31T17:10:38Z] <brett@deploy1002> Unlocked for deployment [ALL REPOSITORIES]: LVS maintenance in codfw, blocking deploys T322937 (duration: 11m 07s)