
Codfw row A/B top-of-rack switch refresh
Closed, Resolved · Public

Description

New Juniper QFX5120-48Y top-of-rack switches have been delivered to codfw under T312138. These are intended to replace the existing switches in rows A and B in codfw, as part of a normal refresh cycle.

We have a current requirement for L2 Vlans that stretch across multiple racks (Ganeti, LVS), which for the current rows is achieved with Juniper's virtual-chassis feature. A different approach to this will be adopted on the replacement switches, using VXLAN/EVPN similar to the setup in Eqiad rows E and F. The plan involves deploying 2 QFX5120-32C Spine/aggregation switches to interconnect the top-of-racks in rows A and B, but these have not yet been delivered. Current estimate for those is August 2023.

In a meeting on Jan 24th 2023 Infra Foundations (Cathal, Arzhel, Riccardo) and DC-Ops (Papaul) agreed that there was not great urgency, and we should wait until the Spine switches are delivered and ready before we start moving anything. In other words no interim plan using only the 5120-48Y devices is being contemplated.

Physical Installation and Planning

Having the top-of-rack devices does allow us to get some things prepped in advance, however, and to plan the build/migration.

Spine Physical Location

The current plan is to install the two Spine switches, aggregating rows A and B, in racks A1 and A8. Those locations make the uplinks to the CR routers (also in those racks) easier, and were recommended by Papaul. It was provisionally decided to place the "future" Spines (which will aggregate replacement switches in rows C and D when the time comes) in racks D1 and D8, but that is not part of the current task and can be revisited later.

Rack location

Papaul expressed a preference to install the switches at the top of each rack. This makes it easier to move the switches in and out of the rack: they sit higher than the PDUs, so no power connectors get in the way.

Cables

Based on that we'll need 32 x single-mode LC-LC fibers, but I'm unsure of the exact lengths between all the racks:

| Rack 1 | Rack 2 | Description | Qty | Length | Cable ID |
|--------|--------|-------------|-----|--------|----------|
| A1 | A1 | Spine<->Leaf within rack | 1 | 3m | 230403800036 |
| A1 | A1 | Spine<->CR | 1 | 3m | 230403800035 |
| A1 | A2 | Spine<->Leaf | 1 | 5m | |
| A1 | A3 | Spine<->Leaf | 1 | 5m | |
| A1 | A4 | Spine<->Leaf | 1 | 5m | |
| A1 | A5 | Spine<->Leaf | 1 | 8m | |
| A1 | A6 | Spine<->Leaf | 1 | 8m | |
| A1 | A7 | Spine<->Leaf | 1 | 8m | |
| A1 | A8 | Spine<->Leaf (leaf in each has link to spine in other) | 2 | 8m | |
| A1 | B2 | Spine<->Leaf | 1 | 12m | 230403800009 |
| A1 | B3 | Spine<->Leaf | 1 | 12m | 230403800006 |
| A1 | B4 | Spine<->Leaf | 1 | 12m | 230403800001 |
| A1 | B5 | Spine<->Leaf | 1 | 12m | 230403800004 |
| A1 | B6 | Spine<->Leaf | 1 | 12m | 230403800002 |
| A1 | B7 | Spine<->Leaf | 1 | 12m | 230403800007 |
| A1 | B8 | Spine<->Leaf | 1 | 12m | 230403800005 |
| A8 | A1 | Spine<->Leaf | 1 | 8m | 230403800017 |
| A8 | A2 | Spine<->Leaf | 1 | 8m | 230403800026 |
| A8 | A3 | Spine<->Leaf | 1 | 8m | 230403800021 |
| A8 | A4 | Spine<->Leaf | 1 | 8m | 230403800024 |
| A8 | A5 | Spine<->Leaf | 1 | 5m | 230403800028 |
| A8 | A6 | Spine<->Leaf | 1 | 5m | 230403800032 |
| A8 | A7 | Spine<->Leaf | 1 | 5m | 230403800030 |
| A8 | A8 | Spine<->Leaf within rack | 1 | 3m | 230403800017 |
| A8 | A8 | Spine<->CR | 1 | 3m | 230403800040 |
| A8 | B2 | Spine<->Leaf | 1 | 12m | 230403800016 |
| A8 | B3 | Spine<->Leaf | 1 | 12m | 230403800003 |
| A8 | B4 | Spine<->Leaf | 1 | 12m | 230403800013 |
| A8 | B5 | Spine<->Leaf | 1 | 12m | 230403800015 |
| A8 | B6 | Spine<->Leaf | 1 | 12m | 230403800012 |
| A8 | B7 | Spine<->Leaf | 1 | 12m | 230403800008 |
| A8 | B8 | Spine<->Leaf | 1 | 12m | 230403800010 |

Transceiver

We will need some 1000Base-T SFP (100m) transceivers to connect 1G servers to the new switches. I have a total of 76 on site right now.

Zeroconf

The preference is to use Zeroconf to take care of the initial base config for the new switches. This will take some development time but we've not identified any blockers thus far.
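
To give a rough sense of the kind of development involved (this is only a sketch, not the actual implementation; the template content, hostname and loopback address below are illustrative assumptions), the base config could be rendered per-switch from a template and handed to the device during zero-touch provisioning:

```
# Hypothetical sketch: render a minimal base config for a new leaf switch to be
# served during zeroconf/ZTP. Template fields and values are illustrative only.
from jinja2 import Template

BASE_TEMPLATE = Template("""\
system {
    host-name {{ hostname }};
}
interfaces {
    lo0 {
        unit 0 { family inet { address {{ loopback }}/32; } }
    }
}
""")

def render_base_config(hostname: str, loopback: str) -> str:
    """Return an initial config blob for a freshly racked switch."""
    return BASE_TEMPLATE.render(hostname=hostname, loopback=loopback)

if __name__ == "__main__":
    # Example values only; real data would come from Netbox.
    print(render_base_config("lsw1-a2-codfw", "10.192.252.2"))
```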

Configuration

Most of the Juniper configuration for these devices is currently automated through Homer. The one element that needs to be added is the BGP EVPN neighbor configuration (see T327934).

IP Allocations / Netbox

One element we may want to improve on is IP allocation and device assignment in Netbox, as well as DNS zone generation. There are a lot of point-to-point links, new subnets, loopbacks, IRB interfaces etc. to be added across all 17 devices. For Eqiad rows E/F one-off scripting was used to generate some of this, but it may be worth developing more robust, re-usable scripts, as we'll likely need them again.
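
As a sketch of what more re-usable scripting could look like, here is a hedged example using the pynetbox client to bulk-allocate loopback addresses; the Netbox URL, prefix, device list, interface name and DNS zone below are placeholders, not our actual data:

```
# Illustrative sketch only: allocate loopback IPs for new switches in Netbox
# via pynetbox. URL, token, prefix, device and zone names are placeholders.
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="REDACTED")

LOOPBACK_PREFIX = "10.192.252.0/24"                 # assumed loopback range
NEW_SWITCHES = ["lsw1-a2-codfw", "lsw1-a3-codfw"]   # placeholder device list

prefix = nb.ipam.prefixes.get(prefix=LOOPBACK_PREFIX)
for name in NEW_SWITCHES:
    iface = nb.dcim.interfaces.get(device=name, name="lo0.0")
    # Ask Netbox for the next free address in the range and bind it to lo0.0.
    ip = prefix.available_ips.create({
        "assigned_object_type": "dcim.interface",
        "assigned_object_id": iface.id,
        "dns_name": f"{name}.example.org",          # placeholder zone
    })
    print(f"{name}: allocated {ip.address} on lo0.0")
```

The same pattern extends to point-to-point /31s and IRB addresses, which is where doing it by hand across 17 devices gets tedious.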

Migration

Once all the new switches are in place, connected and configured we can begin the work of migrating existing hosts.

Bridge existing Vlans?

Similar to Eqiad row E/F, the plan will be to add a per-rack private subnet which will be the default for new hosts installed in each rack. Ultimately the desire is for all hosts requiring normal private-vlan connectivity to be moved to these new, per-rack Vlans.

Unfortunately some hosts, specifically our Ganeti servers, have a requirement to be on the same Vlan (for VM live migration) and in separate racks (for redundancy and operational flexibility). In the absence of any host-level L2-extension or routed solution to this (see T300152), we will likely need to provision a row-wide Vlan on the new switches for these hosts. The simplest option is probably to extend the existing private and public Vlans to the new switches and use those, as it avoids renumbering.

VC to EVPN switch connectivity

Extending the Vlans from the existing virtual-chassis to the new switches presents some challenges. As these are important production networks we need to have redundant connectivity in place. Connecting the VC master switches (2 and 7) to the EVPN Spine switches is probably the sensible way to physically connect the devices.

This gives us a problem in terms of L2 loop-prevention, however. If both Spines have independent trunks to the VC, with the same allowed Vlans, we'll create a loop. One option that comes to mind is to create an ESI-LAG between the Spines and connect to the VC from that. Alternatively we could look at Spanning Tree or other options.

Renumbering

If we don't extend the existing Vlans to the new switches we will need to renumber hosts when their physical connections are moved from old to new. And even if we do extend the Vlans it might make sense to renumber them at this point anyway (only one interruption for the host, and we have to do it eventually).

To allow for renumbering, some development will be needed to support a "--renumber" toggle for the reimage cookbook, which should delete the host's existing IP allocation and add a new one.
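
A minimal sketch of the Netbox operations such a toggle might perform is below; this is not the actual cookbook, and the per-rack prefix lookup, interface name and naming convention are assumptions:

```
# Rough sketch of a "--renumber" step: drop the host's current primary IP in
# Netbox and allocate a new one from the per-rack private prefix. The lookups
# and names here are illustrative assumptions, not the real reimage cookbook.
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="REDACTED")

def renumber_host(hostname: str) -> str:
    device = nb.dcim.devices.get(name=hostname)
    old_ip = device.primary_ip4
    # Assume the per-rack vlan/prefix follows the private1-<rack>-codfw naming,
    # e.g. private1-a4-codfw for a host in rack A4.
    rack = device.rack.name.lower()
    new_prefix = nb.ipam.prefixes.get(q=f"private1-{rack}-codfw")
    iface = nb.dcim.interfaces.get(device=hostname, name="eth0")  # assumed NIC name
    new_ip = new_prefix.available_ips.create({
        "assigned_object_type": "dcim.interface",
        "assigned_object_id": iface.id,
        "dns_name": old_ip.dns_name if old_ip else "",
    })
    device.primary_ip4 = new_ip.id
    device.save()
    if old_ip:
        old_ip.delete()          # remove the stale allocation
    return new_ip.address
```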

Renumbering presents additional challenges in terms of services running on the hosts, if they come back online with different IPs. A few things we need to consider (there are likely more):

  • DNS needs to be updated; old entries can still be in DNS caches
    • Is it possible to change the DNS TTLs in advance to help us here?
  • We may have hardcoded IPs in puppet for certain things. Possibly the renumbering script could perform a git grep of the IP in multiple repositories to look for these (like the decommissioning cookbook does); see the sketch after this list:
    • Puppet
    • Puppet private
    • Mediawiki-config
    • Deployment charts
    • homer-public
  • DNS records resolved at catalog compile time by the Puppet master, and those resolved at reload time (for example by ferm, but it could be any other service), will need to be refreshed, either by forcing a Puppet run, a ferm reload, or a specific service reload/restart.
  • Databases:
    • DB grants are issued per-IP
    • mediawiki connects to the DB via IP
    • dbctl has the IPs of the servers and feeds them into the mediawiki config stored in etcd
    • Backend servers behind LVS: TBD
    • Ganeti servers: depends on the whole Ganeti discussion
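
To illustrate the hardcoded-IP check mentioned in the list above, a minimal sketch follows; the repository paths and example IP are placeholders, and the real decommissioning cookbook may do this differently:

```
# Minimal sketch: grep several local repo checkouts for a hardcoded IP, in the
# spirit of the decommissioning cookbook. Repo paths and the IP are placeholders.
import subprocess

REPOS = [
    "/srv/git/operations/puppet",
    "/srv/git/operations/puppet-private",
    "/srv/git/operations/mediawiki-config",
    "/srv/git/operations/deployment-charts",
    "/srv/git/operations/homer-public",
]

def grep_ip(ip: str) -> dict[str, str]:
    """Return {repo_path: matching lines} for every repo that mentions the IP."""
    hits = {}
    for repo in REPOS:
        # -F: fixed-string match, -n: show line numbers. Exit code 1 = no match.
        result = subprocess.run(
            ["git", "-C", repo, "grep", "-Fn", ip],
            capture_output=True, text=True,
        )
        if result.returncode == 0 and result.stdout:
            hits[repo] = result.stdout
    return hits

if __name__ == "__main__":
    for repo, lines in grep_ip("10.192.0.10").items():   # example IP only
        print(f"== {repo} ==\n{lines}")
```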

Related Objects

19 subtasks, all Resolved (assigned to cmooney, Papaul and ayounsi).

Event Timeline


Thanks for the summary!

Some additional notes/thoughts:

  • public1-a/b-codfw hosts might be better grouped in a single rack per row, still providing redundancy (4 racks per site), limiting wasted IPs, and making renumbering unnecessary

VC to EVPN switch connectivity

the current rows A and B have all their 40G ports in use, so unless we manage to decom 1 switch (asw-b1-codfw, as the rack is being dedicated to WMCS), we will have to use 10G LAGs.
When we did a similar migration in the past we used a single LAG to prevent loops; in that case the switch terminating the LAG on the new fabric would be a SPOF. I don't have experience with ESI-LAG; let's see what the trade-offs are.

On the renumbering, to make some of the moves easier (especially the low-hanging fruit) and to test any automation scripts, one idea is to start renumbering hosts on their current switches.
For example, create private1-a4-codfw on the existing row A virtual-chassis and identify which hosts are not blockers (e.g. hosts not behind LVS, not Ganeti hosts, etc.).
Those hosts can then be easily migrated to the new fabric's top-of-rack when it's ready: during a maintenance window, turn off the relevant vlan on the core router, move the hosts' cables, and start the prefix advertisement on the new fabric.

  • public1-a/b-codfw hosts might be better grouped in a single rack per row, still providing redundancy (4 racks per site), limiting wasted IPs, and making renumbering unnecessary

Sure. I guess the only drawback there is moving the servers which may already be spread out. But overall it works.

VC to EVPN switch connectivity

the current rows A and B have all their 40G ports in use, so unless we manage to decom 1 switch (asw-b1-codfw, as the rack is being dedicated to WMCS), we will have to use 10G LAGs.

Maybe we should aim for that. If we do it we should be mindful of the issues we've seen before when converting vc-ports to regular trunks. But hopefully the upcoming upgrades will prevent any similar funny stuff.

When we did a similar migration in the past we used a single LAG to prevent loops; in that case the switch terminating the LAG on the new fabric would be a SPOF. I don't have experience with ESI-LAG; let's see what the trade-offs are.

Yeah, if we can tolerate the SPOF it's certainly easier than implementing the multi-chassis solution. In terms of ESI-LAG I've not done it before either; I assume it's relatively straightforward and reliable (it's Juniper's recommended approach these days), but it would definitely require some decent research/learning/testing time, so if we can avoid it, great.

@ayounsi @Papaul one other thing we didn't discuss last week was QSFP28 optics for the 100G switch -> switch links (and CR uplinks). We used 100GBase-CWDM4s in Eqiad, with duplex single-mode fiber. That didn't work out much more expensive than using 100GBase-SR4, as the MPO / multi-core fiber SR4 needs is pricier. But we also had cross-cage links there, so regular LC connectors were required for some of them, a constraint we don't have in codfw.

I've no particular preference, if I were doing it myself probably a slight one for the CWDM4/LC links, but happy to go with whatever the consensus/cheapest is.

I've no particular preference, if I were doing it myself probably a slight one for the CWDM4/LC links, but happy to go with whatever the consensus/cheapest is.

No strong preference; consistency with eqiad makes sense to me, and whatever is easiest for @Papaul. We won't be able to re-use the existing cabling as the two fabrics will run in parallel for a while. And the current infra is 40G, so we won't be able to re-use the optics either.

@Papaul in terms of the cables, here is what we will need to begin with. I'm assuming here we go with 100GBase-CWDM4, and therefore single-mode LC-LC links. If you'd prefer we use multi-mode cables with MPO connectors, we can revise.

1: Cables

Based on that we'll need 32 x single-mode LC-LC fibers, but I'm unsure of the exact lengths between all the racks. See the table in the task description for the full list.

2: Optic Modules

To terminate these at either side we will need to order 64 x 100GBase-CWDM4 optics (32 cables x 2 ends), plus I'd recommend getting 2 spares, so 66 in total.

Other considerations

CR Links

I'm assuming here we can run 100G links to the QSFP28 ports on the CRs. Based on current usage we should be able to add 1 x 100G link on port 2 or 5 of either PIC on the MPC7E cards. Currently they're all at 120G, with 3x40G used; adding 100G brings that to 220G, under the 240G per-PIC limit.
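
Spelling out that arithmetic with the figures quoted above:

```
# Per-PIC bandwidth check for the MPC7E figures quoted above.
existing_40g_links = 3
current = existing_40g_links * 40    # 120G in use today
proposed = current + 100             # add one 100G Spine uplink
pic_limit = 240                      # per-PIC limit in Gbit/s
assert proposed == 220 and proposed < pic_limit
print(f"{current}G -> {proposed}G of {pic_limit}G per PIC")
```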

NOTE: We could potentially also do "cross links" from each CR to both Spines. At 100G this would use up all the remaining bandwidth on the MPC7Es. Or we could decide to cable it like this but run at 40G, to increase redundancy but not bandwidth. @ayounsi interested to hear your thoughts; personally my instinct is to stick with the Spine1->CR1 and Spine2->CR2 setup, keeping things the same as Eqiad.

Migration Strategy / Direct links between VC and EVPN fabrics

These totals do not include cabling from the new Spines to existing virtual-chassis switches, which would be required if our plan is to bridge existing Vlans to the new switches (and thus allow us to move hosts from old to new switch without any change on hosts).

That question is tricky. Currently we have no free QSFP+ ports on the VC switches to facilitate such connections. Bringing cloudsw1-b1-codfw live, and migrating the cloud hosts in that rack to it, will free up one such port on asw-b2-codfw and asw-b7-codfw, which could then be re-used to connect to ssw1-a1-codfw/ssw1-a8-codfw.

But we don't have a similar option for row A, so I'm not sure what might be realistic here. Either way I suspect we could re-use the 40G optics currently in use if we re-use ports; if we do something else, like 10G links, we will need to order those closer to the time.

@cmooney I was about to update the table but I can't; only you can. Everything going from A1 to Bx and from A8 to Bx should be 12m (x=1,2,3,4,5,6,7,8). I will get you the lengths for A1 to Ay and A8 to Ay (y=2,3,4,5,6,7) some time next week. Thanks

@ayounsi interested to hear your thoughts; personally my instinct is to stick with the Spine1->CR1 and Spine2->CR2 setup, keeping things the same as Eqiad.

Agreed!

I was about to update the table but I can't; only you can.

I copied the table to the task description.

@ayounsi thanks for updating the desc!

@Papaul I'll update the table with the info provided and get back to you if any more questions.

I'll also put together a cost comparison of SR4/MPO vs CWDM4/LC-Duplex for the runs. There are quite a lot, and I want to make sure we're not needlessly wasting foundation funds, especially in the current climate.

@cmooney I updated the table with the lengths between all the racks.

RobH closed subtask Restricted Task as Resolved. May 1 2023, 12:22 PM
Papaul updated the task description.

@Papaul thanks for the work documenting the cable IDs. I've put the ones from above in Netbox now.

There is one discrepancy, the same label is listed for two different runs:

| A8 | A1 | Spine<->Leaf | 1 | 8m | 230403800017 |
| A8 | A8 | Spine<->Leaf within rack | 1 | 3m | 230403800017 |

I added those with a generic label to Netbox; can you check / confirm the right ones?

https://netbox.wikimedia.org/dcim/cables/?color=&length=&length_unit=&q=changeme_&site_id=9

Thanks!

Change 954684 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add includes for IPv6 reverse ranges for new linknets from CRs to SSW

https://gerrit.wikimedia.org/r/954684

Change 954684 merged by Cathal Mooney:

[operations/dns@master] Add includes for IPv6 reverse ranges for new linknets from CRs to SSW

https://gerrit.wikimedia.org/r/954684

Change 954697 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Homer YAML additions for new row A/B switches in Codfw

https://gerrit.wikimedia.org/r/954697

Change 954893 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add includes for Netbox generated dns for new per-rack codfw subnets

https://gerrit.wikimedia.org/r/954893

Change 954896 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add static network defs and DHCP config for new codfw subnets

https://gerrit.wikimedia.org/r/954896

Netbox cable ID update for ssw1-a8 to lsw1-a1 and lsw-a8

Change 954893 merged by Cathal Mooney:

[operations/dns@master] Add includes for Netbox generated dns for new per-rack codfw subnets

https://gerrit.wikimedia.org/r/954893

Change 954980 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add includes for new /24s used in EVPN underlay network codfw

https://gerrit.wikimedia.org/r/954980

Change 954980 merged by Cathal Mooney:

[operations/dns@master] Add includes for new /24s used in EVPN underlay network codfw

https://gerrit.wikimedia.org/r/954980

cmooney closed subtask Restricted Task as Resolved. Sep 6 2023, 10:39 AM

@Papaul I've done some testing and I'm confident the IP GW moves for the row subnets to the Spines can be done gracefully. I've yet to work on BGP, but either way I think we need to plan out the links between the existing switch rows and the new Spines. As discussed, we'll disconnect the links from the CRs to the ASWs to free up the ASW ports for these.

Ultimately we'll have:

| Row | VC Switch | VC Switch Port | Spine Switch | Spine Switch Port |
|-----|-----------|----------------|--------------|-------------------|
| A | asw-a2-codfw | et-2/0/52 | ssw1-a1-codfw | et-0/0/29 |
| A | asw-a7-codfw | et-7/0/52 | ssw1-a8-codfw | et-0/0/29 |
| B | asw-b2-codfw | et-2/0/51 | ssw1-a1-codfw | et-0/0/30 |
| B | asw-b7-codfw | et-7/0/52 | ssw1-a8-codfw | et-0/0/30 |

Those ports on the VC switches are currently in use for the uplinks to the CRs, though. So we need to move them one by one, co-ordinated with netops, while we make changes on the devices to move the GW IPs from the CRs to the Spines.

The ASWs already have 40GBase-SR4 optics in them, which we can re-use. We can take the optics from the CRs and use them to terminate on the Spines, so we should be OK for modules. I'm not 100% sure if you need new multi-core/MPO multi-mode fibers, or if we can re-use the ones already in place (given they run between the same cabs).

Anyway just a heads up so you can be prepared. If you want me to open a separate task let me know. Thanks!

@cmooney thanks for the update. I think we can reuse those MPO cables.

In terms of the LVS connections from rows C and D, when we move from old switches to new ones we need to land those on the Spines rather than on the top-of-racks as in the old design.

This needs to be carefully co-ordinated to not cause interruption, but in terms of the final cabling it will be like this:

| LVS | Old Switch | Old Port | New Switch | New Port |
|-----|------------|----------|------------|----------|
| lvs2013 | asw-a2-codfw | xe-2/0/43 | ssw1-a1-codfw | xe-0/0/32 |
| lvs2014 | asw-a4-codfw | xe-4/0/47 | ssw1-a1-codfw | xe-0/0/33 |
| lvs2013 | asw-b2-codfw | xe-2/0/43 | ssw1-a8-codfw | xe-0/0/32 |
| lvs2014 | asw-b4-codfw | xe-4/0/47 | ssw1-a8-codfw | xe-0/0/33 |

We need to be careful to take note of this when migrating in cabs A2/A4/B2/B4.

Change 954896 merged by Cathal Mooney:

[operations/puppet@production] Add static network defs and DHCP config for new codfw subnets

https://gerrit.wikimedia.org/r/954896

Change 954697 merged by jenkins-bot:

[operations/homer/public@master] Homer YAML additions for new row A/B switches in Codfw

https://gerrit.wikimedia.org/r/954697

Change 959873 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Support configuration of EVPN anycast GW on switches

https://gerrit.wikimedia.org/r/959873

Change 959873 merged by jenkins-bot:

[operations/homer/public@master] Support configuration of EVPN anycast GW on switches

https://gerrit.wikimedia.org/r/959873

ayounsi mentioned this in Unknown Object (Task). Oct 2 2023, 7:14 AM

@cmooney adding a note here so we don't forget. We'll need to check how this will work for Ganeti VMs; in particular the makevm cookbook knows which DCs have per-rack subnets and treats them differently, but if it also needs to be aware of rows then it needs some refactoring, and possibly to get that information live instead of having it hardcoded.
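
One possible way to get that information live rather than hardcoding it would be to ask Netbox whether a per-rack vlan exists; this is purely a sketch (not the current makevm code), and the vlan naming lookup and fallback are assumptions:

```
# Sketch: decide whether a rack uses per-rack subnets by checking Netbox for a
# vlan named private1-<rack>-<dc>, instead of a hardcoded list of sites in the
# makevm cookbook. Illustrative only; URL, token and fallback logic are assumed.
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="REDACTED")

def has_per_rack_subnet(dc: str, rack: str) -> bool:
    """True if a per-rack private vlan exists, e.g. private1-a4-codfw."""
    return nb.ipam.vlans.get(name=f"private1-{rack.lower()}-{dc}") is not None

# Example: fall back to the legacy row-wide vlan when no per-rack one exists.
dc, rack = "codfw", "A4"
if has_per_rack_subnet(dc, rack):
    vlan = f"private1-{rack.lower()}-{dc}"
else:
    vlan = f"private1-{rack[0].lower()}-{dc}"   # assumed row-wide naming
print(vlan)
```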

Change 965148 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add puppet elements for newly added switches.

https://gerrit.wikimedia.org/r/965148

Change 965148 merged by Cathal Mooney:

[operations/puppet@production] Add puppet elements for newly added switches.

https://gerrit.wikimedia.org/r/965148

@cmooney adding a note here so we don't forget. We'll need to check how this will work for Ganeti VMs; in particular the makevm cookbook knows which DCs have per-rack subnets and treats them differently, but if it also needs to be aware of rows then it needs some refactoring, and possibly to get that information live instead of having it hardcoded.

Thanks. I think the plan really should be to keep the existing Ganeti logic, and not try to move the existing ganeti hosts to the per-rack vlans until we are in a position to move forward with T300152: Investigate Ganeti in routed mode. The logic we have at the POPs, with 2 racks, wouldn't be a good fit for our larger sites. We can support the legacy row-wide vlans until that is ready, and still migrate the remaining hosts to the new per-rack vlans.

Change 973752 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add netboot config for new private vlans in codfw rows A/B

https://gerrit.wikimedia.org/r/973752

Change 973752 merged by Cathal Mooney:

[operations/puppet@production] Add netboot config for new private vlans in codfw rows A/B

https://gerrit.wikimedia.org/r/973752

cmooney renamed this task from "Plan codfw row A/B top-of-rack switch refresh" to "Codfw row A/B top-of-rack switch refresh". Jan 11 2024, 2:02 PM
cmooney closed this task as Resolved (edited). Mar 22 2024, 4:41 PM
cmooney claimed this task.

Closing this one. I've made some notes on Wikitech (below) about how to approach this for future rows.

https://wikitech.wikimedia.org/wiki/Migrate_from_VC_switch_stack_to_EVPN

We still have to renumber all the end devices, but we can deal with that separately from the migration to the new switches.

T354869: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets