New Juniper QFX5120-48Y top-of-rack switches have been delivered to codfw under T312138. These are intended to replace the existing switches in rows A and B in codfw, as part of a normal refresh cycle.
We have a current requirement for L2 Vlans that stretch across multiple racks (Ganeti, LVS), which for the current rows is achieved with Juniper's virtual-chassis feature. A different approach will be adopted on the replacement switches, using VXLAN/EVPN similar to the setup in Eqiad rows E and F. The plan involves deploying 2 QFX5120-32C Spine/aggregation switches to interconnect the top-of-rack switches in rows A and B, but these have not yet been delivered. The current estimate for those is August 2023.
In a meeting on Jan 24th 2023 Infra Foundations (Cathal, Arzhel, Riccardo) and DC-Ops (Papaul) agreed that there was not great urgency, and we should wait until the Spine switches are delivered and ready before we start moving anything. In other words no interim plan using only the 5120-48Y devices is being contemplated.
Physical Installation and Planning
Having the top-of-rack devices does allow us to get some things prepped in advance, however, and to plan the build/migration.
Spine Physical Location
The current plan is to install the two Spine switches, aggregating rows A and B, in racks A1 and A8. Those locations make the uplinks to the CR routers (also in those racks) easier, and were recommended by Papaul. It was provisionally decided to place the "future" Spines (that will aggregate replacement switches in rows C and D when the time comes) in racks D1 and D8, but that is not part of the current task and can be revisited later.
Rack location
Papaul expressed a preference to install the switches at the top of each rack. This makes it easier to move them in and out of the rack, as they sit above the PDUs so no power connectors get in the way.
Cables
Based on that we'll need 32 x single-mode LC-LC fibers, but I'm unsure of the exact lengths between all the racks:
Rack 1 | Rack 2 | Description | Qty | Length | Cable ID |
---|---|---|---|---|---|
A1 | A1 | Spine<->Leaf within rack | 1 | 3m | 230403800036 |
A1 | A1 | Spine<->CR | 1 | 3m | 230403800035 |
A1 | A2 | Spine<->Leaf | 1 | 5m | |
A1 | A3 | Spine<->Leaf | 1 | 5m | |
A1 | A4 | Spine<->Leaf | 1 | 5m | |
A1 | A5 | Spine<->Leaf | 1 | 8m | |
A1 | A6 | Spine<->Leaf | 1 | 8m | |
A1 | A7 | Spine<->Leaf | 1 | 8m | |
A1 | A8 | Spine<->Leaf (the leaf in each rack links to the spine in the other) | 2 | 8m | |
A1 | B2 | Spine<->Leaf | 1 | 12m | 230403800009 |
A1 | B3 | Spine<->Leaf | 1 | 12m | 230403800006 |
A1 | B4 | Spine<->Leaf | 1 | 12m | 230403800001 |
A1 | B5 | Spine<->Leaf | 1 | 12m | 230403800004 |
A1 | B6 | Spine<->Leaf | 1 | 12m | 230403800002 |
A1 | B7 | Spine<->Leaf | 1 | 12m | 230403800007 |
A1 | B8 | Spine<->Leaf | 1 | 12m | 230403800005 |
A8 | A1 | Spine<->Leaf | 1 | 8m | 230403800017 |
A8 | A2 | Spine<->Leaf | 1 | 8m | 230403800026 |
A8 | A3 | Spine<->Leaf | 1 | 8m | 230403800021 |
A8 | A4 | Spine<->Leaf | 1 | 8m | 230403800024 |
A8 | A5 | Spine<->Leaf | 1 | 5m | 230403800028 |
A8 | A6 | Spine<->Leaf | 1 | 5m | 230403800032 |
A8 | A7 | Spine<->Leaf | 1 | 5m | 230403800030 |
A8 | A8 | Spine<->Leaf within rack | 1 | 3m | 230403800017 |
A8 | A8 | Spine<->CR | 1 | 3m | 230403800040 |
A8 | B2 | Spine<->Leaf | 1 | 12m | 230403800016 |
A8 | B3 | Spine<->Leaf | 1 | 12m | 230403800003 |
A8 | B4 | Spine<->Leaf | 1 | 12m | 230403800013 |
A8 | B5 | Spine<->Leaf | 1 | 12m | 230403800015 |
A8 | B6 | Spine<->Leaf | 1 | 12m | 230403800012 |
A8 | B7 | Spine<->Leaf | 1 | 12m | 230403800008 |
A8 | B8 | Spine<->Leaf | 1 | 12m | 230403800010 |
Transceivers
We will need some 1000Base-T SFPs (100m reach) to connect 1G servers to the new switches. I have a total of 76 on site right now.
Zeroconf
The preference is to use Zeroconf to take care of the initial base config for the new switches. This will take some development time but we've not identified any blockers thus far.
Configuration
Most of the Juniper configuration for these devices is currently automated through Homer. The one element that needs to be added is the BGP EVPN neighbor configuration (see T327934).
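For illustration, the overlay sessions Homer will need to template are standard BGP neighbors with the EVPN address family enabled. A minimal sketch of what that could look like on one switch is below; the group name, addresses and iBGP choice are placeholders only, and the real design should follow whatever was done for Eqiad rows E/F:

```
protocols {
    bgp {
        group EVPN_OVERLAY {
            /* Placeholder values throughout: session type, local loopback and
               neighbor loopbacks all depend on the final design. */
            type internal;
            local-address 10.192.252.10;
            family evpn {
                signaling;
            }
            neighbor 10.192.252.201;
            neighbor 10.192.252.202;
        }
    }
}
```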
IP Allocations / Netbox
One element we may want to improve on is IP allocation and device assignment in Netbox, as well as DNS zone generation. There are a lot of point-to-point links, new subnets, loopbacks, IRB interfaces etc. to be added across all 17 devices. For Eqiad rows E/F one-off scripting was used to generate some of this, but it may be worth developing more robust, re-usable scripting as we'll likely need it again.
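As a rough illustration of the kind of re-usable scripting meant here, the sketch below allocates a /31 per Spine<->Leaf link from a parent prefix and assigns each side to the right interface via the Netbox API (pynetbox). The URL, token, parent prefix, device and interface names are all placeholders, not the real codfw values:

```python
#!/usr/bin/env python3
"""Sketch only: bulk allocation of point-to-point /31s in Netbox via pynetbox."""
import ipaddress

import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="REDACTED")

# Parent prefix assumed to be reserved for Spine<->Leaf link networks (placeholder).
link_parent = nb.ipam.prefixes.get(prefix="10.192.248.0/24")

# (spine device, spine interface, leaf device, leaf interface) - placeholder names.
links = [
    ("spine1-a1-codfw", "et-0/0/1", "leaf1-a2-codfw", "et-0/0/55"),
    ("spine1-a1-codfw", "et-0/0/2", "leaf1-a3-codfw", "et-0/0/55"),
]

for spine, spine_if, leaf, leaf_if in links:
    # Carve the next free /31 out of the parent prefix for this link.
    p2p = link_parent.available_prefixes.create({"prefix_length": 31})
    spine_addr, leaf_addr = ipaddress.ip_network(p2p.prefix).hosts()

    for device, ifname, addr in ((spine, spine_if, spine_addr), (leaf, leaf_if, leaf_addr)):
        iface = nb.dcim.interfaces.get(device=device, name=ifname)
        # Create the IP and bind it to the interface in one go.
        nb.ipam.ip_addresses.create(
            address=f"{addr}/31",
            assigned_object_type="dcim.interface",
            assigned_object_id=iface.id,
        )
    print(f"{p2p.prefix}: {spine}:{spine_if} <-> {leaf}:{leaf_if}")
```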
Migration
Once all the new switches are in place, connected and configured we can begin the work of migrating existing hosts.
Bridge existing Vlans?
Similar to Eqiad row E/F, the plan will be to add a per-rack private subnet which will be the default for new hosts installed in each rack. Ultimately the desire is for all hosts requiring normal private-vlan connectivity to be moved to these new, per-rack Vlans.
Unfortunately some hosts, specifically our Ganeti servers, have a requirement to be on the same Vlan (for VM live migration) and in separate racks (for redundancy and operational flexibility). In the absence of any host-level L2-extension or routed solution to this (see T300152), we will likely need to provision a row-wide Vlan on the new switches for these hosts. The simplest option is probably to extend the existing private and public Vlans to the new switches and use those, as it avoids renumbering.
VC to EVPN switch connectivity
Extending the Vlans from the existing virtual-chassis to the new switches presents some challenges. As these are important production networks we need to have redundant connectivity in place. Connecting the VC master switches (2 and 7) to the EVPN Spine switches is probably the sensible way to physically connect the devices.
This gives us a problem in terms of L2 loop prevention, however. If both Spines have independent trunks to the VC, with the same allowed Vlans, we'll create a loop. One option is to create an ESI-LAG across the Spines and connect the VC to that. Alternatively we could look at Spanning Tree or other options.
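For reference, an ESI-LAG works by configuring the same ESI and LACP system-id on the aggregated interface of both Spines, so the VC sees a single LACP partner while EVPN handles the multihoming. A rough sketch of what that could look like on each Spine (interface name, ESI value, system-id and Vlan names are all placeholders):

```
interfaces {
    ae10 {
        esi {
            00:00:00:00:00:00:ca:fe:00:01;   /* same ESI on both Spines (placeholder value) */
            all-active;
        }
        aggregated-ether-options {
            lacp {
                active;
                system-id 00:00:ca:fe:00:01;  /* same system-id on both Spines (placeholder) */
            }
        }
        unit 0 {
            family ethernet-switching {
                interface-mode trunk;
                vlan {
                    members [ private1-a-codfw public1-a-codfw ];  /* placeholder Vlan names */
                }
            }
        }
    }
}
```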
Renumbering
If we don't extend the existing Vlans to the new switches we will need to renumber hosts when their physical connections are moved from old to new. And even if we do extend the Vlans it might make sense to renumber them at this point anyway (only one interruption for the host, and we have to do it eventually).
To allow for renumbering, some development will be needed to support a "--renumber" toggle for the reimage cookbook, which should delete the host's existing IP allocation and add a new one.
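The Netbox side of such a toggle might look something like the sketch below (pynetbox). This is purely illustrative, not the actual reimage cookbook code; the URL, token, hostname and prefix are placeholders:

```python
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="REDACTED")  # placeholders

def renumber(hostname: str, new_prefix: str) -> str:
    """Sketch: drop the host's current primary IPv4 and allocate one from new_prefix."""
    device = nb.dcim.devices.get(name=hostname)
    old_ip = nb.ipam.ip_addresses.get(device.primary_ip4.id)
    iface_id = old_ip.assigned_object_id  # remember which interface it was bound to
    old_ip.delete()

    # Allocate the next free address in the new per-rack subnet and bind it
    # to the same interface.
    prefix = nb.ipam.prefixes.get(prefix=new_prefix)
    new_ip = prefix.available_ips.create({
        "assigned_object_type": "dcim.interface",
        "assigned_object_id": iface_id,
        "dns_name": f"{hostname}.codfw.wmnet",
    })

    device.primary_ip4 = new_ip.id
    device.save()
    return new_ip.address

# Example (hypothetical host and subnet):
# renumber("ganeti2042", "10.192.10.0/24")
```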
Renumbering presents additional challenges in terms of services running on the hosts, if they come back online with different IPs. A few things we need to consider (there are likely more):
- DNS needs to be updated; old entries can still be in DNS caches
- Is it possible to change the DNS TTLs in advance to help us here?
- We may have hardcoded IPs in puppet for certain things. Possibly the renumbering script could perform a git grep for the IP in multiple repositories to look for these, like the decommissioning cookbook does (a rough sketch of such a check follows this list):
- Puppet
- Puppet private
- Mediawiki-config
- Deployment charts
- homer-public
- DNS records resolved at catalog compile time by the Puppet master, and those resolved for example by ferm at reload time (it could be any other service), will need updating, either by forcing a Puppet run, a ferm reload, or a specific service reload/restart.
- Databases:
- DB grants are issued per-IP
- MediaWiki connects to the DBs via IP
- dbctl has the IPs of the servers and provides them to the MediaWiki config stored in etcd
- Backend servers behind LVS: TBD
- Ganeti servers: depends on the whole Ganeti discussion
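As mentioned above, here is a rough sketch of the hardcoded-IP check. The repo paths and example IP are placeholders; the real list would mirror the repositories the decommissioning cookbook checks:

```python
import subprocess
from pathlib import Path

# Local checkouts of the repos to search; paths are placeholders.
REPOS = [
    "/srv/git/operations/puppet",
    "/srv/git/operations/puppet-private",
    "/srv/git/operations/mediawiki-config",
    "/srv/git/operations/deployment-charts",
    "/srv/git/operations/homer-public",
]

def grep_ip(ip: str) -> None:
    """Print any hardcoded occurrences of `ip` in the checked-out repos."""
    for repo in REPOS:
        result = subprocess.run(
            ["git", "-C", repo, "grep", "--line-number", "--fixed-strings", ip],
            capture_output=True, text=True, check=False,
        )
        if result.stdout:
            print(f"=== {Path(repo).name} ===")
            print(result.stdout, end="")

grep_ip("10.192.0.10")  # example IP only
```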