Devices: ssw1-e1-eqiad, ssw1-f1-eqiad
When: Thur Jun 6th 2024, 12:00 UTC
Downtime: 60 minutes
As discussed in the parent task we need to upgrade all our QFX5120 devices in Ashburn to a more recent JunOS to overcome some bugs and use newer features.
The Spine-layer switches can mostly be upgraded without affecting servers. The top-of-rack switches in each rack have links to both Spines, so we can upgrade one at a time without cutting comms to any rack. However, we do have connections from LVS servers in remote rows (A-D) which land on the Spine switches so the load-balancers can reach backend servers in rows E/F. (The list of hosts/services with backends in these rows is here: P63779)
The LVS servers are connected as follows:
Host | Interface | Switch |
---|---|---|
lvs1017 | enp94s0f0np0 | ssw1-e1-eqiad |
lvs1018 | enp94s0f0np0 | ssw1-e1-eqiad |
lvs1019 | enp94s0f0np0 | ssw1-f1-eqiad |
lvs1020 | enp94s0f0np0 | ssw1-f1-eqiad |
lvs1020 is the backup LVS server for all the others. That means we cannot upgrade ssw1-f1-eqiad without a complete outage to all services fronted by lvs1019, as the reboot will also disrupt comms to rows E/F from the only backup, lvs1020.
While it might be possible to do ssw1-e1-eqiad by failing both lvs1017 and lvs1018 over to lvs1020 in advance, Traffic advise it is not a good idea to have all the requests those hosts handle re-routed to the single backup host.
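The constraint described above can be checked mechanically. The following is a minimal sketch only, using the host-to-switch mapping from the table in this task; the function name `reboot_impact` and the result keys are illustrative and not part of any existing tooling.

```python
# Mapping taken from the LVS connection table: LVS host -> Spine switch it lands on.
LVS_UPLINKS = {
    "lvs1017": "ssw1-e1-eqiad",
    "lvs1018": "ssw1-e1-eqiad",
    "lvs1019": "ssw1-f1-eqiad",
    "lvs1020": "ssw1-f1-eqiad",
}
BACKUP = "lvs1020"  # backup LVS server for all the others


def reboot_impact(switch: str) -> dict:
    """Summarise which LVS hosts lose their row E/F link if `switch` reboots.

    Hypothetical helper for illustration only.
    """
    affected = sorted(h for h, sw in LVS_UPLINKS.items() if sw == switch)
    return {
        "switch": switch,
        "primaries_cut": [h for h in affected if h != BACKUP],
        "backup_cut": BACKUP in affected,
        # Failing over only helps if the backup keeps its own link.
        "failover_possible": BACKUP not in affected,
    }


if __name__ == "__main__":
    for sw in sorted(set(LVS_UPLINKS.values())):
        print(reboot_impact(sw))
```

For ssw1-f1-eqiad the sketch reports that the backup itself is cut, so failover is not an option. For ssw1-e1-eqiad failover is technically possible, but it would put both lvs1017 and lvs1018 on the single backup at once, which is the case Traffic advise against.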
Initially we had planned to depool eqiad in DNS, to move services away from the load-balancer layer in eqiad and allow for the disruption in comms. Having discussed with Service Ops, however, it seems this was somewhat unrealistic, and a full site switchover would be required instead. That is disproportionate for a 20-minute outage to two 10G ports in our DC, so we will instead move these links in advance as follows:
1. Move lvs1017 link
- Disable PyBal on lvs1017, shifting traffic it handles to lvs1020
- Move single-mode fibre and SFP+ module from ssw1-e1-eqiad xe-0/0/32 to lsw1-e1-eqiad xe-0/0/8
- Test connectivity to the row E/F vlans from lvs1017 following the cable move
- Re-enable PyBal on lvs1017
2. Move lvs1018 link
- Disable PyBal on lvs1018, shifting traffic it handles to lvs1020
- Move single-mode fibre and SFP+ module from ssw1-e1-eqiad xe-0/0/33 to lsw1-e1-eqiad xe-0/0/9
- Test connectivity to the row E/F vlans from lvs1018 following the cable move
- Re-enable PyBal on lvs1018
3. Move lvs1019 link
- Disable PyBal on lvs1019, shifting traffic it handles to lvs1020
- Move single-mode fibre and SFP+ module from ssw1-f1-eqiad xe-0/0/32 to lsw1-f1-eqiad xe-0/0/8
- Test connectivity to the row E/F vlans from lvs1019 following the cable move
- Re-enable PyBal on lvs1019
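Since the three moves follow the same four steps, the checklist can be generated from the port pairs rather than typed by hand. This is a rough sketch under that assumption: the `LinkMove` class and `checklist` function are hypothetical names, and the port pairs are the ones listed in this task. The `to_leaf=False` direction covers moving the links back once the Spines are upgraded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkMove:
    """One LVS uplink and its temporary home (illustrative structure only)."""
    host: str
    spine: str
    spine_port: str
    leaf: str
    leaf_port: str


# Port pairs as listed in this task.
MOVES = [
    LinkMove("lvs1017", "ssw1-e1-eqiad", "xe-0/0/32", "lsw1-e1-eqiad", "xe-0/0/8"),
    LinkMove("lvs1018", "ssw1-e1-eqiad", "xe-0/0/33", "lsw1-e1-eqiad", "xe-0/0/9"),
    LinkMove("lvs1019", "ssw1-f1-eqiad", "xe-0/0/32", "lsw1-f1-eqiad", "xe-0/0/8"),
]
BACKUP = "lvs1020"


def checklist(move: LinkMove, to_leaf: bool = True) -> list[str]:
    """Return the four steps for one host; to_leaf=False reverses the cable move."""
    src = (move.spine, move.spine_port)
    dst = (move.leaf, move.leaf_port)
    if not to_leaf:
        src, dst = dst, src
    return [
        f"Disable PyBal on {move.host}, shifting traffic it handles to {BACKUP}",
        f"Move single-mode fibre and SFP+ module from {src[0]} {src[1]} to {dst[0]} {dst[1]}",
        f"Test connectivity to the row E/F vlans from {move.host} following the cable move",
        f"Re-enable PyBal on {move.host}",
    ]


if __name__ == "__main__":
    for number, move in enumerate(MOVES, start=1):
        print(f"{number}. Move {move.host} link")
        for step in checklist(move):
            print(f"- {step}")
```

Processing the hosts strictly one at a time keeps at most one primary failed over to lvs1020 at any moment.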
In advance we need to pre-provision ports on our Leaf switches as follows, to temporarily terminate the fibres from the 3 live LVS hosts:
LVS | Temp Leaf Switch | Temp Port | Copy Config From | Source Port |
---|---|---|---|---|
lvs1017 | lsw1-e1-eqiad | xe-0/0/8 | ssw1-e1-eqiad | xe-0/0/32 |
lvs1018 | lsw1-e1-eqiad | xe-0/0/9 | ssw1-e1-eqiad | xe-0/0/33 |
lvs1019 | lsw1-f1-eqiad | xe-0/0/8 | ssw1-f1-eqiad | xe-0/0/32 |
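One way to picture the "copy config" step: take the Spine port's configuration in `set` form and rewrite the interface name to the temporary Leaf port. The sketch below assumes this is done by hand from set-style output rather than through whatever automation normally manages these devices. The helper name `retarget_set_lines` and the sample config lines are hypothetical, not the real port configuration. It only touches `set interfaces <port> ...` lines, so any other stanza that references the port would need separate handling.

```python
import re


def retarget_set_lines(lines, src_port, dst_port):
    """Rewrite `set interfaces <src_port> ...` lines to use dst_port.

    Lines for any other interface are returned unchanged.
    """
    # Require whitespace or end-of-line after the port so that
    # e.g. xe-0/0/3 does not match xe-0/0/32.
    pattern = re.compile(r"^(set interfaces )" + re.escape(src_port) + r"(\s|$)")
    return [
        pattern.sub(lambda m: m.group(1) + dst_port + m.group(2), line)
        for line in lines
    ]


# Hypothetical sample input; NOT the real configuration of these ports.
SAMPLE = [
    'set interfaces xe-0/0/32 description "lvs1017 (hypothetical)"',
    "set interfaces xe-0/0/32 mtu 9192",
    'set interfaces xe-0/0/33 description "untouched"',
]

if __name__ == "__main__":
    for line in retarget_set_lines(SAMPLE, "xe-0/0/32", "xe-0/0/8"):
        print(line)
```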
Once the Spine switches are upgraded we repeat the steps detailed above, moving the physical connections back to the Spine switches from the temporary lsw ports, and then reset the temporary Leaf switch ports to their default config.