**Devices**: ssw1-e1-eqiad, ssw1-f1-eqiad
**When**: Thursday Jun 6th 2024, 15:00 UTC
**Downtime**: 60 minutes
As discussed in the parent task we need to upgrade all our QFX5120 devices in Ashburn to a more recent JunOS to overcome some bugs and use newer features.
The Spine-layer switches can mostly be upgraded without affecting servers. The top-of-rack switches in each rack have links to both Spines, so we can upgrade one Spine at a time without cutting comms to any rack. However, we do have connections from LVS servers in remote rows (A-D) which land on the Spine switches so the load-balancers can reach backend servers in rows E/F. (The list of hosts/services with backends in these rows is here: P63779)
The LVS servers are connected as follows:
|Host|Interface|Switch|
|-------|--------------|----------|
|lvs1017|enp94s0f0np0|ssw1-e1-eqiad|
|lvs1018|enp94s0f0np0|ssw1-e1-eqiad|
|lvs1019|enp94s0f0np0|ssw1-f1-eqiad|
|lvs1020|enp94s0f0np0|ssw1-f1-eqiad|
lvs1020 is the backup LVS server for all the others. That means we cannot upgrade ssw1-f1-eqiad without a complete outage to all services fronted by lvs1019, as the reboot will also disrupt comms to rows E/F from the only backup, lvs1020.
While it might be possible to do ssw1-e1-eqiad by failing both lvs1017 and lvs1018 over to lvs1020 in advance, the Traffic team advise it is not a good idea to have all the requests those hosts handle re-routed to the single backup host.
Initially we had planned to depool eqiad in DNS, to stop traffic being sent to the LVS VIPs there and allow for the disruption in comms. Having discussed with Service Ops, however, it seems this was somewhat unrealistic, and a full site switchover would be required instead. That seems unrealistic to perform just to handle a 20 minute outage to two 10G ports in our DC, so we will instead move these links in advance as follows, which will mean the break in connectivity from the LVS hosts to rows E and F won't be an issue:
**1. Move lvs1017 link**
# Disable PyBal on lvs1017, shifting traffic it handles to lvs1020
# Move single-mode fibre and SFP+ module from //ssw1-e1-eqiad xe-0/0/32// to //lsw1-e1-eqiad xe-0/0/8//
# Test connectivity to the row E/F vlans from lvs1017 following the cable move
# Re-enable PyBal on lvs1017
**2. Move lvs1018 link**
# Disable PyBal on lvs1018, shifting traffic it handles to lvs1020
# Move single-mode fibre and SFP+ module from //ssw1-e1-eqiad xe-0/0/33// to //lsw1-e1-eqiad xe-0/0/9//
# Test connectivity to the row E/F vlans from lvs1018 following the cable move
# Re-enable PyBal on lvs1018
**3. Move lvs1019 link**
# Disable PyBal on lvs1019, shifting traffic it handles to lvs1020
# Move single-mode fibre and SFP+ module from //ssw1-f1-eqiad xe-0/0/32// to //lsw1-f1-eqiad xe-0/0/8//
# Test connectivity to the row E/F vlans from lvs1019 following the cable move
# Re-enable PyBal on lvs1019
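After each cable move, the link can be sanity-checked from the Leaf switch before re-enabling PyBal. A rough sketch of the checks after step 1 (port names per the pre-provisioning table; exact output will vary):

```
# Link is up at 10G on the new port
user@lsw1-e1-eqiad> show interfaces xe-0/0/8 terse

# lvs1017 shows up as the LLDP neighbor on the moved port
user@lsw1-e1-eqiad> show lldp neighbors interface xe-0/0/8

# The host's MAC is being learned on the expected vlans
user@lsw1-e1-eqiad> show ethernet-switching table interface xe-0/0/8
```

In addition to the host-side connectivity tests to the row E/F vlans noted in each step above.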
In advance we need to pre-provision ports on our Leaf switches as follows to temporarily terminate the fibres from the 3 live LVS hosts:
|LVS|Temp Leaf Switch|Leaf Port|Copy Config From|Spine Port|
|------|-----------------|-------|---------------------------|---|
|lvs1017|lsw1-e1-eqiad|xe-0/0/8|ssw1-e1-eqiad|xe-0/0/32|
|lvs1018|lsw1-e1-eqiad|xe-0/0/9|ssw1-e1-eqiad|xe-0/0/33|
|lvs1019|lsw1-f1-eqiad|xe-0/0/8|ssw1-f1-eqiad|xe-0/0/32|
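One way to pre-provision these is to copy the interface stanza from the Spine port and apply it against the new port name on the Leaf. A sketch for the lvs1017 port (this assumes the config is pushed by hand; if the ports are managed by automation the equivalent change would go there instead):

```
# On ssw1-e1-eqiad: capture the current interface config in set format
user@ssw1-e1-eqiad> show configuration interfaces xe-0/0/32 | display set

# On lsw1-e1-eqiad: apply the captured lines, renamed to the temp port
user@lsw1-e1-eqiad> configure
user@lsw1-e1-eqiad# set interfaces xe-0/0/8 ...   # paste captured config here

# Commit with automatic rollback in case we lose access to the device
user@lsw1-e1-eqiad# commit confirmed 2
```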
Once the Spine switches are upgraded we repeat the steps detailed above, moving the physical connections back to the Spine switches from the temporary lsw ports. Once we've verified things look good we can then default the temporary ports on the Leaf switches.
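Defaulting the temporary Leaf ports is then just a case of removing the copied config, for example (again assuming manual config; adjust if the ports are managed by automation):

```
# On lsw1-e1-eqiad: remove the temporary port config
user@lsw1-e1-eqiad> configure
user@lsw1-e1-eqiad# delete interfaces xe-0/0/8
user@lsw1-e1-eqiad# delete interfaces xe-0/0/9
user@lsw1-e1-eqiad# commit

# Same on lsw1-f1-eqiad for xe-0/0/8
```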