**Devices**: ssw1-e1-eqiad, ssw1-f1-eqiad
**When**: Wed Jun 4th 2024, 15:00 UTC
**Downtime**: 60 minutes
As discussed in the parent task, we need to upgrade all our QFX5120 devices in Ashburn to a more recent Junos release to overcome some bugs and make use of newer features.
These Spine-layer switches can mostly be upgraded without affecting servers: the top-of-rack switches in each rack are connected to both Spines, so we can upgrade one at a time without cutting comms to any rack. However, we have connections from LVS servers in rows A-D landing on the Spine switches, which is how they reach backend hosts in rows E/F.
The LVS servers are connected as follows:
|Host|Interface|Switch|
|-------|--------------|----------|
|lvs1017|enp94s0f0np0|ssw1-e1-eqiad|
|lvs1018|enp94s0f0np0|ssw1-e1-eqiad|
|lvs1019|enp94s0f0np0|ssw1-f1-eqiad|
|lvs1020|enp94s0f0np0|ssw1-f1-eqiad|
lvs1020 is the backup LVS server for all the others. That means we cannot upgrade ssw1-f1-eqiad without a complete outage of all services fronted by lvs1019, as the reboot will halt comms to rows E/F from both lvs1019 and the backup lvs1020.
While it might be possible to do ssw1-e1-eqiad by failing over lvs1017 and lvs1018 to lvs1020 in advance, the Traffic team advise it is not a good idea to have all the requests those hosts serve re-routed to the single backup host.
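The failure-domain reasoning above can be sketched as a quick sanity check. The uplink mapping comes from the table; the function names and structure are illustrative, not an existing tool:

```python
# Which LVS hosts lose their path to rows E/F when a given Spine reboots?
# Mapping taken from the LVS connection table above.
LVS_UPLINKS = {
    "lvs1017": "ssw1-e1-eqiad",
    "lvs1018": "ssw1-e1-eqiad",
    "lvs1019": "ssw1-f1-eqiad",
    "lvs1020": "ssw1-f1-eqiad",
}
BACKUP_LVS = "lvs1020"  # backup for all the other LVS hosts


def affected_hosts(switch: str) -> list[str]:
    """LVS hosts whose uplink to rows E/F goes through `switch`."""
    return sorted(h for h, s in LVS_UPLINKS.items() if s == switch)


def backup_survives(switch: str) -> bool:
    """True if the backup LVS keeps its uplink while `switch` is down."""
    return LVS_UPLINKS[BACKUP_LVS] != switch
```

Running this shows the problem: rebooting ssw1-f1-eqiad takes out both lvs1019 and the backup lvs1020 at once, so there is no failover path and a site depool is required.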
So in both cases, to complete the work we will need to depool eqiad in DNS to stop traffic being sent to the LVS VIPs there, which means the break in connectivity from the LVS hosts to rows E and F won't be an issue.
Given that is the case, we plan to upgrade both Spine switches one after another, so both are done during a single window/depool. The plan is to **depool eqiad at 13:00 UTC**, giving the DNS changes some time to take effect before starting the **first switch upgrade at 14:00 UTC**, then proceeding to the next. Each switch upgrade is expected to take 20-30 minutes to complete, after which we can verify things look good and repool the site.