
Upgrade Eqiad row E-F Spines to JunOS 22.2R3
Closed, ResolvedPublic

Description

Devices: ssw1-e1-eqiad, ssw1-f1-eqiad

When: Thursday, Jun 6th 2024, 12:00 UTC

Downtime: 60 minutes

As discussed in the parent task we need to upgrade all our QFX5120 devices in Ashburn to a more recent JunOS to overcome some bugs and use newer features.

These Spine-layer switches can mostly be upgraded without affecting servers. The top-of-rack switches in each rack have links to both Spines, so we can upgrade one at a time without cutting comms to any rack. However, we do have connections from LVS servers in remote rows (A-D) which land on the Spine switches so the load-balancers can reach backend servers in rows E/F. (The list of hosts/services with backends in these rows is here: P63779)

The LVS servers are connected as follows:

Host    | Interface    | Switch
lvs1017 | enp94s0f0np0 | ssw1-e1-eqiad
lvs1018 | enp94s0f0np0 | ssw1-e1-eqiad
lvs1019 | enp94s0f0np0 | ssw1-f1-eqiad
lvs1020 | enp94s0f0np0 | ssw1-f1-eqiad

lvs1020 is the backup LVS server for all the others. That means we cannot upgrade ssw1-f1-eqiad without a complete outage to all services fronted by lvs1019, as the reboot will also disrupt comms to rows E/F from the only backup, lvs1020.

While it might be possible to upgrade ssw1-e1-eqiad by failing both lvs1017 and lvs1018 over to lvs1020 in advance, the Traffic team advises it is not a good idea to have all the requests those hosts handle re-routed to the single backup host.

Initially we had planned to depool eqiad in DNS, to move services away from the load-balancer layer in eqiad and allow for the disruption in comms. Having discussed it with Service Ops, however, that turned out to be somewhat unrealistic, and a full site switchover would be required instead. A switchover seems excessive to handle a 20-minute outage to two 10G ports in our DC, so we will instead move these links in advance as follows (a rough sketch of the per-host disable/test/re-enable cycle follows the list):

1. Move lvs1017 link

  1. Disable PyBal on lvs1017, shifting traffic it handles to lvs1020
  2. Move single-mode fibre and SFP+ module from ssw1-e1-eqiad xe-0/0/32 to lsw1-e1-eqiad xe-0/0/8
  3. Test connectivity to the row E/F vlans from lvs1017 following the cable move
  4. Re-enable PyBal on lvs1017

2. Move lvs1018 link

  1. Disable PyBal on lvs1018, shifting traffic it handles to lvs1020
  2. Move single-mode fibre and SFP+ module from ssw1-e1-eqiad xe-0/0/33 to lsw1-e1-eqiad xe-0/0/9
  3. Test connectivity to the row E/F vlans from lvs1018 following the cable move
  4. Re-enable PyBal on lvs1018

3. Move lvs1019 link

  1. Disable PyBal on lvs1019, shifting traffic it handles to lvs1020
  2. Move single-mode fibre and SFP+ module from ssw1-f1-eqiad xe-0/0/32 to lsw1-f1-eqiad xe-0/0/8
  3. Test connectivity to the row E/F vlans from lvs1019 following the cable move
  4. Re-enable PyBal on lvs1019
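
The per-host cycle above (disable PyBal, move the fibre, test, re-enable) could be scripted roughly as below. This is only a sketch: the "pybal" systemd unit name and the row E/F gateway addresses are assumptions/placeholders, not taken from this task, and the script is meant to run on the LVS host itself.

```
#!/usr/bin/env python3
"""Rough sketch of one per-LVS move cycle: disable PyBal, wait for the cable
move, verify row E/F reachability, then re-enable PyBal.

Assumptions (not from the task): PyBal runs as the systemd unit "pybal" on
this LVS host, and the gateway addresses below are placeholders for the
real row E/F vlan gateways."""

import subprocess
import sys

# Hypothetical row E/F vlan gateways to test against after the cable move.
ROW_EF_GATEWAYS = ["10.64.130.1", "10.64.131.1"]  # placeholders only


def run(cmd: list[str]) -> bool:
    """Run a command, returning True on exit status 0."""
    return subprocess.run(cmd, check=False).returncode == 0


def pybal(action: str) -> None:
    """Start or stop the (assumed) pybal systemd unit."""
    if not run(["sudo", "systemctl", action, "pybal"]):
        sys.exit(f"systemctl {action} pybal failed")


def row_ef_reachable() -> bool:
    """Ping each gateway a few times; all must answer."""
    return all(run(["ping", "-c", "3", "-W", "1", gw]) for gw in ROW_EF_GATEWAYS)


if __name__ == "__main__":
    pybal("stop")                      # traffic shifts to the backup, lvs1020
    input("Move the fibre/SFP+ now, then press Enter to test... ")
    if not row_ef_reachable():
        sys.exit("Row E/F gateways unreachable; leave PyBal off and investigate")
    pybal("start")                     # traffic returns to this LVS host
```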

In advance we need to pre-provision ports on our Leaf switches as follows to temporarily terminate the fibres from the 3 live LVS hosts (a pre-provisioning sketch follows the table):

LVS     | Temp Leaf Switch | Port     | Copy Config From | Port
lvs1017 | lsw1-e1-eqiad    | xe-0/0/8 | ssw1-e1-eqiad    | xe-0/0/32
lvs1018 | lsw1-e1-eqiad    | xe-0/0/9 | ssw1-e1-eqiad    | xe-0/0/33
lvs1019 | lsw1-f1-eqiad    | xe-0/0/8 | ssw1-f1-eqiad    | xe-0/0/32
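
For the pre-provisioning, something like the following Junos PyEZ sketch could apply the copied port config to one temporary leaf port. The set-style snippet, vlan name ("lvs-vlan") and user are placeholders; in practice the snippet would mirror whatever is configured on the spine port listed in the "Copy Config From" column.

```
#!/usr/bin/env python3
"""Sketch of pre-provisioning one temporary leaf port with Junos PyEZ.

This does not pull the config off the spine automatically; it applies a
hand-copied set-style snippet to the leaf. The vlan name and user below are
placeholders, not taken from the task. Assumes SSH key auth to the switch."""

from jnpr.junos import Device
from jnpr.junos.utils.config import Config

LEAF = "lsw1-e1-eqiad"           # temp leaf switch for lvs1017
PORT = "xe-0/0/8"                # temp port for lvs1017
# Hand-copied equivalent of the spine port config (placeholder vlan name).
SNIPPET = f"""
set interfaces {PORT} description "TEMP lvs1017 link during spine upgrade"
set interfaces {PORT} unit 0 family ethernet-switching interface-mode trunk
set interfaces {PORT} unit 0 family ethernet-switching vlan members lvs-vlan
"""

with Device(host=LEAF, user="netops") as dev:      # placeholder user
    with Config(dev, mode="exclusive") as cu:
        cu.load(SNIPPET, format="set")
        print(cu.diff())                           # review before committing
        cu.commit(comment="pre-provision temp LVS port for spine upgrade")
```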

Once the Spine switches are upgraded we repeat the steps detailed above in reverse, moving the physical connections back to the Spine switches from the temporary lsw ports, and then default the temporary Leaf switch ports.
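
A matching sketch for defaulting one of the temporary Leaf ports afterwards, again with placeholder hostname, port and user, and only to be run once the fibre is confirmed back on the Spine:

```
#!/usr/bin/env python3
"""Sketch of returning one temporary leaf port to an unconfigured state once
the fibre has been moved back to the spine. Hostname, port and user are
placeholders."""

from jnpr.junos import Device
from jnpr.junos.utils.config import Config

LEAF = "lsw1-e1-eqiad"
PORT = "xe-0/0/8"

with Device(host=LEAF, user="netops") as dev:      # placeholder user
    with Config(dev, mode="exclusive") as cu:
        # Remove everything configured under the temp port.
        cu.load(f"delete interfaces {PORT}", format="set")
        print(cu.diff())                           # review before committing
        cu.commit(comment="remove temp LVS port config after spine upgrade")
```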

Event Timeline

cmooney triaged this task as Medium priority.Fri, May 31, 3:31 PM
cmooney created this task.
cmooney renamed this task from Upgrade ssw1-e1-eqiad to JunOS 22.2R3 to Upgrade Eqiad row E-F Spines to JunOS 22.2R3.Fri, May 31, 4:24 PM
cmooney updated the task description.
cmooney updated the task description.
This comment was removed by cmooney.

@Jclark-ctr @VRiley-WMF unfortunately these switch upgrades require us to shift some cables around before/after the upgrade to avoid disrupting services.

We had planned to do it starting at 15:00 UTC tomorrow, Thursday Jun 6th, so 11am local time. Are either of you available at that time to assist? No worries if not; appreciate it's short notice, as we planned this before we knew we'd need on-site assistance. If it doesn't suit, please suggest an alternate time and we can do it then. Thanks.

@cmooney as it turns out, I will be out until June 10th.

No probs, enjoy the time off. I'll see if maybe John can cover or otherwise we can do it after you're back. Thanks!

I spoke to @Jclark-ctr earlier, we will do this commencing at 12:00 UTC tomorrow Thurs 6th Jun.

Icinga downtime and Alertmanager silence (ID=54328f3a-52e5-42cd-bdf1-26ee5617a4d5) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their services with reason: moving lvs1017 link to row E from spine to leaf

lvs1017.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-06T11:56:34Z] <topranks> disabling PyBal on lvs1017 to allow for cable move T366361

Icinga downtime and Alertmanager silence (ID=512f5f90-4832-4c61-b0eb-75b61fcd6f8c) set by cmooney@cumin1002 for 1:30:00 on 18 host(s) and their services with reason: upgrading spine switches eqiad rows e and f

lsw1-e[1-3,5-7]-eqiad.mgmt,lsw1-f[1-3,5-7]-eqiad.mgmt,ssw1-e1-eqiad,ssw1-e1-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad,ssw1-f1-eqiad IPv6,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=76763bfc-4091-4d8a-b3f8-e84d96a9bd49) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their services with reason: moving lvs1018 link to row E from spine to leaf

lvs1018.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-06T12:25:39Z] <topranks> disabling PyBal on lvs1018 to allow for cable move T366361

Mentioned in SAL (#wikimedia-operations) [2024-06-06T12:33:53Z] <topranks> disabling BGP to ssw1-e1-eqiad from cr1-eqiad in advance of upgrade T366361

Mentioned in SAL (#wikimedia-operations) [2024-06-06T12:44:55Z] <topranks> disabling PyBal on lvs1019 to allow for cable move T366361

The first phase of this is complete, ssw1-e1-eqiad has been upgraded.

I am going to pause before completing ssw1-f1-eqiad as some of the output is strange after the upgrade of E1. I believe it is just a change in how things are reported in the upgraded JunOS, but I want to allow a period of stability before we complete the other one just in case.

Mentioned in SAL (#wikimedia-operations) [2024-06-06T14:14:29Z] <topranks> disabling BGP on cr2-eqiad towards ssw1-f1-eqiad prior to upgrade of ssw later T366361

Icinga downtime and Alertmanager silence (ID=2e3e9f53-54b4-4b8d-b9d6-ab280392b41c) set by cmooney@cumin1002 for 2:00:00 on 3 host(s) and their services with reason: upgrading spine switches eqiad rows e and f

ssw1-f1-eqiad,ssw1-f1-eqiad IPv6,ssw1-f1-eqiad.mgmt

Mentioned in SAL (#wikimedia-operations) [2024-06-06T14:56:57Z] <topranks> disable ssw1-f1-eqiad leaf-facing ports in advance of upgrade T366361

Icinga downtime and Alertmanager silence (ID=e84998aa-eea9-43ce-9047-23b408d134b5) set by cmooney@cumin1002 for 1:30:00 on 15 host(s) and their services with reason: upgrading spine switches eqiad rows e and f

lsw1-e[1-3,5-7]-eqiad.mgmt,lsw1-f[1-3,5-7]-eqiad.mgmt,ssw1-f1-eqiad,ssw1-f1-eqiad IPv6,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=8ea52962-5718-4917-aeee-12b979b25d42) set by cmooney@cumin1002 for 1:30:00 on 1 host(s) and their services with reason: upgrading spine switches eqiad rows e and f

ssw1-e1-eqiad.mgmt

Mentioned in SAL (#wikimedia-operations) [2024-06-06T15:29:51Z] <topranks> rebooting ssw1-f1-eqiad to install new JunOS release T366361

ssw1-f1-eqiad has now been upgraded to 22.2R3 also, no issues to report.

Mentioned in SAL (#wikimedia-operations) [2024-06-06T16:50:27Z] <topranks> disabling pybal on lvs1019 to move traffic to lvs1020 in advance of cable move T366361

Icinga downtime and Alertmanager silence (ID=64a8433f-aaa7-4b28-a08b-f75a8455a6c9) set by cmooney@cumin1002 for 0:20:00 on 1 host(s) and their services with reason: moving lvs1019 link back to ssw1-f1-codfw

lvs1019.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-06T17:11:19Z] <topranks> re-enabling pybal on lvs1019 after cable move T366361

Mentioned in SAL (#wikimedia-operations) [2024-06-06T17:11:48Z] <topranks> disabling pybal on lvs1018 to move traffic to lvs1020 in advance of cable move T366361

Icinga downtime and Alertmanager silence (ID=aade48b6-4a47-473b-8a5b-7069b1d13cce) set by cmooney@cumin1002 for 0:20:00 on 1 host(s) and their services with reason: moving lvs1018 link back to ssw1-e1-codfw

lvs1018.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-06T17:23:43Z] <topranks> re-enabling pybal on lvs1018 after cable move T366361

Mentioned in SAL (#wikimedia-operations) [2024-06-06T17:26:12Z] <topranks> disabling pybal on lvs1017 to move traffic to lvs1020 in advance of cable move T366361

Icinga downtime and Alertmanager silence (ID=13da9282-f3eb-4ff3-b775-5f7e24d4f1f9) set by cmooney@cumin1002 for 0:20:00 on 1 host(s) and their services with reason: moving lvs1017 link back to ssw1-e1-codfw

lvs1017.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-06T17:48:30Z] <topranks> re-enabling pybal on lvs1017 after cable move T366361

All work complete. Both switches seem stable after the reboot. Moving the links around worked well, big thanks to @Jclark-ctr on site for the help on that.

Moving the links working out well (which I think is the first time we've done that?) is a big takeaway from this task; glad to hear it went nicely!