
Upgrade Eqiad row E-F Spines to JunOS 22.2R3
Closed, ResolvedPublic

Description

Devices: ssw1-e1-eqiad, ssw1-f1-eqiad

When: Thursday, Jun 6th 2024, 12:00 UTC

Downtime: 60 minutes

As discussed in the parent task we need to upgrade all our QFX5120 devices in Ashburn to a more recent JunOS to overcome some bugs and use newer features.

These Spine-layer switches can mostly be upgraded without affecting servers. The top-of-rack switches in each rack have links to both Spines, so we can upgrade one at a time without cutting comms to any rack. However, we do have connections from LVS servers in remote rows (A-D) which land on the Spine switches so the load-balancers can reach backend servers in rows E/F. (The list of hosts/services with backends in these rows is here: P63779)

The LVS servers are connected as follows:

Host    | Interface    | Switch
lvs1017 | enp94s0f0np0 | ssw1-e1-eqiad
lvs1018 | enp94s0f0np0 | ssw1-e1-eqiad
lvs1019 | enp94s0f0np0 | ssw1-f1-eqiad
lvs1020 | enp94s0f0np0 | ssw1-f1-eqiad

lvs1020 is the backup LVS server for all the others. That means we cannot upgrade ssw1-f1-eqiad without a complete outage to all services fronted by lvs1019, as the reboot will also disrupt comms to rows E/F from the only backup, lvs1020.

While it might be possible to upgrade ssw1-e1-eqiad by failing both lvs1017 and lvs1018 over to lvs1020 in advance, the Traffic team advises it is not a good idea to have all the requests those hosts handle re-routed to the single backup host.

Initially we had planned to depool eqiad in DNS, to move services away from the load-balancer layer in eqiad and allow for the disruption in comms. Having discussed it with Service Ops, however, that turned out to be somewhat unrealistic, and a full site switchover would be required instead. A switchover seems excessive to handle a 20-minute outage to two 10G ports in our DC, so we will instead move these links in advance as follows (a rough sketch of the per-host disable/test/re-enable cycle follows the list):

1. Move lvs1017 link

  1. Disable PyBal on lvs1017, shifting traffic it handles to lvs1020
  2. Move single-mode fibre and SFP+ module from ssw1-e1-eqiad xe-0/0/32 to lsw1-e1-eqiad xe-0/0/8
  3. Test connectivity to the row E/F vlans from lvs1017 following the cable move
  4. Re-enable PyBal on lvs1017

2. Move lvs1018 link

  1. Disable PyBal on lvs1018, shifting traffic it handles to lvs1020
  2. Move single-mode fibre and SFP+ module from ssw1-e1-eqiad xe-0/0/33 to lsw1-e1-eqiad xe-0/0/9
  3. Test connectivity to the row E/F vlans from lvs1018 following the cable move
  4. Re-enable PyBal on lvs1018

3. Move lvs1019 link

  1. Disable PyBal on lvs1019, shifting traffic it handles to lvs1020
  2. Move single-mode fibre and SFP+ module from ssw1-f1-eqiad xe-0/0/32 to lsw1-f1-eqiad xe-0/0/8
  3. Test connectivity to the row E/F vlans from lvs1019 following the cable move
  4. Re-enable PyBal on lvs1019
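
The per-host cycle above (disable PyBal, move the fibre, test, re-enable) could be scripted roughly as below. This is only a sketch: the "pybal" systemd unit name and the row E/F gateway addresses are assumptions/placeholders, not taken from this task, and the script is meant to run on the LVS host itself.

```
#!/usr/bin/env python3
"""Rough sketch of one per-LVS move cycle: disable PyBal, wait for the cable
move, verify row E/F reachability, then re-enable PyBal.

Assumptions (not from the task): PyBal runs as the systemd unit "pybal" on
this LVS host, and the gateway addresses below are placeholders for the
real row E/F vlan gateways."""

import subprocess
import sys

# Hypothetical row E/F vlan gateways to test against after the cable move.
ROW_EF_GATEWAYS = ["10.64.130.1", "10.64.131.1"]  # placeholders only


def run(cmd: list[str]) -> bool:
    """Run a command, returning True on exit status 0."""
    return subprocess.run(cmd, check=False).returncode == 0


def pybal(action: str) -> None:
    """Start or stop the (assumed) pybal systemd unit."""
    if not run(["sudo", "systemctl", action, "pybal"]):
        sys.exit(f"systemctl {action} pybal failed")


def row_ef_reachable() -> bool:
    """Ping each gateway a few times; all must answer."""
    return all(run(["ping", "-c", "3", "-W", "1", gw]) for gw in ROW_EF_GATEWAYS)


if __name__ == "__main__":
    pybal("stop")                      # traffic shifts to the backup, lvs1020
    input("Move the fibre/SFP+ now, then press Enter to test... ")
    if not row_ef_reachable():
        sys.exit("Row E/F gateways unreachable; leave PyBal off and investigate")
    pybal("start")                     # traffic returns to this LVS host
```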

In advance we need to pre-provision ports on our Leaf switches as follows to temporarily terminate the fibres from the 3 live LVS hosts (a pre-provisioning sketch follows the table):

LVS     | Temp Leaf Switch | Port     | Copy Config From | Port
lvs1017 | lsw1-e1-eqiad    | xe-0/0/8 | ssw1-e1-eqiad    | xe-0/0/32
lvs1018 | lsw1-e1-eqiad    | xe-0/0/9 | ssw1-e1-eqiad    | xe-0/0/33
lvs1019 | lsw1-f1-eqiad    | xe-0/0/8 | ssw1-f1-eqiad    | xe-0/0/32
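
For the pre-provisioning, something like the following Junos PyEZ sketch could apply the copied port config to one temporary leaf port. The set-style snippet, vlan name ("lvs-vlan") and user are placeholders; in practice the snippet would mirror whatever is configured on the spine port listed in the "Copy Config From" column.

```
#!/usr/bin/env python3
"""Sketch of pre-provisioning one temporary leaf port with Junos PyEZ.

This does not pull the config off the spine automatically; it applies a
hand-copied set-style snippet to the leaf. The vlan name and user below are
placeholders, not taken from the task. Assumes SSH key auth to the switch."""

from jnpr.junos import Device
from jnpr.junos.utils.config import Config

LEAF = "lsw1-e1-eqiad"           # temp leaf switch for lvs1017
PORT = "xe-0/0/8"                # temp port for lvs1017
# Hand-copied equivalent of the spine port config (placeholder vlan name).
SNIPPET = f"""
set interfaces {PORT} description "TEMP lvs1017 link during spine upgrade"
set interfaces {PORT} unit 0 family ethernet-switching interface-mode trunk
set interfaces {PORT} unit 0 family ethernet-switching vlan members lvs-vlan
"""

with Device(host=LEAF, user="netops") as dev:      # placeholder user
    with Config(dev, mode="exclusive") as cu:
        cu.load(SNIPPET, format="set")
        print(cu.diff())                           # review before committing
        cu.commit(comment="pre-provision temp LVS port for spine upgrade")
```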

Once the Spine switches are upgraded we repeat the steps detailed above in reverse, moving the physical connections back to the Spine switches from the temporary lsw ports, and then default the temporary Leaf switch ports.
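
A matching sketch for defaulting one of the temporary Leaf ports afterwards, again with placeholder hostname, port and user, and only to be run once the fibre is confirmed back on the Spine:

```
#!/usr/bin/env python3
"""Sketch of returning one temporary leaf port to an unconfigured state once
the fibre has been moved back to the spine. Hostname, port and user are
placeholders."""

from jnpr.junos import Device
from jnpr.junos.utils.config import Config

LEAF = "lsw1-e1-eqiad"
PORT = "xe-0/0/8"

with Device(host=LEAF, user="netops") as dev:      # placeholder user
    with Config(dev, mode="exclusive") as cu:
        # Remove everything configured under the temp port.
        cu.load(f"delete interfaces {PORT}", format="set")
        print(cu.diff())                           # review before committing
        cu.commit(comment="remove temp LVS port config after spine upgrade")
```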

Event Timeline

cmooney triaged this task as Medium priority.Fri, May 31, 3:31 PM
cmooney created this task.
cmooney renamed this task from Upgrade ssw1-e1-eqiad to JunOS 22.2R3 to Upgrade Eqiad row E-F Spines to JunOS 22.2R3.Fri, May 31, 4:24 PM
cmooney updated the task description.
cmooney updated the task description.
This comment was removed by cmooney.

@Jclark-ctr @VRiley-WMF unfortunately these switch upgrades require us to shift some cables around before/after the upgrade to avoid disrupting services.

We had planned to do it starting at 15:00 UTC tomorrow, Thursday Jun 6th, so 11am local time. Are either of you available at that time to assist? No worries if not; appreciate it's short notice, as we planned this before we knew we'd need on-site assistance. If it doesn't suit, please suggest an alternate time and we can do it then. Thanks.

@cmooney as it turns out, I will be out until June 10th.

No probs, enjoy the time off. I'll see if maybe John can cover or otherwise we can do it after you're back. Thanks!

I spoke to @Jclark-ctr earlier, we will do this commencing at 12:00 UTC tomorrow Thurs 6th Jun.

Icinga downtime and Alertmanager silence (ID=54328f3a-52e5-42cd-bdf1-26ee5617a4d5) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their services with reason: moving lvs1017 link to row E from spine to leaf

lvs1017.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-06T11:56:34Z] <topranks> disabling PyBal on lvs1017 to allow for cable move T366361

Icinga downtime and Alertmanager silence (ID=512f5f90-4832-4c61-b0eb-75b61fcd6f8c) set by cmooney@cumin1002 for 1:30:00 on 18 host(s) and their services with reason: upgrading spine switches eqiad rows e and f

lsw1-e[1-3,5-7]-eqiad.mgmt,lsw1-f[1-3,5-7]-eqiad.mgmt,ssw1-e1-eqiad,ssw1-e1-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad,ssw1-f1-eqiad IPv6,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=76763bfc-4091-4d8a-b3f8-e84d96a9bd49) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their services with reason: moving lvs1018 link to row E from spine to leaf

lvs1018.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-06T12:25:39Z] <topranks> disabling PyBal on lvs1018 to allow for cable move T366361

Mentioned in SAL (#wikimedia-operations) [2024-06-06T12:33:53Z] <topranks> disabling BGP to ssw1-e1-eqiad from cr1-eqiad in advance of upgrade T366361

Mentioned in SAL (#wikimedia-operations) [2024-06-06T12:44:55Z] <topranks> disabling PyBal on lvs1019 to allow for cable move T366361

The first phase of this is complete, ssw1-e1-eqiad has been upgraded.

I am going to pause before completing ssw1-f1-eqiad as some of the output is strange after the upgrade of E1. I believe it is just a change in how things are reported in the upgraded JunOS, but I want to allow a period of stability before we complete the other one just in case.

Mentioned in SAL (#wikimedia-operations) [2024-06-06T14:14:29Z] <topranks> disabling BGP on cr2-eqiad towards ssw1-f1-eqiad prior to upgrade of ssw later T366361

Icinga downtime and Alertmanager silence (ID=2e3e9f53-54b4-4b8d-b9d6-ab280392b41c) set by cmooney@cumin1002 for 2:00:00 on 3 host(s) and their services with reason: upgrading spine switches eqiad rows e and f

ssw1-f1-eqiad,ssw1-f1-eqiad IPv6,ssw1-f1-eqiad.mgmt

Mentioned in SAL (#wikimedia-operations) [2024-06-06T14:56:57Z] <topranks> disable ssw1-f1-eqiad leaf-facing ports in advance of upgrade T366361

Icinga downtime and Alertmanager silence (ID=e84998aa-eea9-43ce-9047-23b408d134b5) set by cmooney@cumin1002 for 1:30:00 on 15 host(s) and their services with reason: upgrading spine switches eqiad rows e and f

lsw1-e[1-3,5-7]-eqiad.mgmt,lsw1-f[1-3,5-7]-eqiad.mgmt,ssw1-f1-eqiad,ssw1-f1-eqiad IPv6,ssw1-f1-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=8ea52962-5718-4917-aeee-12b979b25d42) set by cmooney@cumin1002 for 1:30:00 on 1 host(s) and their services with reason: upgrading spine switches eqiad rows e and f

ssw1-e1-eqiad.mgmt

Mentioned in SAL (#wikimedia-operations) [2024-06-06T15:29:51Z] <topranks> rebooting ssw1-f1-eqiad to install new JunOS release T366361

ssw1-f1-eqiad has now been upgraded to 22.2R3 also, no issues to report.

Mentioned in SAL (#wikimedia-operations) [2024-06-06T16:50:27Z] <topranks> disabling pybal on lvs1019 to move traffic to lvs1020 in advance of cable move T366361

Icinga downtime and Alertmanager silence (ID=64a8433f-aaa7-4b28-a08b-f75a8455a6c9) set by cmooney@cumin1002 for 0:20:00 on 1 host(s) and their services with reason: moving lvs1019 link back to ssw1-f1-codfw

lvs1019.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-06T17:11:19Z] <topranks> re-enabling pybal on lvs1019 after cable move T366361

Mentioned in SAL (#wikimedia-operations) [2024-06-06T17:11:48Z] <topranks> disabling pybal on lvs1018 to move traffic to lvs1020 in advance of cable move T366361

Icinga downtime and Alertmanager silence (ID=aade48b6-4a47-473b-8a5b-7069b1d13cce) set by cmooney@cumin1002 for 0:20:00 on 1 host(s) and their services with reason: moving lvs1018 link back to ssw1-e1-codfw

lvs1018.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-06T17:23:43Z] <topranks> re-enabling pybal on lvs1018 after cable move T366361

Mentioned in SAL (#wikimedia-operations) [2024-06-06T17:26:12Z] <topranks> disabling pybal on lvs1017 to move traffic to lvs1020 in advance of cable move T366361

Icinga downtime and Alertmanager silence (ID=13da9282-f3eb-4ff3-b775-5f7e24d4f1f9) set by cmooney@cumin1002 for 0:20:00 on 1 host(s) and their services with reason: moving lvs1017 link back to ssw1-e1-codfw

lvs1017.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-06T17:48:30Z] <topranks> re-enabling pybal on lvs1017 after cable move T366361

All work complete. Both switches seem stable after the reboot. Moving the links around worked well, big thanks to @Jclark-ctr on site for the help on that.

Moving the links working out well (which I think is the first time we've done that?) is a big takeaway from this task; glad to hear it went nicely!