
Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+
Open, Low, Public

Description

cloudsw1-c8-eqiad and cloudsw1-d5-eqiad are running JunOS 18.4R2-S4.10.

Opening this task to track upgrading them to JunOS 20+ to bring them into line with the other cloudsw devices (which are on 20.2 and 20.4).

The plan is to upgrade the switches one at a time. The 'cloudsw2' device in each of these racks is daisy-chained from the cloudsw1 device in the same rack, so when we upgrade a cloudsw1 all hosts in that rack will be offline for the duration of the work. Connectivity to hosts in other racks should remain up throughout.

Each device upgrade should take in the region of 20-30 minutes, during which all hosts in the rack will suffer a complete network outage. We should therefore do it in a maintenance window, and depool, prep or otherwise do whatever is required to minimize the impact. We should also make sure the active cloudnet and cloudgw hosts are manually failed over in advance.
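For reference, a rough sketch of the per-switch Junos steps (the image filename and exact options below are illustrative assumptions, not the confirmed procedure for these switches):

```
# on each cloudsw, roughly; the image name/path is a placeholder
show version                                  # confirm the current 18.4R2-S4.10 install
request system storage cleanup                # free space before copying the new image
request system snapshot                       # recovery point before upgrading
request system software add /var/tmp/<junos-20.x-image>.tgz no-copy reboot
# after the switch comes back (the 20-30 minute window):
show version                                  # confirm the new release booted
show chassis alarms                           # check for post-upgrade alarms
show interfaces terse | match "xe-|et-"       # host-facing links back up
show bgp summary                              # routing sessions re-established
```

Any depooling and the cloudnet/cloudgw failover would of course happen before the `request system software add` step.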

The hosts that will be affected are as follows:

Rack C8 (also including hosts in row B which connect via this switch):

cloudbackup1003
cloudcephmon1001
cloudcephmon1003
cloudcephosd1006
cloudcephosd1007
cloudcephosd1008
cloudcephosd1009
cloudcephosd1016
cloudcephosd1017
cloudcephosd1018
cloudcephosd1021
cloudcephosd1022
cloudgw1001
cloudnet1005
cloudlb1001
cloudvirt1025
cloudvirt1026
cloudvirt1027
cloudvirt1031
cloudvirt1032
cloudvirt1033
cloudvirt1034
cloudvirt1035
cloudvirt-wdqs1001
cloudvirt-wdqs1002
cloudvirt-wdqs1003

Rack D5:

cloudbackup1004
cloudcephmon1002 - no action needed (HA)
cloudcephosd1011 - to drain - ready
cloudcephosd1012 - to drain - ready
cloudcephosd1013 - to drain - ready
cloudcephosd1014 - to drain - ready
cloudcephosd1015 - to drain - ready
cloudcephosd1019 - to drain
cloudcephosd1020 - to drain
cloudcephosd1023 - to drain
cloudcephosd1024 - to drain
cloudgw1002 - no action needed (HA)
cloudnet1006 - no action needed (HA)
cloudlb1002
cloudvirt1028
cloudvirt1029
cloudvirt1030
cloudvirt1036
cloudvirt1037
cloudvirt1038
cloudvirt1039
cloudvirt1040
cloudvirt1041
cloudvirt1042
cloudvirt1043
cloudvirt1044
cloudvirt1045
cloudvirt1046
cloudvirt1047
cloudvirtlocal1001

We need to move the VMs running on these cloudvirts to other hypervisors. We can't move all of them, so we should move only the sensitive ones; the rest should be able to come back once the network is restored.

List of VMs to move to a different rack:
TBD
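
Once that list is filled in, the move itself is ordinary live migration; a generic openstackclient sketch (not the WMCS cookbook workflow; flag names vary between client versions, and the target hypervisor is a placeholder):

```
# list instances on one of the affected hypervisors (admin credentials assumed)
openstack server list --all-projects --host cloudvirt1028

# live-migrate a sensitive instance to a hypervisor in another rack
# (<target-cloudvirt> is a placeholder; pick one outside racks C8/D5)
openstack server migrate --live-migration --host <target-cloudvirt> <instance-uuid>

# confirm status and final placement
openstack server show <instance-uuid> -c status -c OS-EXT-SRV-ATTR:host
```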

Event Timeline

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:34:26Z] <wm-bot2> rebooting all the workers of tools k8s cluster (64 nodes) (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:35:47Z] <wm-bot2> rebooted k8s node tools-k8s-worker-88 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:37:09Z] <wm-bot2> rebooted k8s node tools-k8s-worker-87 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:38:21Z] <wm-bot2> rebooted k8s node tools-k8s-worker-86 (T316544) - cookbook ran by dcaro@vulcanus

We discovered today that our ceph setup can't handle the amount of nodes down that this operation requires. Please hold off.

I don't think the issue is the number of nodes being down; the cluster should be more than fine without them. We'll have to investigate the cause, probably a combination of load and Ceph trying to replicate all the data at once, making operations too slow for NFS to behave nicely.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:47:03Z] <wm-bot2> rebooted k8s node tools-k8s-worker-85 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:48:25Z] <wm-bot2> rebooted k8s node tools-k8s-worker-84 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:57:07Z] <wm-bot2> rebooted k8s node tools-k8s-worker-83 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:05:27Z] <wm-bot2> rebooted k8s node tools-k8s-worker-82 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:06:52Z] <wm-bot2> rebooted k8s node tools-k8s-worker-81 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:08:21Z] <wm-bot2> rebooted k8s node tools-k8s-worker-80 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:15:13Z] <wm-bot2> rebooted k8s node tools-k8s-worker-77 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:20:59Z] <wm-bot2> rebooted k8s node tools-k8s-worker-76 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:22:20Z] <wm-bot2> rebooted k8s node tools-k8s-worker-75 (T316544) - cookbook ran by dcaro@vulcanus

> We discovered today that our ceph setup can't handle the amount of nodes down that this operation requires. Please hold off.

> I don't think the issue is the number of nodes being down; the cluster should be more than fine without them. We'll have to investigate the cause, probably a combination of load and Ceph trying to replicate all the data at once, making operations too slow for NFS to behave nicely.

Ok, thanks for the info. Will reschedule for another date.

We could also shut the ceph-facing ports down more slowly, if that would be gentler on the cluster (e.g. one every 20 minutes over a few hours).

Of course the switch itself is just a piece of hardware; it could die at any point and cause this kind of outage. So it's better if the cluster, and the things that depend on it, can deal with the sudden change.
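
If we did go with the staggered approach, the per-port mechanics would be roughly the following, repeated every ~20 minutes for each ceph-facing port (the interface name is illustrative, and in practice the change would presumably go through Homer rather than ad-hoc CLI):

```
configure
set interfaces xe-0/0/10 disable              # placeholder ceph-facing port
commit comment "T316544: staggered shutdown of ceph-facing port"
exit
```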

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:28:06Z] <wm-bot2> rebooted k8s node tools-k8s-worker-74 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:34:17Z] <wm-bot2> rebooted k8s node tools-k8s-worker-73 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:39:44Z] <wm-bot2> rebooted k8s node tools-k8s-worker-72 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:42:44Z] <wm-bot2> rebooted k8s node tools-k8s-worker-71 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:44:19Z] <wm-bot2> rebooted k8s node tools-k8s-worker-70 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:49:57Z] <wm-bot2> rebooted k8s node tools-k8s-worker-69 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:56:07Z] <wm-bot2> rebooted k8s node tools-k8s-worker-68 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T19:04:35Z] <wm-bot2> rebooted k8s node tools-k8s-worker-67 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:24:13Z] <wm-bot2> rebooted k8s node tools-k8s-worker-30 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:32:27Z] <wm-bot2> rebooted k8s node tools-k8s-worker-31 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:36:44Z] <wm-bot2> rebooted k8s node tools-k8s-worker-32 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:42:10Z] <wm-bot2> rebooted k8s node tools-k8s-worker-33 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:47:34Z] <wm-bot2> rebooted k8s node tools-k8s-worker-34 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:48:47Z] <wm-bot2> rebooted k8s node tools-k8s-worker-35 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:49:47Z] <wm-bot2> rebooted k8s node tools-k8s-worker-36 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:50:49Z] <wm-bot2> rebooted k8s node tools-k8s-worker-37 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:52:04Z] <wm-bot2> rebooted k8s node tools-k8s-worker-38 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:58:16Z] <wm-bot2> rebooted k8s node tools-k8s-worker-39 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:03:55Z] <wm-bot2> rebooted k8s node tools-k8s-worker-40 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:09:49Z] <wm-bot2> rebooted k8s node tools-k8s-worker-41 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:12:39Z] <wm-bot2> rebooted k8s node tools-k8s-worker-42 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:13:54Z] <wm-bot2> rebooted k8s node tools-k8s-worker-43 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:15:13Z] <wm-bot2> rebooted k8s node tools-k8s-worker-44 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:16:26Z] <wm-bot2> rebooted k8s node tools-k8s-worker-45 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:37:30Z] <wm-bot2> rebooted k8s node tools-k8s-worker-47 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:38:53Z] <wm-bot2> rebooted k8s node tools-k8s-worker-48 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:40:04Z] <wm-bot2> rebooted k8s node tools-k8s-worker-49 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:41:01Z] <wm-bot2> rebooted k8s node tools-k8s-worker-50 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:42:00Z] <wm-bot2> rebooted k8s node tools-k8s-worker-51 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:44:40Z] <wm-bot2> rebooted k8s node tools-k8s-worker-52 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:47:08Z] <wm-bot2> rebooted k8s node tools-k8s-worker-53 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:49:28Z] <wm-bot2> rebooted k8s node tools-k8s-worker-54 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:50:38Z] <wm-bot2> rebooted k8s node tools-k8s-worker-55 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:51:46Z] <wm-bot2> rebooted k8s node tools-k8s-worker-56 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:52:43Z] <wm-bot2> rebooted k8s node tools-k8s-worker-57 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:54:47Z] <wm-bot2> rebooted k8s node tools-k8s-worker-58 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:55:58Z] <wm-bot2> rebooted k8s node tools-k8s-worker-59 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:58:22Z] <wm-bot2> rebooted k8s node tools-k8s-worker-60 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:02:04Z] <wm-bot2> rebooted k8s node tools-k8s-worker-61 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:04:00Z] <wm-bot2> rebooted k8s node tools-k8s-worker-62 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:06:35Z] <wm-bot2> rebooted k8s node tools-k8s-worker-64 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:07:50Z] <wm-bot2> rebooted k8s node tools-k8s-worker-65 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:09:37Z] <wm-bot2> rebooted k8s node tools-k8s-worker-66 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud) [2023-05-16T08:07:54Z] <dcaro> reboot tools-sgebastion-10 (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:29:01Z] <wm-bot2> rebooted k8s node tools-k8s-worker-76 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:42:52Z] <wm-bot2> rebooted k8s node tools-k8s-worker-69 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:45:39Z] <wm-bot2> rebooted k8s node tools-k8s-worker-48 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:46:34Z] <wm-bot2> rebooted k8s node tools-k8s-worker-47 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:54:44Z] <wm-bot2> rebooted k8s node tools-k8s-worker-72 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:03:16Z] <wm-bot2> rebooted k8s node tools-k8s-worker-66 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:10:26Z] <wm-bot2> rebooted k8s node tools-k8s-worker-70 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:17:43Z] <wm-bot2> rebooted k8s node tools-k8s-worker-61 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:25:52Z] <wm-bot2> rebooted k8s node tools-k8s-worker-74 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:32:55Z] <wm-bot2> rebooted k8s node tools-k8s-worker-75 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:33:56Z] <wm-bot2> rebooted k8s node tools-k8s-worker-64 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:49:15Z] <wm-bot2> rebooted k8s node tools-k8s-worker-55 (T316544) - cookbook ran by dcaro@vulcanus

Change 920644 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: wmf-auto-restart: exclude NFS filesystems

https://gerrit.wikimedia.org/r/920644

Change 920648 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] profile::auto_restarts: allow the systemd timer to not be installed

https://gerrit.wikimedia.org/r/920648

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T12:48:46Z] <wm-bot2> rebooted k8s node tools-k8s-worker-71 (T316544) - cookbook ran by dcaro@vulcanus

Change 947715 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Enable sftp-server

https://gerrit.wikimedia.org/r/947715

Change 953963 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Enable GNMI on cloudsw

https://gerrit.wikimedia.org/r/953963

I'm going to start draining nodes from D5:

cloudcephosd1011
cloudcephosd1012
cloudcephosd1013
cloudcephosd1014
cloudcephosd1015
cloudcephosd1019
cloudcephosd1020
cloudcephosd1023
cloudcephosd1024
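
For reference, the checks that go alongside the drain; the drain itself runs via the wmcs.ceph.osd.drain_node cookbook (its exact arguments are omitted here, since they aren't spelled out in this task):

```
ceph df              # overall raw usage; mean capacity needs to stay well under 90%
ceph osd df tree     # per-OSD / per-host utilisation, to spot skew while backfilling
ceph status          # watch recovery and backfill progress between nodes
```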

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T08:44:38Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node (T316544)

> I'm going to start draining nodes from D5:

@dcaro that's great, thanks! Let me know when you have that done and we can coordinate with the rest of the team on a time for the upgrade/reboot.

In terms of the other nodes in rack D5, we have the following cloudvirts and should consider moving instances off them ahead of the upgrade:

cloudvirt1028
cloudvirt1029
cloudvirt1030
cloudvirt1036
cloudvirt1037
cloudvirt1038
cloudvirt1039
cloudvirt1040
cloudvirt1041
cloudvirt1042
cloudvirt1043
cloudvirt1044
cloudvirt1045
cloudvirt1046
cloudvirt1047
cloudvirtlocal1001

We also have the following nodes, which (once T346891 is complete) I believe aren't an issue, as all of them have partners in other racks that can provide service during the outage:

cloudbackup1004
cloudcontrol1006
cloudgw1002
cloudlb1002
cloudnet1006
cloudservices1005

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T11:43:04Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T11:55:00Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T11:55:10Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T11:55:16Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T13:37:32Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T316544)

aborrero updated the task description.

Unfortunately, it seems that the cluster has grown in the last few days :/, as draining the remaining 21 OSD daemons would push it over 90% mean capacity (which would stop writes).

I'll undrain to check whether the actual usage has increased and try to reduce it, but if we are over ~67% usage we will not be able to take down a whole rack.
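
For what it's worth, the arithmetic behind that ~67% figure, under the assumption (not measured here) that OSD capacity is split roughly evenly across three racks:

```
# with one rack of OSDs down, surviving capacity is ~2/3 of the total, so:
#   usage < 2/3       (~67%)  is needed just to fit the data on the remaining OSDs
#   usage < 0.9 * 2/3 (~60%)  is needed to stay under the 90% full ratio that stops writes
# current mean usage can be read from:
ceph df
ceph osd df tree
```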