Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+
Closed, Resolved (Public)

Description

cloudsw1-c8-eqiad and cloudsw1-d5-eqiad are running JunOS 18.4R2-S4.10.

Opening this task to track upgrading them to JunOS 20+ to bring them into line with the other cloudsw devices (which are on 20.2 and 20.4).

The plan is to upgrade the switches one at a time. The 'cloudsw2' device in each of these racks is daisy-chained from the cloudsw1 device in the same rack, so while a cloudsw1 is being upgraded all hosts in that rack will be offline for the duration of the work. Connectivity to hosts in other racks should remain up throughout.
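
To make the blast radius concrete, here is a minimal sketch that works out which devices lose connectivity when one switch is taken down. It assumes a single-uplink topology. The `UPLINKS` map, the full `cloudsw2-*` device names and the `example-host-*` entries are hypothetical placeholders, not taken from the real inventory.

```python
# Hypothetical uplink map: each device points at the single switch it
# depends on. Names other than the two cloudsw1 devices are placeholders.
UPLINKS = {
    "cloudsw2-c8-eqiad": "cloudsw1-c8-eqiad",
    "cloudsw2-d5-eqiad": "cloudsw1-d5-eqiad",
    "example-host-c8-a": "cloudsw1-c8-eqiad",
    "example-host-c8-b": "cloudsw2-c8-eqiad",
    "example-host-d5-a": "cloudsw1-d5-eqiad",
}


def affected_by(switch: str, uplinks: dict[str, str]) -> set[str]:
    """Return every device whose only path upstream runs through `switch`."""
    affected = set()
    for device in uplinks:
        hop = device
        seen = set()
        # Walk up the chain of uplinks; stop on a loop or at the top.
        while hop in uplinks and hop not in seen:
            seen.add(hop)
            hop = uplinks[hop]
            if hop == switch:
                affected.add(device)
                break
    return affected


if __name__ == "__main__":
    for name in sorted(affected_by("cloudsw1-c8-eqiad", UPLINKS)):
        print(name)
```

Because the daisy-chained switch appears in the map like any other device, hosts behind it are picked up by the same walk, while devices in the other rack are left out.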

The upgrade of each device should take around 20-30 minutes in total, during which all hosts in the rack will suffer a complete network outage. We should therefore do it in a maintenance window, and depool, prep or otherwise do whatever is required to minimize the impact. We should also manually fail over the active cloudnet and cloudgw hosts in advance.

The hosts that will be affected are as follows:

Rack C8 (also including hosts in row B which connect via this switch):

T374043: Drain C8 rack

Rack D5 (done):

T371878: [network,D5] reboot cloudsw-d5

cloudvirts

We need to move the VMs running on the cloudvirts to other hypervisors, but we can't move all of them. We should move only the sensitive ones; the rest should come back once the network is restored.

List of VMs to move to a different rack:
TBD

Event Timeline

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:44:19Z] <wm-bot2> rebooted k8s node tools-k8s-worker-70 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:49:57Z] <wm-bot2> rebooted k8s node tools-k8s-worker-69 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:56:07Z] <wm-bot2> rebooted k8s node tools-k8s-worker-68 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T19:04:35Z] <wm-bot2> rebooted k8s node tools-k8s-worker-67 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:24:13Z] <wm-bot2> rebooted k8s node tools-k8s-worker-30 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:32:27Z] <wm-bot2> rebooted k8s node tools-k8s-worker-31 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:36:44Z] <wm-bot2> rebooted k8s node tools-k8s-worker-32 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:42:10Z] <wm-bot2> rebooted k8s node tools-k8s-worker-33 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:47:34Z] <wm-bot2> rebooted k8s node tools-k8s-worker-34 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:48:47Z] <wm-bot2> rebooted k8s node tools-k8s-worker-35 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:49:47Z] <wm-bot2> rebooted k8s node tools-k8s-worker-36 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:50:49Z] <wm-bot2> rebooted k8s node tools-k8s-worker-37 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:52:04Z] <wm-bot2> rebooted k8s node tools-k8s-worker-38 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:58:16Z] <wm-bot2> rebooted k8s node tools-k8s-worker-39 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:03:55Z] <wm-bot2> rebooted k8s node tools-k8s-worker-40 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:09:49Z] <wm-bot2> rebooted k8s node tools-k8s-worker-41 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:12:39Z] <wm-bot2> rebooted k8s node tools-k8s-worker-42 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:13:54Z] <wm-bot2> rebooted k8s node tools-k8s-worker-43 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:15:13Z] <wm-bot2> rebooted k8s node tools-k8s-worker-44 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:16:26Z] <wm-bot2> rebooted k8s node tools-k8s-worker-45 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:37:30Z] <wm-bot2> rebooted k8s node tools-k8s-worker-47 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:38:53Z] <wm-bot2> rebooted k8s node tools-k8s-worker-48 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:40:04Z] <wm-bot2> rebooted k8s node tools-k8s-worker-49 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:41:01Z] <wm-bot2> rebooted k8s node tools-k8s-worker-50 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:42:00Z] <wm-bot2> rebooted k8s node tools-k8s-worker-51 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:44:40Z] <wm-bot2> rebooted k8s node tools-k8s-worker-52 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:47:08Z] <wm-bot2> rebooted k8s node tools-k8s-worker-53 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:49:28Z] <wm-bot2> rebooted k8s node tools-k8s-worker-54 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:50:38Z] <wm-bot2> rebooted k8s node tools-k8s-worker-55 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:51:46Z] <wm-bot2> rebooted k8s node tools-k8s-worker-56 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:52:43Z] <wm-bot2> rebooted k8s node tools-k8s-worker-57 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:54:47Z] <wm-bot2> rebooted k8s node tools-k8s-worker-58 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:55:58Z] <wm-bot2> rebooted k8s node tools-k8s-worker-59 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:58:22Z] <wm-bot2> rebooted k8s node tools-k8s-worker-60 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:02:04Z] <wm-bot2> rebooted k8s node tools-k8s-worker-61 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:04:00Z] <wm-bot2> rebooted k8s node tools-k8s-worker-62 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:06:35Z] <wm-bot2> rebooted k8s node tools-k8s-worker-64 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:07:50Z] <wm-bot2> rebooted k8s node tools-k8s-worker-65 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:09:37Z] <wm-bot2> rebooted k8s node tools-k8s-worker-66 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud) [2023-05-16T08:07:54Z] <dcaro> reboot tools-sgebastion-10 (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:29:01Z] <wm-bot2> rebooted k8s node tools-k8s-worker-76 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:42:52Z] <wm-bot2> rebooted k8s node tools-k8s-worker-69 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:45:39Z] <wm-bot2> rebooted k8s node tools-k8s-worker-48 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:46:34Z] <wm-bot2> rebooted k8s node tools-k8s-worker-47 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:54:44Z] <wm-bot2> rebooted k8s node tools-k8s-worker-72 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:03:16Z] <wm-bot2> rebooted k8s node tools-k8s-worker-66 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:10:26Z] <wm-bot2> rebooted k8s node tools-k8s-worker-70 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:17:43Z] <wm-bot2> rebooted k8s node tools-k8s-worker-61 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:25:52Z] <wm-bot2> rebooted k8s node tools-k8s-worker-74 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:32:55Z] <wm-bot2> rebooted k8s node tools-k8s-worker-75 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:33:56Z] <wm-bot2> rebooted k8s node tools-k8s-worker-64 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:49:15Z] <wm-bot2> rebooted k8s node tools-k8s-worker-55 (T316544) - cookbook ran by dcaro@vulcanus

Change 920644 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: wmf-auto-restart: exclude NFS filesystems

https://gerrit.wikimedia.org/r/920644

Change 920648 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] profile::auto_restarts: allow the systemd timer to not be installed

https://gerrit.wikimedia.org/r/920648

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T12:48:46Z] <wm-bot2> rebooted k8s node tools-k8s-worker-71 (T316544) - cookbook ran by dcaro@vulcanus

Change 947715 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Enable sftp-server

https://gerrit.wikimedia.org/r/947715

Change 953963 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Enable GNMI on cloudsw

https://gerrit.wikimedia.org/r/953963

I'm going to start draining nodes from D5:

cloudcephosd1011
cloudcephosd1012
cloudcephosd1013
cloudcephosd1014
cloudcephosd1015
cloudcephosd1019
cloudcephosd1020
cloudcephosd1023
cloudcephosd1024

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T08:44:38Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node (T316544)

> I'm going to start draining nodes from D5:

@dcaro that's great thanks! Let me know when you have that done and we can co-ordinate with the rest of the team on a time for the upgrade/reboot.

In terms of the other nodes in rack D5 we have the following cloudvirts, and should consider possibly moving instances ahead of the upgrade:

cloudvirt1028
cloudvirt1029
cloudvirt1030
cloudvirt1036
cloudvirt1037
cloudvirt1038
cloudvirt1039
cloudvirt1040
cloudvirt1041
cloudvirt1042
cloudvirt1043
cloudvirt1044
cloudvirt1045
cloudvirt1046
cloudvirt1047
cloudvirtlocal1001

We also have the following nodes, which (once T346891 is complete) I believe aren't an issue, as all of them have partners in other racks that can provide service during the outage:

cloudbackup1004
cloudcontrol1006
cloudgw1002
cloudlb1002
cloudnet1006
cloudservices1005

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T11:43:04Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T11:55:00Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T11:55:10Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T11:55:16Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T13:37:32Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T316544)

aborrero updated the task description.

Unfortunately, it seems that cluster usage has grown in the last few days :/ Draining the last 21 OSD daemons would push it over 90% mean capacity, at which point writes stop.

I'll undrain to check whether the actual usage has increased and try to reduce it, but if we are over 67%, we will not be able to take down a whole rack.
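
The relationship between the two percentages can be sketched with simple arithmetic. This is a rough model, not how Ceph itself computes fullness: it assumes equally sized OSDs and that data rebalances evenly over the remaining ones. The 25% rack share used below is an illustrative assumption, chosen only because it is roughly consistent with the 67% and 90% figures in the comment. The function names are hypothetical.

```python
def utilization_after_drain(current_utilization: float, drained_fraction: float) -> float:
    """Mean utilization once `drained_fraction` of raw capacity is removed.

    Assumes the same amount of data is spread evenly over what remains.
    """
    if not 0 <= drained_fraction < 1:
        raise ValueError("drained_fraction must be in [0, 1)")
    return current_utilization / (1 - drained_fraction)


def max_safe_utilization(limit: float, drained_fraction: float) -> float:
    """Highest current utilization that stays at or below `limit` after the drain."""
    if not 0 <= drained_fraction < 1:
        raise ValueError("drained_fraction must be in [0, 1)")
    return limit * (1 - drained_fraction)


if __name__ == "__main__":
    rack_share = 0.25  # assumption: the rack holds about a quarter of capacity
    print(f"{utilization_after_drain(0.67, rack_share):.3f}")
    print(f"{max_safe_utilization(0.90, rack_share):.3f}")
```

Under these assumptions, a cluster at 67% ends up just under the 90% mark after losing a quarter of its capacity, which is why any growth beyond that leaves no room to drain a whole rack.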

@cmooney @VRiley-WMF Hi! I'm almost done draining the rack, we can try to find a slot starting next week to do the reboot.

@VRiley-WMF as we did with T371878: [network,D5] reboot cloudsw-d5, we will need you to be on standby with a replacement switch during the reboot, so tell us when it works for you :) There's no rush, so we can wait a couple of weeks or so if you are going to be in the DC for anything else, and do the reboot then.

The upgrade was successful today on cloudsw1-c8-eqiad, the last of these we needed to do. Big thanks to the WMCS team for the hard work moving things around to allow it (and here's to a quiet and stable time with them from now on).

Change #947715 merged by jenkins-bot:

[operations/homer/public@master] Enable sftp-server

https://gerrit.wikimedia.org/r/947715

Change #953963 merged by jenkins-bot:

[operations/homer/public@master] Enable GNMI on cloudsw

https://gerrit.wikimedia.org/r/953963