
Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+
Open, Low, Public

Description

cloudsw1-c8-eqiad and cloudsw1-d5-eqiad are running JunOS 18.4R2-S4.10.

Opening this task to track upgrading them to JunOS 20+ to bring them into line with the other cloudsw devices (which are on 20.2 and 20.4).

The plan is to upgrade the switches one at a time. The 'cloudsw2' device in each of these racks is daisy-chained from the cloudsw1 device in the same rack, so when we upgrade a cloudsw1 all hosts in that rack will be offline for the duration of the work. Connectivity to hosts in other racks should remain up throughout.

Each device upgrade should take in the region of 20-30 minutes, during which all hosts in the rack will suffer a complete network outage. We should therefore do it in a maintenance window, and depool, prep or otherwise do whatever is required to minimize the impact. We should also make sure the active cloudnet and cloudgw hosts are manually failed over in advance.
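For reference, a rough sketch of the per-switch Junos steps (the image filename and exact options below are illustrative assumptions, not the confirmed procedure for these switches):

```
# on each cloudsw, roughly; the image name/path is a placeholder
show version                                  # confirm the current 18.4R2-S4.10 install
request system storage cleanup                # free space before copying the new image
request system snapshot                       # recovery point before upgrading
request system software add /var/tmp/<junos-20.x-image>.tgz no-copy reboot
# after the switch comes back (the 20-30 minute window):
show version                                  # confirm the new release booted
show chassis alarms                           # check for post-upgrade alarms
show interfaces terse | match "xe-|et-"       # host-facing links back up
show bgp summary                              # routing sessions re-established
```

Any depooling and the cloudnet/cloudgw failover would of course happen before the `request system software add` step.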

The hosts that will be affected are as follows:

Rack C8 (also including hosts in row B which connect via this switch):

cloudbackup1003
cloudcephmon1001
cloudcephmon1003
cloudcephosd1006
cloudcephosd1007
cloudcephosd1008
cloudcephosd1009
cloudcephosd1016
cloudcephosd1017
cloudcephosd1018
cloudcephosd1021
cloudcephosd1022
cloudgw1001
cloudnet1005
cloudlb1001
cloudvirt1025
cloudvirt1026
cloudvirt1027
cloudvirt1031
cloudvirt1032
cloudvirt1033
cloudvirt1034
cloudvirt1035
cloudvirt-wdqs1001
cloudvirt-wdqs1002
cloudvirt-wdqs1003

Rack D5:

cloudbackup1004
cloudcephmon1002 - no action needed (HA)
cloudcephosd1011 - to drain - ready
cloudcephosd1012 - to drain - ready
cloudcephosd1013 - to drain - ready
cloudcephosd1014 - to drain - ready
cloudcephosd1015 - to drain - ready
cloudcephosd1019 - to drain
cloudcephosd1020 - to drain
cloudcephosd1023 - to drain
cloudcephosd1024 - to drain
cloudgw1002 - no action needed (HA)
cloudnet1006 - no action needed (HA)
cloudlb1002
cloudvirt1028
cloudvirt1029
cloudvirt1030
cloudvirt1036
cloudvirt1037
cloudvirt1038
cloudvirt1039
cloudvirt1040
cloudvirt1041
cloudvirt1042
cloudvirt1043
cloudvirt1044
cloudvirt1045
cloudvirt1046
cloudvirt1047
cloudvirtlocal1001

We need to move the VMs running on these cloudvirts to other hypervisors. We can't move all of them, so we should move only the sensitive ones; the rest should be able to come back once the network is restored.

List of VMs to move to a different rack:
TBD
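
Once that list is filled in, the move itself is ordinary live migration; a generic openstackclient sketch (not the WMCS cookbook workflow; flag names vary between client versions, and the target hypervisor is a placeholder):

```
# list instances on one of the affected hypervisors (admin credentials assumed)
openstack server list --all-projects --host cloudvirt1028

# live-migrate a sensitive instance to a hypervisor in another rack
# (<target-cloudvirt> is a placeholder; pick one outside racks C8/D5)
openstack server migrate --live-migration --host <target-cloudvirt> <instance-uuid>

# confirm status and final placement
openstack server show <instance-uuid> -c status -c OS-EXT-SRV-ATTR:host
```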

Event Timeline

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:34:26Z] <wm-bot2> rebooting all the workers of tools k8s cluster (64 nodes) (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:35:47Z] <wm-bot2> rebooted k8s node tools-k8s-worker-88 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:37:09Z] <wm-bot2> rebooted k8s node tools-k8s-worker-87 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:38:21Z] <wm-bot2> rebooted k8s node tools-k8s-worker-86 (T316544) - cookbook ran by dcaro@vulcanus

We discovered today that our ceph setup can't handle the amount of nodes down that this operation requires. Please hold off.

I don't think the issue is the number of nodes being down; the cluster should be more than fine without them. We'll have to investigate the cause, probably a combination of load and Ceph trying to replicate all the data at once, making operations too slow for NFS to behave nicely.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:47:03Z] <wm-bot2> rebooted k8s node tools-k8s-worker-85 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:48:25Z] <wm-bot2> rebooted k8s node tools-k8s-worker-84 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T17:57:07Z] <wm-bot2> rebooted k8s node tools-k8s-worker-83 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:05:27Z] <wm-bot2> rebooted k8s node tools-k8s-worker-82 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:06:52Z] <wm-bot2> rebooted k8s node tools-k8s-worker-81 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:08:21Z] <wm-bot2> rebooted k8s node tools-k8s-worker-80 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:15:13Z] <wm-bot2> rebooted k8s node tools-k8s-worker-77 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:20:59Z] <wm-bot2> rebooted k8s node tools-k8s-worker-76 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:22:20Z] <wm-bot2> rebooted k8s node tools-k8s-worker-75 (T316544) - cookbook ran by dcaro@vulcanus

> We discovered today that our ceph setup can't handle the amount of nodes down that this operation requires. Please hold off.

> I don't think the issue is the number of nodes being down; the cluster should be more than fine without them. We'll have to investigate the cause, probably a combination of load and Ceph trying to replicate all the data at once, making operations too slow for NFS to behave nicely.

Ok, thanks for the info. Will reschedule for another date.

We could also shut the ceph-facing ports down more slowly, if that would be gentler on the cluster (e.g. one every 20 minutes over a few hours).

Of course the switch itself is just a piece of hardware; it could die at any point and cause this kind of outage. So it's better if the cluster, and the things that depend on it, can deal with the sudden change.
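
If we did go with the staggered approach, the per-port mechanics would be roughly the following, repeated every ~20 minutes for each ceph-facing port (the interface name is illustrative, and in practice the change would presumably go through Homer rather than ad-hoc CLI):

```
configure
set interfaces xe-0/0/10 disable              # placeholder ceph-facing port
commit comment "T316544: staggered shutdown of ceph-facing port"
exit
```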

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:28:06Z] <wm-bot2> rebooted k8s node tools-k8s-worker-74 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:34:17Z] <wm-bot2> rebooted k8s node tools-k8s-worker-73 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:39:44Z] <wm-bot2> rebooted k8s node tools-k8s-worker-72 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:42:44Z] <wm-bot2> rebooted k8s node tools-k8s-worker-71 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:44:19Z] <wm-bot2> rebooted k8s node tools-k8s-worker-70 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:49:57Z] <wm-bot2> rebooted k8s node tools-k8s-worker-69 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T18:56:07Z] <wm-bot2> rebooted k8s node tools-k8s-worker-68 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T19:04:35Z] <wm-bot2> rebooted k8s node tools-k8s-worker-67 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:24:13Z] <wm-bot2> rebooted k8s node tools-k8s-worker-30 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:32:27Z] <wm-bot2> rebooted k8s node tools-k8s-worker-31 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:36:44Z] <wm-bot2> rebooted k8s node tools-k8s-worker-32 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:42:10Z] <wm-bot2> rebooted k8s node tools-k8s-worker-33 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:47:34Z] <wm-bot2> rebooted k8s node tools-k8s-worker-34 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:48:47Z] <wm-bot2> rebooted k8s node tools-k8s-worker-35 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:49:47Z] <wm-bot2> rebooted k8s node tools-k8s-worker-36 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:50:49Z] <wm-bot2> rebooted k8s node tools-k8s-worker-37 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:52:04Z] <wm-bot2> rebooted k8s node tools-k8s-worker-38 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T20:58:16Z] <wm-bot2> rebooted k8s node tools-k8s-worker-39 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:03:55Z] <wm-bot2> rebooted k8s node tools-k8s-worker-40 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:09:49Z] <wm-bot2> rebooted k8s node tools-k8s-worker-41 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:12:39Z] <wm-bot2> rebooted k8s node tools-k8s-worker-42 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:13:54Z] <wm-bot2> rebooted k8s node tools-k8s-worker-43 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:15:13Z] <wm-bot2> rebooted k8s node tools-k8s-worker-44 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:16:26Z] <wm-bot2> rebooted k8s node tools-k8s-worker-45 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:37:30Z] <wm-bot2> rebooted k8s node tools-k8s-worker-47 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:38:53Z] <wm-bot2> rebooted k8s node tools-k8s-worker-48 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:40:04Z] <wm-bot2> rebooted k8s node tools-k8s-worker-49 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:41:01Z] <wm-bot2> rebooted k8s node tools-k8s-worker-50 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:42:00Z] <wm-bot2> rebooted k8s node tools-k8s-worker-51 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:44:40Z] <wm-bot2> rebooted k8s node tools-k8s-worker-52 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:47:08Z] <wm-bot2> rebooted k8s node tools-k8s-worker-53 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:49:28Z] <wm-bot2> rebooted k8s node tools-k8s-worker-54 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:50:38Z] <wm-bot2> rebooted k8s node tools-k8s-worker-55 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:51:46Z] <wm-bot2> rebooted k8s node tools-k8s-worker-56 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:52:43Z] <wm-bot2> rebooted k8s node tools-k8s-worker-57 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:54:47Z] <wm-bot2> rebooted k8s node tools-k8s-worker-58 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:55:58Z] <wm-bot2> rebooted k8s node tools-k8s-worker-59 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T21:58:22Z] <wm-bot2> rebooted k8s node tools-k8s-worker-60 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:02:04Z] <wm-bot2> rebooted k8s node tools-k8s-worker-61 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:04:00Z] <wm-bot2> rebooted k8s node tools-k8s-worker-62 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:06:35Z] <wm-bot2> rebooted k8s node tools-k8s-worker-64 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:07:50Z] <wm-bot2> rebooted k8s node tools-k8s-worker-65 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-15T22:09:37Z] <wm-bot2> rebooted k8s node tools-k8s-worker-66 (T316544) - cookbook ran by andrew@bullseye

Mentioned in SAL (#wikimedia-cloud) [2023-05-16T08:07:54Z] <dcaro> reboot tools-sgebastion-10 (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:29:01Z] <wm-bot2> rebooted k8s node tools-k8s-worker-76 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:42:52Z] <wm-bot2> rebooted k8s node tools-k8s-worker-69 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:45:39Z] <wm-bot2> rebooted k8s node tools-k8s-worker-48 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:46:34Z] <wm-bot2> rebooted k8s node tools-k8s-worker-47 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T07:54:44Z] <wm-bot2> rebooted k8s node tools-k8s-worker-72 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:03:16Z] <wm-bot2> rebooted k8s node tools-k8s-worker-66 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:10:26Z] <wm-bot2> rebooted k8s node tools-k8s-worker-70 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:17:43Z] <wm-bot2> rebooted k8s node tools-k8s-worker-61 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:25:52Z] <wm-bot2> rebooted k8s node tools-k8s-worker-74 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:32:55Z] <wm-bot2> rebooted k8s node tools-k8s-worker-75 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:33:56Z] <wm-bot2> rebooted k8s node tools-k8s-worker-64 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T08:49:15Z] <wm-bot2> rebooted k8s node tools-k8s-worker-55 (T316544) - cookbook ran by dcaro@vulcanus

Change 920644 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: wmf-auto-restart: exclude NFS filesystems

https://gerrit.wikimedia.org/r/920644

Change 920648 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] profile::auto_restarts: allow the systemd timer to not be installed

https://gerrit.wikimedia.org/r/920648

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-17T12:48:46Z] <wm-bot2> rebooted k8s node tools-k8s-worker-71 (T316544) - cookbook ran by dcaro@vulcanus

Change 947715 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Enable sftp-server

https://gerrit.wikimedia.org/r/947715

Change 953963 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Enable GNMI on cloudsw

https://gerrit.wikimedia.org/r/953963

I'm going to start draining nodes from D5:

cloudcephosd1011
cloudcephosd1012
cloudcephosd1013
cloudcephosd1014
cloudcephosd1015
cloudcephosd1019
cloudcephosd1020
cloudcephosd1023
cloudcephosd1024
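
For reference, the checks that go alongside the drain; the drain itself runs via the wmcs.ceph.osd.drain_node cookbook (its exact arguments are omitted here, since they aren't spelled out in this task):

```
ceph df              # overall raw usage; mean capacity needs to stay well under 90%
ceph osd df tree     # per-OSD / per-host utilisation, to spot skew while backfilling
ceph status          # watch recovery and backfill progress between nodes
```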

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T08:44:38Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node (T316544)

> I'm going to start draining nodes from D5:

@dcaro that's great, thanks! Let me know when you have that done and we can coordinate with the rest of the team on a time for the upgrade/reboot.

In terms of the other nodes in rack D5, we have the following cloudvirts and should consider moving instances off them ahead of the upgrade:

cloudvirt1028
cloudvirt1029
cloudvirt1030
cloudvirt1036
cloudvirt1037
cloudvirt1038
cloudvirt1039
cloudvirt1040
cloudvirt1041
cloudvirt1042
cloudvirt1043
cloudvirt1044
cloudvirt1045
cloudvirt1046
cloudvirt1047
cloudvirtlocal1001

We also have the following nodes, which (once T346891 is complete) I believe aren't an issue, as all of them have partners in other racks that can provide service during the outage:

cloudbackup1004
cloudcontrol1006
cloudgw1002
cloudlb1002
cloudnet1006
cloudservices1005

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T11:43:04Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T11:55:00Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T11:55:10Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T11:55:16Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.drain_node (T316544)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-02T13:37:32Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T316544)

aborrero updated the task description.

Unfortunately, it seems that the cluster has grown in the last few days :/, as draining the remaining 21 OSD daemons would push it over 90% mean capacity (which would stop writes).

I'll undrain to check whether the actual usage has increased and try to reduce it, but if we are over ~67% usage we will not be able to take down a whole rack.
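
For what it's worth, the arithmetic behind that ~67% figure, under the assumption (not measured here) that OSD capacity is split roughly evenly across three racks:

```
# with one rack of OSDs down, surviving capacity is ~2/3 of the total, so:
#   usage < 2/3       (~67%)  is needed just to fit the data on the remaining OSDs
#   usage < 0.9 * 2/3 (~60%)  is needed to stay under the 90% full ratio that stops writes
# current mean usage can be read from:
ceph df
ceph osd df tree
```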