
Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+
Open, LowPublic

Description

cloudsw1-c8-eqiad and cloudsw1-d5-eqiad are running Junos 18.4R2-S4.10.

Opening this task to track upgrading them to Junos 20+ to bring them into line with the other cloudsw devices (which are on 20.2 and 20.4).
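
For reference, one quick way to audit the running version on each device is via Junos PyEZ; a minimal sketch, assuming SSH reachability and key auth (the username is a placeholder, not the actual access method used here):

```
# Sketch only (not WMF tooling): audit Junos versions with PyEZ.
from jnpr.junos import Device

SWITCHES = ["cloudsw1-c8-eqiad", "cloudsw1-d5-eqiad"]  # devices from this task

for name in SWITCHES:
    with Device(host=name, user="netops") as dev:   # "netops" is a placeholder user
        print(f"{name}: Junos {dev.facts['version']}")
```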

The plan is to upgrade each switch one by one. The 'cloudsw2' devices in each of these racks are daisy-chained from the respective cloudsw1 device in the same rack, so when we upgrade each cloudsw1, all hosts in that rack will be offline for the duration of the work. Connectivity to hosts in other racks should remain up throughout.

In total the upgrade of each device should take in the region of 20-30 minutes, during which all hosts in the rack will suffer a complete network outage. We should therefore do it within a maintenance window, and depool, prep or otherwise do whatever is required to minimize the impact. We should also make sure the active cloudnet and cloudgw hosts are manually failed over in advance.
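
For illustration only, the per-switch upgrade step could look something like the PyEZ sketch below. The package filename and user are placeholders, and the actual work may well be done from the switch CLI instead; this also does not cover the depool/failover prep described above:

```
# Rough sketch of a single-switch Junos upgrade using PyEZ. The image is
# assumed to be available locally; install() copies it to /var/tmp on the
# device by default before installing. Not the actual WMF procedure.
from jnpr.junos import Device
from jnpr.junos.utils.sw import SW

HOST = "cloudsw1-c8-eqiad"                       # repeat for cloudsw1-d5-eqiad
PACKAGE = "junos-install-20.x-placeholder.tgz"   # placeholder image name

with Device(host=HOST, user="netops") as dev:    # "netops" is a placeholder user
    sw = SW(dev)
    ok = sw.install(package=PACKAGE, validate=True)  # validate config against new image
    if not ok:
        raise RuntimeError(f"install failed on {HOST}")
    sw.reboot()  # hosts in the rack lose network until the switch is back (~20-30 min)
```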

The hosts that will be affected are as follows:

Rack C8 (also including hosts in row B which connect via this switch):

cloudbackup1003
cloudcephmon1001
cloudcephmon1003
cloudcephosd1001
cloudcephosd1002
cloudcephosd1003
cloudcephosd1004
cloudcephosd1005
cloudcephosd1006
cloudcephosd1007
cloudcephosd1008
cloudcephosd1009
cloudcephosd1016
cloudcephosd1017
cloudcephosd1018
cloudcephosd1021
cloudcephosd1022
cloudgw1001
cloudnet1005
cloudvirt1017
cloudvirt1019
cloudvirt1020
cloudvirt1021
cloudvirt1022
cloudvirt1023
cloudvirt1024
cloudvirt1025
cloudvirt1026
cloudvirt1027
cloudvirt1031
cloudvirt1032
cloudvirt1033
cloudvirt1034
cloudvirt1035
cloudvirt-wdqs1001
cloudvirt-wdqs1002
cloudvirt-wdqs1003

Rack D5:

cloudbackup1004
cloudcephmon1002
cloudcephosd1010
cloudcephosd1011
cloudcephosd1012
cloudcephosd1013
cloudcephosd1014
cloudcephosd1015
cloudcephosd1019
cloudcephosd1020
cloudcephosd1023
cloudcephosd1024
cloudgw1002
cloudnet1006
cloudvirt1028
cloudvirt1029
cloudvirt1030
cloudvirt1036
cloudvirt1037
cloudvirt1038
cloudvirt1039
cloudvirt1040
cloudvirt1041
cloudvirt1042
cloudvirt1043
cloudvirt1044
cloudvirt1045
cloudvirt1046
cloudvirt1047

Event Timeline

cmooney created this task.

Folks I was considering doing these upgrades on the following dates:

cloudsw1-c8-eqiad - Monday February 13th
cloudsw1-d5-eqiad - Tuesday February 14th

That's assuming it gives enough time to get ready; if not, we can do it the week starting February 27th or later in March.

I think those dates are fine, cc @dcaro -- let's discuss the best way to reduce impact on Ceph (downtime, norebalance, etc.). There are 10 cloudcephosds and 1 mon on each of those switches.
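
As a rough illustration of the "norebalance"-style prep being discussed (the actual maintenance in this task is set via a WMCS cookbook, as logged in the SAL entry further down, so treat this purely as a sketch of the underlying Ceph commands):

```
# Sketch only: set/unset cluster flags so Ceph doesn't start rebalancing while
# a rack's OSDs are unreachable. Assumes it runs where an admin keyring exists.
import subprocess

FLAGS = ["noout", "norebalance"]  # keep OSDs "in" and stop data movement

def set_maintenance() -> None:
    for flag in FLAGS:
        subprocess.run(["ceph", "osd", "set", flag], check=True)

def unset_maintenance() -> None:
    for flag in FLAGS:
        subprocess.run(["ceph", "osd", "unset", flag], check=True)

if __name__ == "__main__":
    set_maintenance()  # before the switch reboot; call unset_maintenance() afterwards
```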

Folks I was considering doing these upgrades on the following dates:

cloudsw1-c8-eqiad - Monday February 13th
cloudsw1-d5-eqiad - Tuesday February 14th

Works for me too.

Cool, let's see what @dcaro says when he's back.

We have a ton of rebalancing to do for each of these switches. The C8 deadline we can meet, but can we get two weeks to shuffle our data around between C8 and D5? Otherwise our network will get saturated while we try to drain all those hypervisors and OSD nodes.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-02T12:57:13Z] <wm-bot2> Set the ceph cluster for eqiad1 in maintenance, alert silence ids: 7ac2b25a-d1bb-4789-8aa6-b9435b505349 (T316544) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2023-02-02T13:14:29Z] <dcaro_away> draining osd.48 from node cloudcephosd1001 (T316544)
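
For context, "draining" an OSD as logged above boils down to something like the following with stock Ceph commands (the actual drain here is driven by WMCS tooling; osd.48 is only used as the example ID):

```
# Sketch of draining one OSD with plain Ceph commands (not the WMCS cookbook).
import subprocess

def drain_osd(osd: str = "osd.48") -> None:
    # Setting the CRUSH weight to 0 tells Ceph to backfill this OSD's placement
    # groups onto other OSDs; the daemon stays up and serves data during the move.
    subprocess.run(["ceph", "osd", "crush", "reweight", osd, "0"], check=True)
    # Progress can be followed with "ceph -s"; once all PGs are active+clean
    # again, the OSD holds no data and its host can be taken offline safely.

if __name__ == "__main__":
    drain_osd()
```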

We have a ton of rebalancing to do for each of these switches. The C8 deadline we can meet, but can we get two weeks to shuffle our data around between C8 and D5? Otherwise our network will get saturated while we try to drain all those hypervisors and OSD nodes.

Sure, no problem -- definitely no need to rush it. Say Tues Feb 28th in that case?

cloudsw1-c8-eqiad - Monday February 13th
cloudsw1-d5-eqiad - Tuesday February 28th

So currently we can't take down all the OSDs in rack C8 (14 of them), as we don't have enough space to reallocate their data onto the others.

We are looking at some options for how to get this going, but the main ones on our mind right now are:

  • Moving some hosts to other racks, to achieve "rack" HA (re-take T297083)
  • Moving/connecting some of the hosts to other switches, then reconnecting them to the upgraded switch afterwards (as long as we do it one-by-one/two-by-two there should be no issue)
  • Trying to free up some space, though we don't have high hopes that it would be enough
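
To put rough numbers on the capacity problem above, one could compare the data held by the rack's OSD hosts against the free space elsewhere. A sketch only, ignoring replication and CRUSH placement, and assuming the JSON field names produced by recent Ceph releases:

```
# Rough capacity check: can the rest of the cluster absorb rack C8's data?
# Sketch only; ignores replication/CRUSH constraints and assumes the JSON
# layout of "ceph osd df tree" in recent Ceph releases.
import json
import subprocess

# OSD hosts behind cloudsw1-c8-eqiad, taken from the list in this task.
C8_HOSTS = {f"cloudcephosd10{n:02d}" for n in
            (1, 2, 3, 4, 5, 6, 7, 8, 9, 16, 17, 18, 21, 22)}

out = subprocess.run(["ceph", "osd", "df", "tree", "-f", "json"],
                     check=True, capture_output=True, text=True).stdout
hosts = [n for n in json.loads(out)["nodes"] if n.get("type") == "host"]

to_move = sum(h["kb_used"] for h in hosts if h["name"] in C8_HOSTS)
free_elsewhere = sum(h["kb_avail"] for h in hosts if h["name"] not in C8_HOSTS)
print(f"data on C8 hosts: {to_move / 2**30:.1f} TiB, "
      f"free elsewhere: {free_elsewhere / 2**30:.1f} TiB")
```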

As I understand it, the work being done to distribute load better, so that rack C8 can be taken offline for the upgrades, is going to take a while longer.

As such I will hold off on any plan to upgrade on Monday (Feb 13th).

Perhaps we could do:

cloudsw1-c8-eqiad - Thursday March 2nd
cloudsw1-d5-eqiad - Thursday March 9th

But no pressure, @dcaro -- let me know when you are ready and we can agree on a date. Thanks.