
ceph: Test behavior when an osd host goes down on codfw
Closed, ResolvedPublic

Description


Test and document what happens when an osd goes down, it's down for a while, and comes up on the codfw cluster.

Event Timeline

dcaro triaged this task as High priority.Aug 5 2021, 9:15 AM
dcaro created this task.

Mentioned in SAL (#wikimedia-cloud) [2021-08-05T09:37:08Z] <dcaro> Taking one osd daemon down on codfw cluster (T288203)

TL;DR

It's OK to take an OSD down. If the downtime is short (less than 10 min, configurable), no rebalancing is needed; otherwise some rebalancing will happen, though the cluster stays available the whole time.
Note that rebalancing after bringing the OSD back up is much faster than after taking it down, as it reuses the existing data.
Note 2: the recovery traffic used the private interface; we don't collect metrics for it, so I monitored it manually on the host.
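The "less than 10 min, configurable" grace period corresponds to Ceph's mon_osd_down_out_interval option (600 s by default). A sketch of checking and adjusting it, assuming a cluster recent enough to have the centralized 'ceph config' database (Nautilus or later); the 1800 value is only an example:

```shell
# Show the grace period before a down OSD is marked out and
# rebalancing starts (default 600 s = 10 min)
ceph config get mon mon_osd_down_out_interval

# Example only: extend it to 30 min ahead of a longer planned outage,
# then drop the override afterwards to return to the default
ceph config set mon mon_osd_down_out_interval 1800
ceph config rm mon mon_osd_down_out_interval
```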

The test was:

  • Stop osd.2 from cloudcephosd2002-dev
  • Monitor the cluster, until everything is rebalanced
  • Bring the osd back up
  • Monitor again

The full log of 'ceph status' for the procedure:


Key points:

  • Stopping the osd:
root@cloudcephosd2002-dev:~# systemctl stop ceph-osd@2.service
root@cloudcephosd2002-dev:~# date
Thu 05 Aug 2021 09:41:37 AM UTC
  • OSD marked down, cluster in HEALTH_WARN with degraded data redundancy: less than 20s after stopping the osd (more like within the next second).
  • Cluster starts rebalancing (600s after the osd went down, per the mon_osd_down_out_interval config): Thu 05 Aug 2021 09:51:42 AM UTC
  • HEALTH_OK (37m 35s after recovery started): Thu 05 Aug 2021 10:29:17 AM UTC
  • Cluster ends recovery traffic (37m 46s after recovery started): Thu 05 Aug 2021 10:29:28 AM UTC
  • OSD back up: Thu 05 Aug 2021 10:32:32 AM UTC
root@cloudcephosd2002-dev:~# date
Thu 05 Aug 2021 10:32:32 AM UTC
root@cloudcephosd2002-dev:~# systemctl start ceph-osd@2.service
  • OSD marked up (35s after service started): Thu 05 Aug 2021 10:33:07 AM UTC
  • Rebalancing started (3s after osd in): Thu 05 Aug 2021 10:33:10 AM UTC
  • Recovery started (2s after rebalancing started): Thu 05 Aug 2021 10:33:12 AM UTC
  • Recovery finished (2min 50s after recovery started): Thu 05 Aug 2021 10:35:02 AM UTC
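The elapsed times above can be double-checked with date(1); for example, the first recovery ran from 09:51:42 to 10:29:17 UTC:

```shell
# Recompute the "37m 35s" recovery duration from the timestamps above
start=$(date -u -d '2021-08-05T09:51:42Z' +%s)
end=$(date -u -d '2021-08-05T10:29:17Z' +%s)
echo "$(( (end - start) / 60 ))m $(( (end - start) % 60 ))s"
```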

Running 'ceph osd df' after the procedure shows little variance and well-balanced data:

root@cloudcephmon2002-dev:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META      AVAIL    %USE   VAR   PGS  STATUS
 0    ssd  0.87299   1.00000  894 GiB  172 GiB  171 GiB  2.5 MiB  1022 MiB  722 GiB  19.29  1.04  210      up
 1    ssd  0.87299   1.00000  894 GiB  159 GiB  158 GiB  2.8 MiB  1021 MiB  735 GiB  17.76  0.96  191      up
 2    ssd  0.87299   1.00000  894 GiB  156 GiB  155 GiB  2.7 MiB  1021 MiB  738 GiB  17.45  0.94  190      up
 3    ssd  0.87299   1.00000  894 GiB  175 GiB  174 GiB  1.4 MiB   1.2 GiB  719 GiB  19.62  1.06  211      up
 4    ssd  0.87299   1.00000  894 GiB  169 GiB  168 GiB  3.2 MiB  1021 MiB  725 GiB  18.89  1.02  202      up
 5    ssd  0.87299   1.00000  894 GiB  162 GiB  161 GiB  5.6 MiB  1018 MiB  732 GiB  18.16  0.98  199      up
                       TOTAL  5.2 TiB  994 GiB  988 GiB   18 MiB   6.2 GiB  4.3 TiB  18.53
MIN/MAX VAR: 0.94/1.06  STDDEV: 0.80
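The VAR column is each OSD's utilization relative to the cluster average, so 0.94/1.06 means every OSD sits within about 6% of the mean. The MIN/MAX VAR line can be reproduced from the %USE figures above (the STDDEV comes out as roughly 0.79 from these rounded values, versus Ceph's 0.80 from the exact internal ones):

```shell
# Recompute MIN/MAX VAR from the %USE column of the table above
printf '%s\n' 19.29 17.76 17.45 19.62 18.89 18.16 |
awk '{ u[NR] = $1; sum += $1 }
     END {
       mean = sum / NR
       min = max = u[1] / mean
       for (i = 2; i <= NR; i++) {
         v = u[i] / mean
         if (v < min) min = v
         if (v > max) max = v
       }
       printf "MIN/MAX VAR: %.2f/%.2f\n", min, max
     }'
```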

Just to double-check -- the takeaway from this is that when we need to do quick service on an osd node (e.g. drive replacement for T287838) we can just shut down the host, service it, switch it back on, and Ceph will do something reasonable?


Yes. And even if it's not quick, Ceph will manage; it may take more time and recovery traffic if the downtime is longer, but it will eventually get there.
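For planned quick maintenance there is also the option of setting the noout flag beforehand, which stops Ceph from marking the down OSDs out at all, so no rebalancing happens while the host is serviced; a sketch:

```shell
# Before taking the host down: keep down OSDs from being marked out
ceph osd set noout

# ... power off the host, replace the drive, boot it back up ...

# Once the OSDs are back up, remove the flag so normal handling resumes
ceph osd unset noout
```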