
ceph: Test behavior when an osd host goes down on codfw
Closed, ResolvedPublic

Description


Test and document what happens when an osd goes down, it's down for a while, and comes up on the codfw cluster.

Event Timeline

dcaro triaged this task as High priority.Aug 5 2021, 9:15 AM
dcaro created this task.

Mentioned in SAL (#wikimedia-cloud) [2021-08-05T09:37:08Z] <dcaro> Taking one osd daemon down on codfw cluster (T288203)

TL;DR

It's OK to take an OSD down. If the downtime is short (less than 10 min, configurable), no rebalancing is needed; otherwise some rebalancing will happen, though the cluster stays available the whole time.
Note that rebalancing after bringing the OSD back up is much faster than after taking it down, as it reuses the existing data.
Note 2: the recovery traffic used the private interface; we don't collect metrics for it, so I monitored it manually on the host.
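The "less than 10 min, configurable" grace period corresponds to Ceph's mon_osd_down_out_interval option (600 s by default). A sketch of checking and adjusting it, assuming a cluster recent enough to have the centralized 'ceph config' database (Nautilus or later); the 1800 value is only an example:

```shell
# Show the grace period before a down OSD is marked out and
# rebalancing starts (default 600 s = 10 min)
ceph config get mon mon_osd_down_out_interval

# Example only: extend it to 30 min ahead of a longer planned outage,
# then drop the override afterwards to return to the default
ceph config set mon mon_osd_down_out_interval 1800
ceph config rm mon mon_osd_down_out_interval
```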

The test was:

  • Stop osd.2 from cloudcephosd2002-dev
  • Monitor the cluster, until everything is rebalanced
  • Bring the osd back up
  • Monitor again

The full log of 'ceph status' for the procedure:


Key points:

  • Stopping the osd:
root@cloudcephosd2002-dev:~# systemctl stop ceph-osd@2.service
root@cloudcephosd2002-dev:~# date
Thu 05 Aug 2021 09:41:37 AM UTC
  • OSD marked down, cluster in HEALTH_WARN with degraded data redundancy: less than 20s after stopping the osd (more like within the next second).
  • Cluster starts rebalancing (600s after the osd went down, per the mon_osd_down_out_interval config): Thu 05 Aug 2021 09:51:42 AM UTC
  • HEALTH_OK (37m 35s after recovery started): Thu 05 Aug 2021 10:29:17 AM UTC
  • Cluster ends recovery traffic (37m 46s after recovery started): Thu 05 Aug 2021 10:29:28 AM UTC
  • OSD back up: Thu 05 Aug 2021 10:32:32 AM UTC
root@cloudcephosd2002-dev:~# date
Thu 05 Aug 2021 10:32:32 AM UTC
root@cloudcephosd2002-dev:~# systemctl start ceph-osd@2.service
  • OSD marked up (35s after service started): Thu 05 Aug 2021 10:33:07 AM UTC
  • Rebalancing started (3s after osd in): Thu 05 Aug 2021 10:33:10 AM UTC
  • Recovery started (2s after rebalancing started): Thu 05 Aug 2021 10:33:12 AM UTC
  • Recovery finished (2min 50s after recovery started): Thu 05 Aug 2021 10:35:02 AM UTC
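The elapsed times above can be double-checked with date(1); for example, the first recovery ran from 09:51:42 to 10:29:17 UTC:

```shell
# Recompute the "37m 35s" recovery duration from the timestamps above
start=$(date -u -d '2021-08-05T09:51:42Z' +%s)
end=$(date -u -d '2021-08-05T10:29:17Z' +%s)
echo "$(( (end - start) / 60 ))m $(( (end - start) % 60 ))s"
```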

Running 'ceph osd df' after the procedure shows little variance and well-balanced data:

root@cloudcephmon2002-dev:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META      AVAIL    %USE   VAR   PGS  STATUS
 0    ssd  0.87299   1.00000  894 GiB  172 GiB  171 GiB  2.5 MiB  1022 MiB  722 GiB  19.29  1.04  210      up
 1    ssd  0.87299   1.00000  894 GiB  159 GiB  158 GiB  2.8 MiB  1021 MiB  735 GiB  17.76  0.96  191      up
 2    ssd  0.87299   1.00000  894 GiB  156 GiB  155 GiB  2.7 MiB  1021 MiB  738 GiB  17.45  0.94  190      up
 3    ssd  0.87299   1.00000  894 GiB  175 GiB  174 GiB  1.4 MiB   1.2 GiB  719 GiB  19.62  1.06  211      up
 4    ssd  0.87299   1.00000  894 GiB  169 GiB  168 GiB  3.2 MiB  1021 MiB  725 GiB  18.89  1.02  202      up
 5    ssd  0.87299   1.00000  894 GiB  162 GiB  161 GiB  5.6 MiB  1018 MiB  732 GiB  18.16  0.98  199      up
                       TOTAL  5.2 TiB  994 GiB  988 GiB   18 MiB   6.2 GiB  4.3 TiB  18.53
MIN/MAX VAR: 0.94/1.06  STDDEV: 0.80
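The VAR column is each OSD's utilization relative to the cluster average, so 0.94/1.06 means every OSD sits within about 6% of the mean. The MIN/MAX VAR line can be reproduced from the %USE figures above (the STDDEV comes out as roughly 0.79 from these rounded values, versus Ceph's 0.80 from the exact internal ones):

```shell
# Recompute MIN/MAX VAR from the %USE column of the table above
printf '%s\n' 19.29 17.76 17.45 19.62 18.89 18.16 |
awk '{ u[NR] = $1; sum += $1 }
     END {
       mean = sum / NR
       min = max = u[1] / mean
       for (i = 2; i <= NR; i++) {
         v = u[i] / mean
         if (v < min) min = v
         if (v > max) max = v
       }
       printf "MIN/MAX VAR: %.2f/%.2f\n", min, max
     }'
```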

Just to double-check -- the takeaway from this is that when we need to do quick service on an osd node (e.g. drive replacement for T287838) we can just shut down the host, service it, switch it back on, and Ceph will do something reasonable?


Yes. And even if it's not quick, Ceph will manage; it may take more time and recovery traffic if the downtime is longer, but it will eventually get there.
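For planned quick maintenance there is also the option of setting the noout flag beforehand, which stops Ceph from marking the down OSDs out at all, so no rebalancing happens while the host is serviced; a sketch:

```shell
# Before taking the host down: keep down OSDs from being marked out
ceph osd set noout

# ... power off the host, replace the drive, boot it back up ...

# Once the OSDs are back up, remove the flag so normal handling resumes
ceph osd unset noout
```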