
Perform failover tests on Ceph storage cluster
Closed, Resolved · Public

Description

Ceph has been deployed with all services configured for high availability. Each component should be failed over to verify that the cluster remains operational.

Test cases

  • monitor host failure
  • OSD (object storage daemon) host failure
  • manager service failure
  • OSD drive failure
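
The test cases above can be driven with standard Ceph and systemd commands. A minimal sketch, assuming systemd-managed daemons; the daemon names, OSD id, and targets below are illustrative placeholders, not the exact units used during these tests.

  # optionally prevent automatic rebalancing during short, planned outages
  ceph osd set noout

  # monitor host failure: stop the mon on one host, confirm the other two keep quorum
  systemctl stop ceph-mon@cloudcephmon1001
  ceph quorum_status

  # manager service failure: fail the active mgr, confirm a standby takes over
  ceph mgr fail cloudcephmon1002
  ceph -s

  # OSD host failure: stop all OSD daemons on one host (or power it off) and watch recovery
  systemctl stop ceph-osd.target        # run on the OSD host under test
  ceph -s -w

  # OSD drive failure: stop a single OSD to emulate a failed drive
  systemctl stop ceph-osd@6
  ceph osd tree down

  # re-enable rebalancing once testing is complete
  ceph osd unset noout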

Event Timeline

Bstorm triaged this task as Medium priority. Jan 22 2020, 10:28 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-07T20:42:05Z] <jeh> ceph: OSD failover and recovery testing on cloudcephosd1003.wikimedia.org T240718

Mentioned in SAL (#wikimedia-operations) [2020-02-07T22:20:57Z] <jeh> ceph: round 2 OSD failover and recovery testing on cloudcephosd1003.wikimedia.org T240718

Validated full OSD server failure. During the outage, virtual machine and storage I/O were degraded but remained operational, and the storage cluster recovered on its own.
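
A quick way to verify the "degraded but operational" state during this kind of test is to check the active health warnings and degraded PG counts while pushing some client I/O through the pool. A minimal sketch; the pool name "compute" is a placeholder, since the paste below only shows that a single pool exists:

  # active warnings and degraded placement group counts
  ceph health detail
  ceph pg stat

  # confirm client writes still complete against the pool (placeholder name)
  rados bench -p compute 30 write --no-cleanup
  rados -p compute cleanup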

Cluster health logs:
cloudcephosd1001:~# ceph -s -w
  cluster:
    id:     5917e6d9-06a0-4928-827a-f489384975b1
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 (age 29m)
    mgr: cloudcephmon1002(active, since 5w), standbys: cloudcephmon1003, cloudcephmon1001
    osd: 24 osds: 24 up (since 43m), 24 in (since 43m)

  data:
    pools:   1 pools, 256 pgs
    objects: 1.68k objects, 6.5 GiB
    usage:   44 GiB used, 42 TiB / 42 TiB avail
    pgs:     256 active+clean


2020-02-07 22:21:25.379031 mon.cloudcephmon1001 [INF] osd.6 marked itself down
2020-02-07 22:21:25.379234 mon.cloudcephmon1001 [INF] osd.8 marked itself down
2020-02-07 22:21:25.379518 mon.cloudcephmon1001 [INF] osd.23 marked itself down
2020-02-07 22:21:25.379653 mon.cloudcephmon1001 [INF] osd.11 marked itself down
2020-02-07 22:21:25.380113 mon.cloudcephmon1001 [INF] osd.7 marked itself down
2020-02-07 22:21:25.381403 mon.cloudcephmon1001 [INF] osd.17 marked itself down
2020-02-07 22:21:25.382911 mon.cloudcephmon1001 [INF] osd.20 marked itself down
2020-02-07 22:21:25.403655 mon.cloudcephmon1001 [INF] osd.14 marked itself down
2020-02-07 22:21:25.433685 mon.cloudcephmon1001 [WRN] Health check failed: 8 osds down (OSD_DOWN)
2020-02-07 22:21:25.433743 mon.cloudcephmon1001 [WRN] Health check failed: 1 host (8 osds) down (OSD_HOST_DOWN)
2020-02-07 22:21:28.571886 mon.cloudcephmon1001 [WRN] Health check failed: Reduced data availability: 3 pgs inactive, 23 pgs peering (PG_AVAILABILITY)
2020-02-07 22:21:28.571946 mon.cloudcephmon1001 [WRN] Health check failed: Degraded data redundancy: 870/5040 objects degraded (17.262%), 132 pgs degraded (PG_DEGRADED)
2020-02-07 22:21:28.571972 mon.cloudcephmon1001 [WRN] Health check failed: too few PGs per OSD (25 < min 30) (TOO_FEW_PGS)
2020-02-07 22:21:31.960097 mon.cloudcephmon1001 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 3 pgs inactive, 23 pgs peering)
2020-02-07 22:21:35.998464 mon.cloudcephmon1001 [WRN] Health check update: Degraded data redundancy: 1680/5040 objects degraded (33.333%), 256 pgs degraded (PG_DEGRADED)
2020-02-07 22:21:35.998534 mon.cloudcephmon1001 [WRN] Health check update: too few PGs per OSD (21 < min 30) (TOO_FEW_PGS)
2020-02-07 22:22:28.698969 mon.cloudcephmon1001 [WRN] Health check update: Degraded data redundancy: 1680/5040 objects degraded (33.333%), 256 pgs degraded, 256 pgs undersized (PG_DEGRADED)
2020-02-07 22:23:39.736873 mon.cloudcephmon1001 [WRN] Health check update: 3 osds down (OSD_DOWN)
2020-02-07 22:23:39.736932 mon.cloudcephmon1001 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (8 osds) down)
2020-02-07 22:23:39.784294 mon.cloudcephmon1001 [INF] osd.6 [v2:208.80.154.153:6808/1838,v1:208.80.154.153:6809/1838] boot
2020-02-07 22:23:39.784398 mon.cloudcephmon1001 [INF] osd.7 [v2:208.80.154.153:6819/1883,v1:208.80.154.153:6821/1883] boot
2020-02-07 22:23:39.784457 mon.cloudcephmon1001 [INF] osd.8 [v2:208.80.154.153:6812/1839,v1:208.80.154.153:6813/1839] boot
2020-02-07 22:23:39.784503 mon.cloudcephmon1001 [INF] osd.11 [v2:208.80.154.153:6816/1865,v1:208.80.154.153:6817/1865] boot
2020-02-07 22:23:39.784545 mon.cloudcephmon1001 [INF] osd.14 [v2:208.80.154.153:6800/1837,v1:208.80.154.153:6801/1837] boot
2020-02-07 22:23:40.808201 mon.cloudcephmon1001 [WRN] Health check failed: Reduced data availability: 7 pgs peering (PG_AVAILABILITY)
2020-02-07 22:23:40.808261 mon.cloudcephmon1001 [WRN] Health check update: Degraded data redundancy: 1633/5040 objects degraded (32.401%), 249 pgs degraded, 249 pgs undersized (PG_DEGRADED)
2020-02-07 22:23:40.835172 mon.cloudcephmon1001 [INF] osd.17 [v2:208.80.154.153:6804/1836,v1:208.80.154.153:6805/1836] boot
2020-02-07 22:23:41.870390 mon.cloudcephmon1001 [INF] osd.23 [v2:208.80.154.153:6828/1945,v1:208.80.154.153:6829/1945] boot
2020-02-07 22:23:42.887785 mon.cloudcephmon1001 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2020-02-07 22:23:42.888056 mon.cloudcephmon1001 [WRN] Health check update: too few PGs per OSD (23 < min 30) (TOO_FEW_PGS)
2020-02-07 22:23:42.913094 mon.cloudcephmon1001 [INF] osd.20 [v2:208.80.154.153:6824/1898,v1:208.80.154.153:6825/1898] boot
2020-02-07 22:23:46.135110 mon.cloudcephmon1001 [WRN] Health check update: Degraded data redundancy: 281/5040 objects degraded (5.575%), 46 pgs degraded, 46 pgs undersized (PG_DEGRADED)
2020-02-07 22:23:46.135171 mon.cloudcephmon1001 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 7 pgs peering)
2020-02-07 22:23:46.135198 mon.cloudcephmon1001 [INF] Health check cleared: TOO_FEW_PGS (was: too few PGs per OSD (27 < min 30))
2020-02-07 22:23:48.690815 mon.cloudcephmon1001 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 281/5040 objects degraded (5.575%), 46 pgs degraded, 46 pgs undersized)
2020-02-07 22:23:48.690908 mon.cloudcephmon1001 [INF] Cluster is now healthy
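
The degraded-object numbers in the log are consistent with the cluster layout: 1.68k objects at three replicas gives 5040 object copies, and 1680/5040 (33.3%) degraded means exactly one copy of every object lived on the downed host, as expected for a size-3 pool with a host-level CRUSH failure domain spread across three 8-OSD hosts. A sketch of how those settings could be confirmed (pool name is a placeholder):

  ceph osd pool get compute size        # expect 3 (placeholder pool name)
  ceph osd crush rule dump              # chooseleaf step should use type "host"
  ceph osd tree                         # 24 OSDs spread across three hosts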