
Replace RAID0 arrays with RAID10 on aqs100[456]
Closed, Resolved · Public · 5 Story Points

Description

After a chat with Ops, we decided to move away from RAID0 arrays for the Cassandra instance partitions and to replace them with RAID10 arrays.

The new architecture is described at https://wikitech.wikimedia.org/wiki/User:Elukey/Ops/AQS_Settings

Since we have already loaded 4 months of data, we'll attempt to re-image one host at a time, giving Cassandra time to rebuild 2 instances at a time by streaming data from the other instances.

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 4 2016, 8:10 AM

Already started with aqs1004: instance a has completed, while instance b is still streaming data from the other instances. @JAllemandou checked on instance a that the data schemas were correct and that the settings were as expected; all good.

Nuria added a subscriber: Nuria. Aug 4 2016, 5:26 PM
  • Reimage the hosts
  • Restart the Cassandra instances; they will not have the data, but they will request it from the other instances (this process happens on each host separately)

It took 2 days to do 1 instance (still compacting); manual work was 2 hours.

Nuria changed the point value for this task from 0 to 5. Aug 4 2016, 5:26 PM

Mentioned in SAL [2016-08-08T09:34:48Z] <elukey> re-imaging aqs1005 to migrate Cassandra partitions to RAID10 (T142075)

TL;DR:

Today we decided to finish reimaging the hosts and to wipe the whole cluster to avoid possible data corruption issues.

Long explanation:

Starting status: 3 nodes, aqs100[456], with 6 Cassandra instances, aqs100[456]-[ab]

  • aqs1004 had been reimaged to deploy RAID10 arrays.
  • aqs100[56] had not been touched; they were working fine.

List of things done:

  1. We probably made a mistake by bootstrapping two instances, aqs1004-(a|b), at the same time, which left the parent host in a weird state. The SSTable sizes of the two Cassandra instances were not similar: one was a lot bigger than the other (~245GB vs ~360GB). We noticed that one of the instances, cassandra-b, had twice as many keys as the other, which didn't make sense, since each instance is supposed to get the same share of keys from the ring (via consistent hashing). We thought that the problem was due to bootstrap inconsistencies leading to "unused" keys on aqs1004-b, so we ran the following command to fix the problem:

aqs1004$ nodetool-b cleanup

  2. The command started removing unused keys as expected, and I decided to proceed with the aqs1005 OS reimage anyway, since the host was marked as up and running in the cluster. I also waited two hours before starting, since I wanted to observe the cleanup process. The final SSTables on aqs1004-b, though, ended up at ~170GB, a lot less than the ~240GB we expected, so we started to fear that the cleanup command had deleted too much data.
  3. At this point the damage was done: with two instances down because aqs1005 was being re-imaged, and one unreliable instance (aqs1004-b, the one with 170GB of data), we weren't able to repair the SSTables because of quorum failures. We decided to wipe the cluster to avoid any data inconsistency.
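The quorum failures in the last step follow from simple arithmetic. The sketch below is only an illustration: the replication factor used for the AQS keyspaces is not stated in this task, so `replication_factor=3` is an assumption, and the helper names are hypothetical. It shows how a token range that loses one replica to a re-image and has another replica it cannot trust falls below quorum.

```python
def quorum(replication_factor: int) -> int:
    """Cassandra's QUORUM size: a strict majority of the replicas."""
    return replication_factor // 2 + 1


def can_reach_quorum(replication_factor: int, healthy_replicas: int) -> bool:
    """True if enough trustworthy replicas are available for a QUORUM operation."""
    return healthy_replicas >= quorum(replication_factor)


# Assumed replication factor of 3 (not confirmed anywhere in this task).
rf = 3

# A token range with one replica down (host being re-imaged) and one
# replica considered unreliable has a single healthy replica left.
print(quorum(rf))               # majority needed out of 3 replicas
print(can_reach_quorum(rf, 1))  # one healthy replica is not a majority
print(can_reach_quorum(rf, 2))  # two healthy replicas are a majority
```

Under that assumption, a range needs 2 healthy replicas, which is why taking a host down while another instance was in doubt left no safe way to repair.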

Lessons learned:

  1. Bootstrapping multiple instances at the same time can lead to inconsistencies; never do that in production.
  2. nodetool cleanup does not seem to be a very safe operation, and it must be run in isolation, without performing any other cluster operations at the same time.
  3. Cassandra with a relatively small number of instances/nodes (6 and 3 respectively in our case) is very delicate and needs to be managed with extreme care. We need to write documents/howtos for various failure scenarios to prepare ourselves and to avoid issues like today's once the cluster serves production traffic.
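One way to make the first two lessons harder to skip is a pre-flight check that runs before any re-image or bootstrap. The following is a minimal sketch, not something that was used in this task: it assumes the text output of `nodetool status` is passed in as a string, and the function names are hypothetical. It relies only on the two-letter status/state code at the start of each node line (`UN` = Up/Normal, `UJ` = Up/Joining, `DN` = Down/Normal, and so on).

```python
import re
from typing import List, Tuple

# A node line in `nodetool status` starts with a status letter (U/D)
# followed by a state letter (N/L/J/M), then the node address.
NODE_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def nodes_not_up_normal(status_output: str) -> List[Tuple[str, str]]:
    """Return (address, code) for every node whose code is not 'UN'."""
    offenders = []
    for line in status_output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            code = match.group(1) + match.group(2)
            if code != "UN":
                offenders.append((match.group(3), code))
    return offenders


def safe_to_start_next_step(status_output: str) -> bool:
    """Only proceed when every listed node is Up and Normal."""
    return not nodes_not_up_normal(status_output)
```

This only covers node state: a running cleanup does not change a node's `UN` code, so it would need a separate check before the next cluster operation is started.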

The cluster is now clean and RAID10 has been deployed everywhere.

elukey moved this task from Next Up to Done on the Analytics-Kanban board. Aug 9 2016, 7:35 AM
Nuria closed this task as Resolved. Aug 16 2016, 3:09 PM