
Test/evaluate JBOD support
Closed, Resolved, Public

Description

CASSANDRA-6696 introduced improved support for JBOD, likely making it tractable for us for the first time.

There are two potential benefits for us: a) moving to a bona fide JBOD configuration would eliminate our RAID0-induced blast radius and make any future vertical scaling easier, and b) it may prove to be a valuable way of improving key locality, and in turn read latency.

We should configure at least one host in the dev environment to use multiple data file directories, evaluate the result, and experiment with compaction settings in this configuration.
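
For reference, multiple data file directories are declared in cassandra.yaml; a minimal sketch, assuming hypothetical mount points (the actual paths would follow whatever layout we settle on):

data_file_directories:
    - /srv/cassandra/data1
    - /srv/cassandra/data2
    - /srv/cassandra/data3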

Event Timeline

One complication with JBOD is matching disks to instances. With our current hardware, we don't have the right number of disks to divide up into a reasonable number of instances per hardware node.

One option that could offer a reasonable compromise between robustness and flexibility might be to use LVM's linear RAID-0 (no striping), and then set up one logical volume and filesystem per instance. Each instance's filesystem would be backed by 5/3 disks. A single-disk failure would take out one or two instances, depending on which disk fails, but should not take out all of them.

http://blu.org/pipermail/discuss/2006-December/027234.html discusses the behavior of LVM vs. md RAID-0 in the face of drive failures. At least LVM seems to be able to start up in a partial mode, which should allow continued operation on the unaffected filesystems.

In the last Ops/Services sync-up meeting we discussed using all of the disks for all of the instances, which would result in the following structure:

/srv/disk1/a
/srv/disk1/b
/srv/disk2/a
/srv/disk2/b
/srv/disk3/a
/srv/disk3/b

So each instance would have a directory (possibly a partition) on each disk. If Cassandra's JBOD support works as advertised, there shouldn't be any breakage if a disk fails, since each instance should be able to continue writing to the other directories (i.e. to disks that are fully operational). This hypothesis still needs to be tested, though.
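
In cassandra.yaml terms, instance a would then list its directory on every disk; a sketch, assuming the layout above (instance b would be configured analogously):

data_file_directories:
    - /srv/disk1/a
    - /srv/disk2/a
    - /srv/disk3/a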

One option that could offer a reasonable compromise between robustness and flexibility might be to use LVM's linear RAID-0 (no striping), and then set up one logical volume and filesystem per instance.

This would come at a performance penalty though, would it not?

Each instance's filesystem would be backed by 5/3 disks. A single-disk failure would take out one or two instances, depending on which disk fails, but should not take out all of them.

I don't think I follow this, how would the allocation of disks, volumes, and instances look?

http://blu.org/pipermail/discuss/2006-December/027234.html discusses the behavior of LVM vs. md RAID-0 in the face of drive failures. At least LVM seems to be able to start up in a partial mode, which should allow continued operation on the unaffected filesystems.

The tone of this email (at least where the author talks of running degraded, or recovering from failure) is really speculative; this would require some careful testing, I think.

In the last Ops/Services sync-up meeting we discussed using all of the disks for all of the instances, which would result in the following structure:

/srv/disk1/a
/srv/disk1/b
/srv/disk2/a
/srv/disk2/b
/srv/disk3/a
/srv/disk3/b

So each instance would have a directory (possibly a partition) on each disk. If Cassandra's JBOD support works as advertised, there shouldn't be any breakage if a disk fails, since each instance should be able to continue writing to the other directories (i.e. to disks that are fully operational). This hypothesis still needs to be tested, though.

Yes, so to clarify: If we used the best_effort policy, then each instance would continue to serve what it had, would treat anything from an unreadable disk as if it were simply not there (which is OK if we're doing quorum or better reads), and I hope that it would flush new tables to working disks, but that needs to be verified/tested (everything is there to make it work that way, I just don't (yet) know that it does).
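
For context, the policy referred to here is the disk_failure_policy setting in cassandra.yaml; a sketch of the configuration under discussion:

# best_effort: stop using the failed disk and serve reads from the
# SSTables that remain readable; whether new memtables are flushed to
# the surviving directories is the open question discussed above.
disk_failure_policy: best_effort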

...and I hope that it would flush new tables to working disks, but that needs to be verified/tested (everything is there to make it work that way, I just don't (yet) know that it does).

It is indeed coded to work that way (see: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/Directories.java#L456). We might still test this though.

Change 365081 had a related patch set uploaded (by Dzahn; owner: Eevans):
[operations/puppet@production] restbase: Configure an additional data file directory (dev)

https://gerrit.wikimedia.org/r/365081

Change 365081 merged by Dzahn:
[operations/puppet@production] restbase: Configure an additional data file directory (dev)

https://gerrit.wikimedia.org/r/365081

Mentioned in SAL (#wikimedia-operations) [2017-07-19T19:57:49Z] <urandom> Restarting Cassandra; restbase-dev1001-a to apply additional data_file_directory (T170276)

Change 368247 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Test additional data_file_directories

https://gerrit.wikimedia.org/r/368247

Change 368247 merged by Dzahn:
[operations/puppet@production] Test additional data_file_directories

https://gerrit.wikimedia.org/r/368247

This has been rolled out to the development and new RESTBase production environments as part of the work done in T169936: Services 2017/18 Q1 goal: Start gradual roll-out of Cassandra 3 & new schema to resolve storage scaling issues and OOM errors, and is working as advertised. Closing this issue as complete.