RESTBase is playing a dual role of a) a REST API proxy, and b) storage and -caching for content. The largest chunk of storage by volume (about <n> TB) is HTML and metadata for wiki page revisions. So far, RESTBase has stored revisions rendered over the course of about 1/2 year. To stay within provisioned storage limitations, we have repeatedly culled revisions older than this manually.
During this period, we added several nodes to the cluster, and used Cassandra's horizontal scaling to evenly distribute the load across the cluster. While horizontal scaling and multi-DC operation have worked very well overall, we have however found some scaling challenges with how data is stored and managed in individual instances.
## Per-instance scaling challenges
### Indexing performance of wide rows
Cassandra distributes data at the granularity of "rows". In the current RESTBase data model, all revisions of a page title are mapped to a single row, which enables listings and efficient delta compression between revisions. While most pages have only a few dozen revisions (average is just above 20 IIRC), there are some pages that are edited very frequently. The way Cassandra indexes data for a single row locally does not scale very well, which causes performance problems for those extremely frequently edited pages. The details of this issue are discussed in T94121 and T144431.
### Performance degradation from uncollected tombstones
Cassandra models deletions using "tombstones", markers that record the fact that a bit of data was deleted. These are kept around for a limited time, typically until the corresponding data is actually removed in an asynchronous compaction process. Compactions are applied to a subset of the dataset, and tombstones can only be cleaned up if it can be ruled out that they still apply to data outside the specific compaction. See T144431 for more details on this issue.
### Log-structured merge tree write amplification grows with instance size
Cassandra uses log-structured merge trees to store data on disk, and organizes and cleans these up in an asynchronous compaction process. Each pass of this process rewrites data. The number of times live data is rewritten on average is called **write amplification**. There are different compaction strategies that determine which subset of the dataset is compacted at which time. While some of these strategies can minimize write amplification, they do so at the cost of not cleaning up in a timely manner. In general-purpose strategies like LCS or STCS, write amplification grows with the per-instance dataset size.
A high write amplification is bad for several reasons: First, it directly causes more disk writes, along with associated costs of SSD wear and increased latency. Second, compaction is an expensive process that involves decompressing and re-compressing a lot of data. The larger the write amplification is, the more CPU time is spent on this process. There are also a lot of allocations as part of the process, which causes higher garbage collection overheads in Java.
### Repairs
As a robust distributed system, Cassandra needs to deal with many failure scenarios. Cassandra goes to great lengths to provide the greatest possible availability at given consistency constraints, and implements mechanisms to heal short term failures on the order of hours. However, to absolutely make sure the cluster is consistent, it is recommended to also run a repair process at regular intervals. This process compares data stored on different replicas, and repairs any discrepancies it finds.
We have not been able to finish the initial repair process successfully. The reason seems to be a very high write amplification from rewriting data once for every virtual node configured. In the production cluster, this amounts to data being rewritten 256 times before a repair would be done. Cassandra does implement an incremental repair process that would apply the repair only to new data since the last run, but since this requires a full repair first, we have not been able to use this process in practice.
### GC and implementation scaling
Cassandra is implemented in Java, a garbage collected language. While the JVM has a world-class garbage collector, it still has limits on the size of heap it can GC with reasonable pauses. This makes it difficult to fully leverage modern cost-effective hardware with a single instance.
Additionally, we encountered bootstrapping failures when using leveled compaction with large instances > 600G of compressed storage.
## Manual multi-instance management
To address several of these per-instance scaling challenges, we set up multiple instances per hardware node (see T93790). Overall, this was successful in lowering latencies & letting us use the hardware more effectively.
However, the manual management of many instances is quite time consuming. Administrative tasks like bootstrapping, cleaning, or reconfiguring instances take a lot more time when they have to be applied per-instance rather than per node. We now have some tools to help with this, but need to invest more effort into automation to reduce the overheads associated with managing many instances further.
# Next steps
Broadly, there are several directions we can explore to address these issues, some of which can be combined.
- Improve the Cassandra setup with config changes & possibly code changes.
- Possible config changes: Consider returning to leveled compaction, disable read repair, increase the number of instances per hardware node.
- Explore alternative storage backends.
- In particular, the fairly new ScyllaDB project looks potentially interesting because of its Cassandra / RESTBase compatibility, and the potential to address some (but not all) of the challenges we have seen with Cassandra (T125368).
- Change the data model to avoid wide rows and range queries.
- Issues:
- Increased complexity.
- If done aggressively (one row per revision), loss of delta compression support & roughly 5x higher storage space needs.
### Evaluation process
In order to determine which (combination of) options provides the best return on investment, we are currently collecting information on the different options. We intend to summarize and weigh the results of this research here in the coming weeks.