
Evaluate ScyllaDB as a near-term replacement to Cassandra
Closed, Resolved · Public

Description

Work-in-progress

Let's revisit Scylla once DTCS and encryption have landed. http://www.scylladb.com/technology/status/ is a good place to track progress.

Questions

Would Scylla allow us to return to leveled compaction?

Leveled compaction provided us with the best latency by tightly bounding SSTables/read. However, the high compaction throughput it required was a problem, as was an issue that only ever manifested under LCS and prevented bootstrapping (stream timeouts). Since Scylla partitions by core, each instance of compaction will be operating on a smaller subset of data. Would this, combined with the promise of more efficient CPU and IO utilization, make LCS tractable for us again?

NOTE: This also ultimately depends on the efficacy of tombstone GC under LCS.

Can Scylla offer any respite from our troubles with tombstone GC?

Our data models make use of partitions of unbounded width. Additionally, our revision retention policies create significant numbers of tombstones distributed evenly throughout. Over time these partitions become spread across all SSTables, creating overlap that prevents compactions from garbage-collecting tombstoned data. Out-of-order writes aggravate this problem by preventing the per-file maximum purgeable tombstone from being used to rule out such overlap, so having the capacity to repair (for example) might be one way in which Scylla could improve the situation. Having compaction applied to a smaller subset of data may also result in less overlap, and permit more aggressive compaction (to fewer files).
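To make the overlap effect concrete, here is a rough Python model of the purge check compaction performs (a simplification for illustration, not Cassandra's actual code): even past gc_grace_seconds, a tombstone cannot be dropped if any sstable outside the compaction holds older data for the same partition.

```python
import time

GC_GRACE_SECONDS = 10 * 24 * 3600   # table default (10 days)

def tombstone_purgeable(tombstone_write_ts, tombstone_deletion_time,
                        overlapping_min_timestamps, now=None):
    """Simplified purge check. overlapping_min_timestamps holds the minimum
    write timestamps of sstables that contain the same partition but are
    not part of this compaction."""
    now = now if now is not None else time.time()
    # gc_grace_seconds must have elapsed since the deletion...
    if tombstone_deletion_time > now - GC_GRACE_SECONDS:
        return False
    # ...and no sstable outside the compaction may hold data older than the
    # tombstone, or dropping it could resurrect that data.
    return all(tombstone_write_ts < ts for ts in overlapping_min_timestamps)

# A single overlapping sstable with older data (e.g. an out-of-order write)
# keeps the tombstone alive indefinitely:
print(tombstone_purgeable(2_000_000, 0, [1_500_000]))   # False
print(tombstone_purgeable(2_000_000, 0, [2_500_000]))   # True
```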

Would repairs of our entire dataset be tractable under Scylla?

Complete anti-entropy repairs are expensive. They require significant CPU and disk IO for validation and anti-compaction. Inefficiencies become compounded as the amount of data under repair increases (write amplification, over-streaming, etc).
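As a toy illustration of the over-streaming part (not the actual repair code): validation hashes each token range, and a single divergent row forces the whole range to be streamed.

```python
import hashlib

def range_hash(rows):
    h = hashlib.sha256()
    for key in sorted(rows):
        h.update(f"{key}:{rows[key]}".encode())
    return h.hexdigest()

def ranges_to_stream(local, remote, num_ranges=4):
    """Split keys into num_ranges buckets (a stand-in for token ranges),
    hash each bucket, and stream every bucket whose hashes differ."""
    streamed = []
    for i in range(num_ranges):
        l = {k: v for k, v in local.items() if k % num_ranges == i}
        r = {k: v for k, v in remote.items() if k % num_ranges == i}
        if range_hash(l) != range_hash(r):
            streamed.append((i, len(l)))     # all rows in the range get streamed
    return streamed

local = {k: "v" for k in range(1000)}
remote = dict(local)
remote[42] = "stale"                         # one divergent row...
print(ranges_to_stream(local, remote))       # ...causes 250 rows to be streamed
```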

Would Scylla's handling of very wide partitions be an improvement over Cassandra's?

Our data models make use of partitions of unbounded width. We have seen high-frequency edits of documents create 30G+ of history in a relatively short period (the cluster has been up less than 2 years, with several rounds of culling and targeted partition removals). These wide partitions create a number of problems, not least of which occur when queries attempt to collate too many results in memory and the resulting allocations cause OOM exceptions.
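For comparison, a minimal sketch (Python, DataStax driver) of paging over a wide partition instead of collating it in memory; the host, keyspace, table, and column names below are placeholders, not the actual RESTBase schema.

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['cassandra-test-host'])
session = cluster.connect('test_keyspace')

stmt = SimpleStatement(
    "SELECT rev, tid FROM html_by_title WHERE title = %s",
    fetch_size=500)                     # 500 rows per page rather than the whole partition

count = 0
for row in session.execute(stmt, ('Utente:Biobot/log',)):
    count += 1                          # the driver fetches subsequent pages transparently
print(count)
```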

Data models like this are considered an anti-pattern in Cassandra, but Scylla's marketing would seem to claim that the improved performance can be enough to brute-force your way past this:

Raw speed isn’t just a nice benchmark or a low cloud bill. Scylla’s order of magnitude improvement in performance opens a wide range of possibilities. Instead of designing a complex data model to achieve adequate performance, use a straightforward data model, eliminate complexity, and finish your NoSQL project in less time with fewer bugs.

Performance improvements enable not just reduction in hardware resources, but also a reduction in architecture, debugging, and devops labor. Scylla combines the simple and familiar Cassandra architecture with new power to build a data model that fits the application, not tweak the model to fit the server.

If true, this could save us the work of remodeling RESTBase storage.

Missing in ScyllaDB (as of v1.4)

NOTE: The above list is not meant to be comprehensive; it contains only those items which might be relevant to our environment.

Testing

Environment / Requirements

  • 16G RAM, or 2G per logical core (whichever is higher)
  • SSDs (RAID0)
  • XFS
  • 10 Gbps networking preferred

Experiments

The individual experiments that follow are meant to answer the questions above, and vet any theories and assumptions we may have. Unless stated otherwise, a constant for each is a workload that matches that of the RESTBase production environment. That is, both test data and the rate and disposition of requests should be proportional to what we see in production. This will require some work, since prior test methodologies haven't included much beyond performing html dumps.

Leveled compaction
TODO
Tombstone GC
TODO
Repair
TODO
Wide partitions
  1. Import the complete history of some very wide partitions
  2. Query for latest N results
  3. Page over entire partitions
NOTE: Ingestion should incorporate an apropos revision retention algorithm so that TTL-updated records are interleaved into the test data, in the same manner that they would be in production.
QUESTION: Are we able to reproduce (re)render history?
Metrics
  • Estimated droppable tombstone ratio
  • Memory consumption
  • Query latency
  • SSTables/read

Examples of exceptionally wide partitions:

  • local_group_wikipedia_T_parsoid_html/data:it.wikipedia.org:Utente\:Biobot/log (observed as high as 30G)
  • local_group_wikimedia_T_parsoid_html/data:commons.wikimedia.org:User\:Oxyman/Buildings_in_London/2014_October_11-20
  • local_group_wikipedia_T_parsoid_html/data:sv.wikipedia.org:Användare\:Lsjbot/Anomalier-PRIVAT

Event Timeline

Our data models make use of partitions of unbounded width.

We can define bounds on both partition and database size as a function of time. The issue is not about unbounded or not, but about hitting scaling issues fairly early, especially with a high number of SSTables hit per read.

Over time these partitions become distributed across all SSTables, creating an overlap that prevents compactions from garbage-collecting tombstoned data.

This only applies to compaction strategies that do not sort data by partition, thereby spreading it out across all sstables. It is not true for LCS.

Our data models make use of partitions of unbounded width.

We can define bounds on both partition and database size as a function of time.

Then we should, IMO.

The issue is not about unbounded or not, but about hitting scaling issues fairly early, especially with a high number of SSTables hit per read.

We have a structure, the partition, and it does not scale indefinitely (regardless of whether we're talking about Cassandra or Scylla). You could argue that this is an unacceptable property of these systems, but it is nonetheless true. We have a data model which enforces no constraints on the size and disposition of the data it stores in a partition. We want to write data to these partitions without enforcing constraints, until such a time that any encountered scaling issues are no longer considered "fairly early".

We can quibble on terminology, but so long as the problem can be paraphrased as: We want to store as many revisions as are generated, for a really long time, I'm going to continue thinking in terms of unbounded (or arbitrary, if you prefer).

Over time these partitions become distributed across all SSTables, creating an overlap that prevents compactions from garbage-collecting tombstoned data.

This only applies to compaction strategies that do not sort data by partition, thereby spreading it out across all sstables. It is not true for LCS.

This isn't true. LCS suffers from the overlapping tables problem and associated tombstone GC issues too. While it stores data in non-overlapping ranges per level, tables can and will overlap between levels.

We can quibble on terminology, but so long as the problem can be paraphrased as: We want to store as many revisions as are generated, for a really long time, I'm going to continue thinking in terms of unbounded (or arbitrary, if you prefer).

@Eevans, my point is that the scaling issues we run into are about real-valued numbers, and not about "bounded" vs. "unbounded". We all know that no storage system can handle "unbounded" or "infinite" load. All engineering is firmly in the bounded and finite domain.

As engineers, our job is to estimate loads, scale etc, add some safety margin, and then evaluate possible solutions against those criteria. This task is no different. I proposed concrete values for the time horizon & update rates based on current edge cases in the email thread. I also suggested to add a safety margin. If you want to be sure that the bound will hold, make that safety margin 10x.

This is really no different from designing a bridge in San Francisco. You could argue that it's impossible because earthquakes and wind loads have no hard upper bound. Actual bridge engineers use statistics and safety factors to establish design parameters, and then go ahead and design that bridge.

This isn't true. LCS suffers from the overlapping tables problem and associated tombstone GC issues too. While it stores data in non-overlapping ranges per level, tables can and will overlap between levels.

There are two parts to this:

First, the number of overlapping SSTables is bounded by the number of levels. With reasonable instance sizes, there are about 6 levels. Since there is a hard upper bound for the amount of overlap for a given partition, this gives us a low number of SSTables a partition can be spread out across.
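A back-of-the-envelope check of that figure (assuming Cassandra's default 160 MB LCS sstable size and a fanout of 10; both are tunable per table):

```python
def lcs_levels(data_bytes, sstable_mb=160, fanout=10):
    """Level Ln holds roughly sstable_size * fanout**n of data, so the number
    of levels grows with the logarithm of the data per instance."""
    level, placed = 0, 0
    while placed < data_bytes:
        level += 1
        placed += sstable_mb * 1024 ** 2 * fanout ** level
    return level + 1                       # +1 for L0, where fresh flushes land

for tb in (0.5, 1, 2, 4):
    print(f"{tb} TB/instance -> {lcs_levels(int(tb * 1024 ** 4))} levels")
# 0.5-1 TB -> 5 levels; 2-4 TB -> 6 levels
```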

The second part is about how data for a frequently-updated partition moves through these levels. New data is added at the topmost level. Eventually, more incoming data will trigger a promotion and compaction downwards to the next level. Because of the temporal locality between re-renders & corresponding deletes, the vast majority of tombstones from typical RESTBase updates (re-renders of the latest revision of a frequently re-rendered page) are likely to be GC'ed fairly early when propagating between the medium levels, as both the revision itself & its tombstone end up in the same level & thus SSTable. For rarely-rerendered pages, there are few tombstones in the first place, and even those are (eventually) gc'ed in the lowest level.

In contrast, time-based compaction strategies like TWCS produce a series of overlapping SSTables, one per time unit. Since there is no automatic compaction between time windows, tombstones pertaining to data outside its window won't be removed at all. Manual compactions can try to brute-force past this by periodically compacting old data into a single SSTable, but this is likely to cause more write amplification than LCS.
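For illustration, a tiny model of the TWCS bucketing rule (the window size and dates are arbitrary): sstables are grouped by time window and only sstables in the same window are compacted together, so a tombstone never meets older data from an earlier window.

```python
from datetime import datetime, timezone

WINDOW_DAYS = 1

def twcs_window(write_time):
    """Assign a write to its TWCS time window; compaction only ever combines
    sstables that fall into the same window."""
    return int(write_time.timestamp()) // (86400 * WINDOW_DAYS)

data_window      = twcs_window(datetime(2016, 10, 1, tzinfo=timezone.utc))
tombstone_window = twcs_window(datetime(2016, 12, 1, tzinfo=timezone.utc))

# The delete lands in a later window than the data it shadows, so no automatic
# TWCS compaction will ever put them in the same sstable:
print(data_window == tombstone_window)   # False
```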

As engineers, our job is to estimate loads, scale etc, add some safety margin, and then evaluate possible solutions against those criteria. This task is no different. I proposed concrete values for the time horizon & update rates based on current edge cases in the email thread. I also suggested to add a safety margin. If you want to be sure that the bound will hold, make that safety margin 10x.

That wasn't clear to me (still isn't), but assuming this is the relevant part:

>> If you want a very conservative upper bound, we could take the largest page
>> by size in the database, and assume that this will be edited once per second
>> for the next 50 years. This is pretty far from anything realistic though, so
>> I would suggest to go with the largest / fastest-growing article we actually
>> have in the database. Wikipedia:Administrators'_noticeboard/Incidents or one
>> of the bot log pages should be decent candidates.
>
> On what horizon though?  If we take something like
> Wikipedia:Administrators'_noticeboard/Incidents, assume a constant
> size and edit frequency, and a single render per revision, for what
> period of time do we consider?

With any reasonable time horizon (say, current history plus 10 more years) and
normal engineering margins, the exact choice should not make or break a solution.

10 years of Wikipedia:Administrators'_noticeboard/Incidents would be ~1.4M columns and 286G; if we use a 10x safety margin, then we need to be able to accommodate partitions of 13,505,000 columns and 2.8T in size.
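For reference, the arithmetic behind those numbers (the base figures are the estimates above, not measurements):

```python
columns_10y = 1_350_500          # ~1.4M columns over 10 years (estimate)
bytes_10y   = 286 * 1024 ** 3    # ~286G over 10 years (estimate)
margin      = 10                 # safety margin

print(f"{columns_10y * margin:,} columns")            # 13,505,000 columns
print(f"{bytes_10y * margin / 1024 ** 4:.1f} TiB")    # ~2.8 TiB
```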

This is really no different from designing a bridge in San Francisco. You could argue that it's impossible because earthquakes and wind loads have no hard upper bound. Actual bridge engineers use statistics and safety factors to establish design parameters, and then go ahead and design that bridge.

Context might have gotten lost between this ticket and the mailing list thread, so let me try to reintroduce it.

When it comes to evaluating ScyllaDB's performance under wide rows (or for that matter, when it comes to evaluating alternative data models, or alternative storage technologies), we should not be testing against current figures, but against histories that we expect to be able to store. I want that upper bound (even if it is an estimate that includes a safety margin) in order to ensure we're testing for the right thing.

This isn't true. LCS suffers from the overlapping tables problem and associated tombstone GC issues too. While it stores data in non-overlapping ranges per level, tables can and will overlap between levels.

There are two parts to this:

First, the number of overlapping SSTables is bounded by the number of levels. With reasonable instance sizes, there are about 6 levels. Since there is a hard upper bound for the amount of overlap for a given partition, this gives us a low number of SSTables a partition can be spread out across.

It only takes 1 overlapping table to disrupt tombstone GC. This isn't hypothetical either (in fact, it's common in my observations). This effect might be less pronounced in LCS in practice, but IME that's far from established.

The second part is about how data for a frequently-updated partition moves through these levels. New data is added at the topmost level. Eventually, more incoming data will trigger a promotion and compaction downwards to the next level. Because of the temporal locality between re-renders & corresponding deletes, the vast majority of tombstones from typical RESTBase updates (re-renders of the latest revision of a frequently re-rendered page) are likely to be GC'ed fairly early when propagating between the medium levels, as both the revision itself & its tombstone end up in the same level & thus SSTable. For rarely-rerendered pages, there are few tombstones in the first place, and even those are (eventually) gc'ed in the lowest level.

For any given tombstoned data, eventually could be a very long time (for all intents and purposes, never). Compaction throughput gets (should get) extremely low on the highest levels. Meanwhile, the closer that overlapping data gets to being resolved in a higher level, the more overlapping data accumulates in the lower levels.

In contrast, time-based compaction strategies like TWCS produce a series of overlapping SSTables, one per time unit. Since there is no automatic compaction between time windows, tombstones pertaining to data outside its window won't be removed at all. Manual compactions can try to brute-force past this by periodically compacting old data into a single SSTable, but this is likely to cause more write amplification than LCS.

I don't know about more write amplification (pretty sure that would not work out to be the case), but you're right, without intervention out-of-order writes are even less likely to be resolved on their own. However, I don't think user-defined compactions like the ones we're currently performing on TWCS are even an option with LCS.

Eevans renamed this task from Evaluate SycllaDB as a near-term replacement to Cassandra to Evaluate ScyllaDB as a near-term replacement to Cassandra. Dec 5 2016, 4:46 PM
Eevans updated the task description.

10 years of Wikipedia:Administrators'_noticeboard/Incidents would be ~1.4M columns and 286G; if we use a 10x safety margin, then we need to be able to accommodate partitions of 13,505,000 columns and 2.8T in size.

I assume that this is the raw data size, without any compression? In terms of scalability, I believe the number of columns per SSTable is more relevant, as it affects the size of SSTable indexes and compression metadata.

It only takes 1 overlapping table to disrupt tombstone GC.

With LCS, there are no overlaps within a single level. The only overlaps can be between levels, and those are resolved once the data is promoted to the lower level(s), where the SSTable size is a multiple of that of the level above. Given the temporal locality of re-renders of frequently-rerendered pages (each re-render of the latest revision deleting the previous one), and a high frequency of level promotions driven by frequent updates to the partition, data and corresponding tombstones will end up in the same compaction fairly quickly, where they will be GC'ed.

Very rarely edited and re-rendered articles don't suffer from major wide row issues, so a larger delay in tombstone GC there is not likely to cause observable performance issues.

Hmm, thinking about this some more, I now doubt that my earlier statement on overlaps is indeed correct:

The only overlaps can be between levels, and those are resolved once the data is promoted to the lower level(s), where the SSTable size is a multiple of that of the level above. Given the temporal locality of re-renders of frequently-rerendered pages (each re-render of the latest revision deleting the previous one), and a high frequency of level promotions driven by frequent updates to the partition, data and corresponding tombstones will end up in the same compaction fairly quickly, where they will be GC'ed.

While the part about overlaps within a level still holds, I now think that the chance of apparent overlaps between levels is fairly high. Assuming Cassandra uses only partition keys and minimum sstable timestamps to determine overlaps, sstables containing a partition are very likely to appear to overlap. With read repair disabled (not currently true for RESTBase) & writes using current timestamps (true for RESTBase), overlaps should only be downwards. This means tombstones for wide rows would be finally collected in the lowest level. While this might be relatively timely for the most frequently re-rendered articles, it could take longer for less frequently edited articles. The actual overwritten / deleted data for frequently re-rendered articles is collected more quickly, as it is very likely to end up in the same compaction as its tombstone early on due to temporal locality. This means that the amount of wasted space should remain fairly small.

The false overlap problem points back to the proposal I made in T94121#2710479:

Improve Cassandra by introducing some efficient range key summary. This summary would provide min & possibly max values per partition key, which in turn would let Cassandra rule out SSTables for range queries based on this information.

The same min/max information would also enable more efficient tombstone collections by avoiding false sstable overlaps. A similar effect could perhaps be achieved by scanning the actual range keys during compaction. This would avoid adding information to the index, but is more costly in terms of read IO.
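A minimal sketch of what such a summary could enable (the names and structure are hypothetical; this is not an existing Cassandra or Scylla API):

```python
from collections import namedtuple

# Hypothetical per-sstable metadata: min/max clustering ("range") key for a partition.
SSTableSummary = namedtuple("SSTableSummary", "name min_ck max_ck")

def candidate_sstables(summaries, query_min, query_max):
    """Keep only sstables whose clustering-key range intersects the queried
    range; without this summary, every sstable containing the partition key
    would have to be considered."""
    return [s for s in summaries
            if not (s.max_ck < query_min or s.min_ck > query_max)]

summaries = [
    SSTableSummary("sstable-1", min_ck=1000, max_ck=2000),
    SSTableSummary("sstable-2", min_ck=2001, max_ck=3000),
    SSTableSummary("sstable-3", min_ck=9000, max_ck=9500),
]

# A query for recent revisions (clustering keys 9400..9500) can skip all but one:
print(candidate_sstables(summaries, 9400, 9500))
```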

In any case, disabling read repairs seems like a worthwhile thing to consider.

Just from an operational perspective, this would be a large undertaking, with questionable benefit (at some level we'd almost certainly be trading one set of problems for another).

Additionally, we've since committed to a storage strategy that requires range deletes (deletes using a non-equality operation for the predicate), pushing us further outside the list of Scylla-compatible features; I believe we can rule out pursuing this in the near term.