
RESTBase k-r-v as Cassandra anti-pattern
Closed, Resolved · Public

Description

One objective for RESTBase is to retain a complete history of transformations to power future alternative use-cases. However, storing large histories has created a number of challenges, so far met with efforts such as the revision retention policies described below.

Cassandra 101

Cassandra is fundamentally a key-value store: keys dictate placement in the cluster (distribution), and are typically referred to as partitions. Values are sorted arrays of columns that are indexed and can be queried (with some limitations).

[ key ]( 1 )( 2 ) ... ( N )

Additionally, there are abstractions (closely coupled to CQL) that allow you to map a tabular structure onto what would otherwise be a flat, sorted array of columns. Consider the following example:

-- Table to store tracks by-artist and album
CREATE TABLE music (
  artist text,
  album text,
  track text,
  PRIMARY KEY(artist, album, track)
);

The schema above has three attributes: artist, album, and track. artist is the first element of the PRIMARY KEY, which makes it the partition key; each subsequent attribute establishes a many-to-one relationship to the one preceding it (so there can be many albums for each artist, and many tracks for each album).

INSERTing and SELECTing data work just as you would expect...

cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'As I Am');
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'This Dying Soul');
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Endless Sacrifice');
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Honor Thy Father');
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Vacant');
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Stream of Consciousness');
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'In the Name of God');
cqlsh:db> SELECT * FROM music WHERE artist = 'Dream Theater';

 artist        | album            | track
---------------+------------------+-------------------------
 Dream Theater | Train of Thought |                 As I Am
 Dream Theater | Train of Thought |       Endless Sacrifice
 Dream Theater | Train of Thought |        Honor Thy Father
 Dream Theater | Train of Thought |      In the Name of God
 Dream Theater | Train of Thought | Stream of Consciousness
 Dream Theater | Train of Thought |         This Dying Soul
 Dream Theater | Train of Thought |                  Vacant

(7 rows)
cqlsh:db>

...but it's important to realize that under the covers, it is still an array of column values bound to a single key/partition, something like:

[ Dream Theater ]( Train of Thought,As I Am )( Train of Thought,Endless Sacrifice )( ... )( Train of Thought,Vacant )

On disk

Cassandra uses a log-structured merge-tree. As values are written, data is periodically flushed to disk into immutable files (log-structured), and the data for a given partition can (and will) be persisted across multiple files. Reads collate results from the applicable files on disk (merge-tree), and compaction runs asynchronously to combine files and keep merge complexity bounded.

Since storage is log-structured, updates do not happen in place: a new record is appended with a more recent timestamp, and the merge-on-read reconciles what is current. Deletes are a special case of update: they insert a special marker, or tombstone, to indicate that any preceding values should not be considered in results. Additionally, a TTL property can be associated with a record to indicate its time-to-live, after which it becomes de facto tombstoned. In each of these cases, it falls to compaction to garbage-collect the obsolete/tombstoned data.
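
To make the tombstone and TTL semantics concrete, here is a hedged sketch against the music table above (the track chosen and the 1-hour TTL are purely illustrative):

-- Deleting a row writes a tombstone; the old data remains on disk until
-- compaction garbage-collects it.
DELETE FROM music
WHERE artist = 'Dream Theater' AND album = 'Train of Thought' AND track = 'Vacant';

-- Writing with a TTL makes the row expire (become de facto tombstoned)
-- an hour later; again, only compaction actually reclaims the space.
INSERT INTO music (artist, album, track)
VALUES ('Dream Theater', 'Train of Thought', 'Vacant')
USING TTL 3600;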

RESTBase 101

Now consider a (simplified) version of RESTBase's key-rev-value data-model:

CREATE TABLE data (
  key text,
  rev int,
  tid timeuuid,
  value blob,
  PRIMARY KEY (key, rev, tid)
);

In the schema above, key is the document title, rev the monotonically increasing revision (which comes from MediaWiki), tid a time-based UUID generated at the time of save, and value the content being stored (Parsoid-generated HTML, for example). An insertion happens for each new MediaWiki revision, but also each time the content needs to be re-rendered (for example, if a template edit would result in different Parsoid output).
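
For illustration, here is a hedged sketch of what a write and a read against this table might look like (the title, revision number, and HTML value are made up):

-- Store a new render of (hypothetical) revision 42 of a document.
INSERT INTO data (key, rev, tid, value)
VALUES ('Main_Page', 42, now(), textAsBlob('<html>...</html>'));

-- Fetch the most recent render of the most recent revision of that document.
SELECT rev, tid, value FROM data
WHERE key = 'Main_Page'
ORDER BY rev DESC, tid DESC
LIMIT 1;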

Revision retention policies

In an effort to somewhat limit the growth of partitions over time, we implemented what we call revision retention policies. Each time a new write completes, an async read is performed to query for some number of historical entries, a policy is applied to the results, and values which are excluded by policy are updated to include a TTL.
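
A minimal sketch of that mechanism, assuming a hypothetical title, revision, tid, and a 24-hour TTL (the real policy logic lives in RESTBase, not in CQL):

-- 1. After a write completes, read back the historical renders of the
--    affected revision so the policy can decide what to keep.
SELECT tid FROM data WHERE key = 'Main_Page' AND rev = 42;

-- 2. Re-write each render the policy excludes, with a TTL attached; Cassandra
--    only applies a TTL to the data written by that statement, so the value
--    has to be written again for the expiry to take effect.
UPDATE data USING TTL 86400
SET value = textAsBlob('<html>...</html>')
WHERE key = 'Main_Page' AND rev = 42 AND tid = 50554d6e-29bb-11e5-b345-feff819cdc9f;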

The Problems

Width of Partitions

Both the contrived data model above for storing audio tracks, and the RESTBase one, could be described as "unbounded". They are unbounded in the sense that a partition will continue to grow for as long as artists release new tracks, or for as many times as a given document is edited (and/or its content re-rendered). This isn't necessarily a bad thing if the practical limits on the growth of data are reasonable (i.e. how many tracks could a single artist release in their lifetime? Thousands? Tens of thousands?). In the case of RESTBase, the values themselves are large (ranging anywhere from tens of KB to tens of MB in size), and the number of values per partition can grow very rapidly. Even after numerous efforts at culling data, we have seen partitions exceeding 30 GB(!) in size.

The closest thing to an official recommendation on partition size is the default at which Cassandra issues a warning: 100 MB. This may be somewhat on the conservative side; hundreds of MB is probably OK (GBs is not).

The oft-cited issues with wide partitions (and we've experienced many of these):

  • Heap pressure on compaction
  • Heap pressure on query
  • High(er) query latency
  • Repair inefficiencies (hashes are calculated per-partition)
  • Streaming inefficiencies (partitions are the smallest unit of streaming)

Cassandra is reasonably efficient at querying a portion of a partition. So long as a query doesn't require materializing an excessive number of results on heap, (query) memory utilization will be reasonable. Any strict upper bound on partition width comes from per-partition overheads, most significantly row indexing. Some remedy for this may be possible (see: CASSANDRA-11206 and CASSANDRA-9754).
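
As a hedged illustration of such a partial read against the key-rev-value table above (title and revision numbers are made up), a slice query only materializes the requested clustering range rather than the whole partition:

-- Read a bounded slice of a (potentially very wide) partition.
SELECT rev, tid FROM data
WHERE key = 'Main_Page' AND rev >= 40 AND rev <= 42;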

Revision Retention Policies

Revision retention policies were introduced in anger to address higher than expected storage growth after the initial rollout of RESTBase. Somewhat ironically, they create scaling issues of their own (albeit on a nearer-term basis).

As mentioned earlier, each write triggers a read, and at least one additional write to apply the retention policy. The additional write sets a TTL which will eventually cause the record to be tombstoned. Subsequent reads must collate these tombstones with non-tombstoned data. With enough re-renders of a specific revision, the corresponding query (the read) can easily result in an over-allocation and an OOM exception (as witnessed here and here).

Tombstone GC

Another side-effect of the revision retention policy pattern is the significant number of tombstones that result, and their distribution throughout the files on disk.

As you may recall, Cassandra garbage-collects tombstoned data during compaction. One issue with this process is the so-called overlapping tables problem. A compaction typically involves only a subset of the on-disk files, and tombstoned data can only be GC'd when the tables under consideration represent the whole picture: you cannot drop otherwise-expired data if there is a possibility that older (preceding) values exist in other files (tables not part of the current compaction).

Files that are not part of a compaction can be ruled out as overlapping if they do not contain the partition in question. Additionally, tombstoned data may be evicted if the maximum droppable value (the minimum timestamp across all partition-containing SSTables not under compaction) is greater than or equal to its timestamp. For example (with hypothetical timestamps): a tombstone written at t=200 cannot be dropped while a non-participating SSTable containing the same partition has a minimum timestamp of 150, because that file could still hold older data the tombstone needs to suppress. Since partitions in RESTBase run quite wide, the former is of little help in eliminating overlap (partitions tend to span all tables), and out-of-order writes generated by read-repair defeat the latter. This leaves us with partitions which remain wide due to uncollectable tombstoned data distributed throughout the partition.


The Big 3

I believe that, for all intents and purposes, RESTBase's use of Cassandra could be considered an anti-pattern, and I see 3 (broad) possible courses of action (in no particular order):

  • Rethink RESTBase's data model and access
  • Invest effort into solutions to make Cassandra better accommodate RESTBase's data model/access
  • Identify alternatives to Cassandra

For either #2 or #3 above, I believe it is imperative that we explicitly establish up front how wide is wide enough, or whether it is truly our objective to be able to store partitions of unlimited width.


Event Timeline


and TTL-containing records (or tombstones) can only be GC'd when the tables under consideration represent the whole picture

This is primarily an issue when data overlaps a lot of sstables and never migrates into the same compaction. Leveled compaction partly avoids this problem by eventually compacting all related data in the lowest level. The cost is higher compaction load & throughput, as well as (in 2.1, at least) limitations on instance size. This is why we moved from LCS to DTCS last year.

Another option, applicable to all compaction strategies, would be to check other sstables' contents to determine collectability during compaction. This would increase read volume during compactions, but would allow tombstone removal even if not all sstables containing the partition key are part of the compaction.

Eevans triaged this task as Medium priority. Aug 31 2016, 8:23 PM

Hi @Eevans. Please associate at least one project with this task to allow others to find this task when searching in the corresponding project(s). Thanks! (Assuming this is RESTBase.)

it was discovered that exactly one file contained a repairedAt timestamp

So to me it sounds like the issue boils down to us a) having done an incremental repair a long time ago, and b) not having continued to do so.

To avoid the issue, we could do one of the following:

a) remove repairedAt from all sstables (tools/bin/sstablerepairedset --is-unrepaired <sstable>), and don't use incremental repairs, or
b) run incremental repairs at regular intervals.

In both cases, all older data would end up on one side of the anticompaction divide, enabling the collection of tombstones & avoiding this problem.

To me, option b) looks more appealing longer term than a). This would also address issues like T108611 and T92355.

Related: See this article for background on anticompaction in 2.1.

To me, option b) looks more appealing longer term than a). This would also address issues like T108611 and T92355.

As it stands now, any normal compaction that does not include all tables containing a key cannot GC any droppable tombstones for that key. As it stands now, the distribution of keys across tables is significant (something close to All Of Them), for a significant portion of our dataset (significant enough that I'm tempted to say All Of Them), making anything short of a major compaction woefully inadequate. Running incremental repairs will only exacerbate this problem by creating two distinct compaction pools where files (by design) cannot be compacted with one another; incremental repairs will widen the set of tables not under consideration for any given compaction.

As it stands now, any normal compaction that does not include all tables containing a key, cannot GC any droppable tombstones for that key.

That doesn't agree with how this comment by Marcus Eriksson describes the GC algorithm for 2.0, and there have been further improvements since. Unless that comment is incorrect, it appears that Cassandra does look at tables that are not part of the compaction itself to determine whether to drop a given tombstone. It might not actually look at per-row lifetimes to do this (to be verified for 2.2), but this is definitely different from "can't collect tombstones unless all sstables containing a row are part of a single compaction".

Running incremental repairs will only exacerbate this problem by creating two distinct compaction pools where files (by design) cannot be compacted with one another; Incremental repairs will widen the set of tables not under consideration for any given compaction.

Running incremental repairs nightly or so would mean that basically all but the very latest data would be marked as repaired, and thus end up on one side of the anticompaction divide. Considering that with our gc_grace and retention policy settings *all* GC-able content will be in the repaired sstable set, this should result in tombstones getting cleaned up. Another benefit of incremental repairs would be a drop in read repairs, and thus fewer out-of-order writes. This should make DTCS more effective, and further improve the grouping of sstables for compaction.

This presentation about compaction behavior in 2.1, by Aaron Morton, expands a bit on Marcus' comment re maxPurgeableTimestamp:

From slide 40:

[three slide screenshots omitted]

So it looks like the way Cassandra considers data in sstables outside the compaction has become quite fine-grained in 2.1.

Here is a source comment defining maxPurgeableTimestamp precisely:

/**
 * @return the largest timestamp before which it's okay to drop tombstones for the given partition;
 * i.e., after the maxPurgeableTimestamp there may exist newer data that still needs to be suppressed
 * in other sstables.  This returns the minimum timestamp for any SSTable that contains this partition and is not
 * participating in this compaction, or memtable that contains this partition,
 * or LONG.MAX_VALUE if no SSTable or memtable exist.
 */

If you run incremental repair regularly (every night or so), the data in the unrepaired set of sstables will (almost) always be newer than the data in the repaired sstables. This means that we will not block any compactions among the repaired sstables from dropping tombstones based on the sstables in the unrepaired set (i.e., a tombstone in the repaired set will always be older than the data in the unrepaired set, and therefore cannot cover any data there). And since data will move quickly from unrepaired to repaired, this is not a problem in the unrepaired set either.

If you are running DTCS, you should probably migrate to TWCS, as that is more robust.

Let me know if there are any questions.

@Krummas, thanks for weighing in here!

Good point about TWCS as well. We have a task to investigate that at T133395.

Eevans updated the task description.

See T94121#2710479 for a summary of my earlier investigation of the wide row issue.

Shouldn't this awesome explanation be on wikitech for future reference? cc @Eevans

Beyond the explicitly stated objectives for RESTBase has been a mandate to store (more or less) complete history of transformations in order to enable future alternative use-cases.

Which future alternative use-cases? Where are these objectives and is this mandate stated somewhere? Is there a task/doc describing those?

Would this be a good topic to discuss during the upcoming Dev Summit?

https://phabricator.wikimedia.org/project/view/2205/

Beyond the explicitly stated objectives for RESTBase has been a mandate to store (more or less) complete history of transformations in order to enable future alternative use-cases.

Which future alternative use-cases? Where are these objectives and is this mandate stated somewhere? Is there a task/doc describing those?

Eric's wording here is a bit misleading. Ever since the original RFC, we have always made it clear that the eventual goal is to be able to store HTML, wikitext and metadata for all revisions. See for example T97692, T97710, and T100705, as well as https://www.mediawiki.org/wiki/Parsing#.5BDRAFT.5D_Long-term_directions_as_of_November_2016.

We deliberately started with a subset of revisions, as there were too many unknowns that required research. We have since established what is possible in terms of compression ratios and costs, and found instance scaling issues in Cassandra:

  • LSM write amplification grows with instance size.
  • A compaction strategy that bounds the number of SSTables per read (such as LCS) is not feasible with large instances. This contributes to / causes wide-row performance issues with range queries.
  • Manual management of multiple instances per hardware nodes is hard to scale.
  • Repairs cause write amplification on the order of configured vnodes (256 by default).

We are now investigating solutions to these issues (see for example T125368).

Eric's wording here is a bit misleading. Ever since the original RFC, we have always made it clear that the eventual goal is to be able to store HTML, wikitext and metadata for all revisions. See for example T97692, T97710, and T100705, as well as https://www.mediawiki.org/wiki/Parsing#.5BDRAFT.5D_Long-term_directions_as_of_November_2016.

Great, good to know more details and context, thanks! It's worth pointing out, however, that from what you linked to:

  • T97692: quite controversial and we hardly reached a consensus; in fact, both you and @mark agreed there that this wasn't "the place to discuss high level roadmaps",
  • T97710: a RESTBase internal task, resolved without comments. If I recall correctly, we discussed this during a services/ops syncup and, similarly with the above, agreed that it wasn't the time or space to discuss this.
  • T100705: still unresolved, consensus not reached in the task (far from it)
  • T93751 (which you didn't mention): "Next steps for long-term revision storage", still unresolved
  • Original RESTBase RFC: I don't see anything relevant in the RfC nor do I remember discussing a longer-time archival strategy (correct me if I'm wrong, it's been a while)
  • Parsing team's November 2016: I don't see anything relevant in those notes; what exactly were you referring to?

So this is hardly a matter on which we've broadly agreed, IMHO. To be clear, when I say "we", I don't mean SRE specifically. I do not see this (nor do I want to make it) as a services/ops debate; it's bigger than just us two.

The role of RESTBase has been ambiguous across the foundation since its inception; informal polling of engineers across technology and product often yields completely different responses, usually spanning the spectrum from "it's a cache" to "it's permanent, persistent storage for revisions". Some of the proposed changes (of two separate clusters, I believe) are going to blur the picture even further. The choice of technology (Cassandra, evaluation of ScyllaDB, etc.) is affected quite a bit by the use cases, which are still, to this day, unclear (again, IMHO).

I don't have a strong opinion on what the right process is, and I should say, it's ultimately your (Services') time. But my personal recommendation, in order to avoid further friction and wasted effort across our teams, would be to propose your vision for this architecture and try to establish consensus for it in a broader forum. We've gone long enough without actually discussing this much, and it looks like we are at a crossroads with regards to RESTBase's architectural future right now, so this sounds like a good time to do so.

Original RESTBase RFC: I don't see anything relevant in the RfC nor do I remember discussing a longer-time archival strategy (correct me if I'm wrong, it's been a while)

This RFC described a revision storage service as an example of a more general storage service abstraction. It was proposed as the first major service component to be split out of MediaWiki, as a first step toward a SOA architecture. The interfaces clearly describe access to arbitrary revisions, and the related research has always emphasized scaling to full history storage [1,2].

Parsing team's November 2016: I don't see anything relevant in those notes; what exactly were you referring to?

As you know, we have been gradually working towards semantic HTML as a storage and processing format for a while now. The parsing team recently updated their roadmap, and reinforced that they are planning to move towards a DOM-based processing model built around semantic HTML. Processing pipelines built around HTML will need efficient access to this HTML, for both current and past revisions. In this world, wikitext might move from the canonical storage format for our content to a textual user interface. There are many open questions still before we can make an informed decision about this direction, some of which are about large-scale revision storage under discussion here.

You are right that we finally need more clarity on our longer-term architectural direction and priorities. We have discussed several visions for a while now, but have not yet managed to agree on a clear direction. This discussion is an illustration of this issue. We need to resolve this soon, and I am optimistic that we will see movement on this in the near future.

Eevans renamed this task from RESTBase k-r-v as Cassandra anti-pattern (or: revision retention policies considered harmful) to RESTBase k-r-v as Cassandra anti-pattern. Dec 6 2016, 10:54 PM

Files that are not a part of a compaction can be ruled out as overlapping if they do not contain a partition, or if the maximum droppable tombstone in a file is greater.

@Eevans, I assume you mean "or if the minimum timestamp in a file is greater"?

The maximum droppable timestamp for a partition is the minimum timestamp of any file that contains the partition and is not participating in the current compaction (or of a memtable containing it). So both statements mean the same thing, depending on what you are calling a "file" here.

I'll tighten the wording.

You are right that we finally need more clarity on our longer-term architectural direction and priorities. We have discussed several visions for a while now, but have not yet managed to agree on a clear direction. This discussion is an illustration of this issue. We need to resolve this soon, and I am optimistic that we will see movement on this in the near future.

For the record, we discussed this a little bit at our regular services/ops weekly meeting. I think we are still in agreement that we need to have a wider conversation, across multiple teams, about the overall vision of our revision storage.

I am still of the opinion (and I don't think we agree, but it's not clear to me) that we should have this conversation and figure out the current and future use cases before we embark on the redesign of RESTBase. I think designing RESTBase as a caching layer for temporary renderings (i.e. a cache) vs. as an archive of all revisions in HTML (i.e. permanent storage, replacing External Store, like @GWicke mentioned) entails vastly different requirements in terms of storage reliability and performance. In other words, I think we should first agree on what problem we are trying to solve.

Just to give an example: ScyllaDB is a roughly 1-year-old product without a lot of traction yet; migrating our only authoritative, permanent copy of all revisions to such a product feels irresponsible to me, while it may be a worthy alternative for a cache we can wipe at any point in time. There are dozens of similar tradeoffs, the middle ground of which changes wildly depending on the intended use cases.

My recommendation and preference at this point would be to not make a decision about this until we have the wider, more strategic conversation about the future of our revision storage.

@Eevans @mobrovac I believe this task has served its purpose and can be closed?

We're not done referencing it, but I suppose that closing won't prevent us from continuing to do so. @mobrovac ?

Eevans lowered the priority of this task from Medium to Lowest. Jul 3 2018, 7:50 PM

Let's keep it open as a constant reminder (and for easy look-up). Moving it to the attic.

We're not using k-r-v any more, resolving.