Beyond the explicitly stated objectives for RESTBase has been a mandate to store (more or less) complete history of transformations in order to enable future alternative use-cases. Since storage is always bounded, this has created a number of challenges that have so far been met with efforts such as:
* Cluster expansions ([[ https://phabricator.wikimedia.org/T93790 | one executed ]], [[ https://phabricator.wikimedia.org/T139961 | one planned ]])
* [[ https://phabricator.wikimedia.org/T125904 | Compression ]]
* [[ https://phabricator.wikimedia.org/T140008 | Manual culling ]]
* Automatic culling (aka revision retention policies)
== The RESTBase data-model ==
=== Cassandra 101 ===
Cassandra is fundamentally a key-value store; Keys dictate placement in the cluster, or distribution, and are typically referred to as //partitions//. Values are sorted arrays of columns, that are indexed, and can be queried (with some limitations).
[ key ]( 1 )( 2 ) ... ( N )
Additionally, there are abstractions (closely coupled to CQL), that allow you to map a tabular structure onto what would otherwise be a flat sorted array of columns. Consider the following example:
-- Table to store tracks by-artist and album
CREATE TABLE music (
PRIMARY KEY(artist, album, track)
The schema above has 3 attributes, `artist`, `album`, and `track`. `artist` is the first element of the `PRIMARY KEY`, which makes it the partition key, and each subsequent attribute establishes a many-to-one relationship to the value preceding it (so there can be many `album`s for each `artist`, and many `track`s for each `album`).
`INSERT`ing and `SELECT`ing data works just as you would expect...
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'As I Am');
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'This Dying Soul');
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Endless Sacrifice');
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Honor Thy Father');
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Vacant');
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Stream of Consciousness');
cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'In the Name of God');
cqlsh:db> SELECT * FROM music WHERE artist = 'Dream Theater';
artist | album | track
Dream Theater | Train of Thought | As I Am
Dream Theater | Train of Thought | Endless Sacrifice
Dream Theater | Train of Thought | Honor Thy Father
Dream Theater | Train of Thought | In the Name of God
Dream Theater | Train of Thought | Stream of Consciousness
Dream Theater | Train of Thought | This Dying Soul
Dream Theater | Train of Thought | Vacant
...but it's important to realize that on-disk, it is still an array of column values, something like:
[ Dream Theater ]( Train of Thought,As I Am )( Train of Thought,Endless Sacrifice )( ... )( Train of Thought,Vacant )
=== RESTBase 101 ===
Now consider a (simplified) version of RESTBase's key-rev-value data-model:
CREATE TABLE data (
PRIMARY KEY (key, rev, tid)
In the schema above, `key` is the document title, `rev` the monotonically increasing revision (comes from mediawiki), `tid` a time-based UUID generated at the time of save, and `value` is the content being stored (parsoid generated html, for example). An insertion happens for each new mediawiki revision, but also each time the content needs to be rerendered (for example if a template edit would result in different parsoid output).
The latter of these, //revision retention policies//, work by performing an asynchronous query of past records on each new write, and applying the policy to results. Records that are excluded by policy are overwritten with a TTL (to be GC'd when the TTL expires).
The GC of records with a TTL in Cassandra is something that happens as the result of compaction. One issue with this process is the so-called //[[ http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html#compactions |overlapping tables problem]]//. Compactions typically only involve a subset of all on-disk files, and TTL-containing records (or tombstones) can only be GC'd when the tables under consideration represent the whole picture; You cannot drop otherwise expired data if there is a possibility that older (preceding) values exist in other files (tables //not// under consideration of the current compaction).
The RESTBase key-rev-value data model, combined with revision retention policies, create a kind of pathalogical worst-case for the problem of overlapping tables, and the GC of tombstoned data.
First, the history of each document is represented by a single partition, which means that the more history we have (and/or the greater the frequency of change), the more likely it is that it will appear across more files on disk (eventually all of them). Second, the use of revision retention policies ensure that tombstoned data is distributed throughout the document history.
Consider the following: `local_group_wikipedia_T_parsoid_html.data` on restbase1015-a.eqiad.wmnet had a droppable tombstone ratio of 90%(!!). After completion of a major compaction (where the expectation is that all files are compacted into one), the droppable tombstone ratio was still 81%. Upon further investigation, it was discovered that exactly one file contained a `repairedAt` timestamp, the result of experiments with incremental repair dating back to March of 2015. This meant that the previous compaction had actually resulted in two files, one for all repaired tables (of which there was already only one), and one for unrepaired tables (everything else). After removing the `repairedAt` field from the file and re-running the major compaction, the droppable tombstone ratio was less than 3%. //The exclusion of a single file from compaction was enough to cause the difference between an 81% ratio, to 3%.// Completion of this second major compaction created a corresponding decrease in storage from 507G to 225G, //a reduction of more than half//.
In addition to nominally doubling required storage, this "dead" data must be considered when merging for a read, so it impacts latency as well.
I believe that, for all intents and purposes, RESTBase's use of Cassandra could be considered an anti-pattern, and I see 3 (broad) possible courses of action:
# Rethink RESTBase's data model and access
# Invest effort into solutions to make Cassandra better accomodate RESTBase's data model/access
# Identify alternatives to Cassandra