(WARNING) WIP; Please stand by...
As a sub-task of T120171, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner.
## Option 1: Table-per-query
This approach materializes views of results using distinct tables, each corresponding to a query.
### Queries
- The most current render of the most current revision (table: `current`)
- The most current render of a specific revision (table: `by_rev`)
- A specific render of a specific revision (table: `by_tid`)
NOTE: The latter two tables are only necessary to support VE concurrency; `by_rev` and `by_tid` need only retain historical versions for a reasonable period after they have been superseded (the corresponding reads are sketched below).
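As an illustration only, the reads corresponding to these three access patterns might look as follows in CQL, assuming the Option 1a strawman schemas defined later in this task and entirely hypothetical example values:
```lang=sql
-- Sketch only; table layouts are the Option 1a strawman schemas defined later in this task.
-- The most current render of the most current revision:
SELECT value FROM current
 WHERE "_domain" = 'en.wikipedia.org' AND title = 'Foo';

-- The most current render of a specific revision:
SELECT tid, value FROM by_rev
 WHERE "_domain" = 'en.wikipedia.org' AND title = 'Foo' AND rev = 12345;

-- A specific render of a specific revision:
SELECT value FROM by_tid
 WHERE "_domain" = 'en.wikipedia.org' AND title = 'Foo' AND rev = 12345
   AND tid = 5bfe2cd0-56f5-11e7-919e-0242ac110002;
```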
### Algorithm
Data in the `current` table must be durable, but the contents of `by_rev` and `by_tid` can be ephemeral (should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in `current` has been superseded by something more recent. There are two ways of accomplishing this: either by a) copying the values on a read from `current`, or b) copying them on update, prior to replacing a value in `current`. Neither of these strategies is ideal.
For non-VE use-cases, copy-on-read is problematic due to the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy //must// be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity and latency. Copy-on-update over-commits by default, copying from `current` for every new render regardless of the probability it will be edited, but it happens asynchronously without impacting user requests, and can be done reliably. This proposal uses the copy-on-update approach.
The latest content is always overwritten in the `current` table without a TTL. To protect against race conditions, the Cassandra write timestamp is set to the time of the edit, as determined by the `If-Modified-Since` header. Since the `If-Modified-Since` timestamp has a one-second resolution, we can numerically add the revision number to the write time to guard against sub-second race conditions.
When an update arrives and a new render becomes the latest render, the following procedure is applied:
- The current latest content is copied to the `by_rev` and `by_tid` tables, and is written with a TTL.
- Additionally, the new value is written to the `current` table, overwriting the previous one.
This ensures that any ongoing edits that were using the previous content of the `current` table will succeed, because the content they depend on is stored for at least the period of the TTL assigned.
If an edit of an older revision is made, we check the `by_tid` table to ensure we have that revision (and if necessary renew its TTL). If the older revision is not in storage, it is generated by Parsoid and stored in the `by_tid` table.
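Purely as an illustrative sketch of the copy-on-update step (hypothetical domain, title, revision, render UUID, value, and a 24-hour TTL), the render being superseded is copied into the ephemeral tables with a TTL; how the new value is then written to `current` differs between Options 1a and 1b below:
```lang=sql
-- Sketch only: retain the superseded render for 24 hours (86400s) after it stops being current.
INSERT INTO by_rev ("_domain", title, rev, tid, value)
VALUES ('en.wikipedia.org', 'Foo', 12345, 5bfe2cd0-56f5-11e7-919e-0242ac110002, 0xcafebabe)
USING TTL 86400;

INSERT INTO by_tid ("_domain", title, rev, tid, value)
VALUES ('en.wikipedia.org', 'Foo', 12345, 5bfe2cd0-56f5-11e7-919e-0242ac110002, 0xcafebabe)
USING TTL 86400;
```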
### Option 1a
Precedence is first by revision, then by render; the `current` table must always return the latest render of the latest revision, even in the face of out-of-order writes. This presents a challenge for a table modeled as strictly key-value, since Cassandra is //last write wins//. As a workaround, this option proposes to use a constant write-time, effectively disabling the database's built-in conflict resolution. Since Cassandra falls back to a lexical comparison of values when it encounters identical timestamps, a binary value encoded first with the revision and then with a type-1 UUID satisfies the precedence requirements (see the write sketch after the schemas below).
#### Strawman Cassandra schemas
```lang=sql
-- value is binary encoded; rev (as 32-bit big-endian), tid (as 128-bit type-1 UUID), and content
CREATE TABLE current (
    "_domain" text,
    title text,
    value blob,
    PRIMARY KEY ("_domain", title)
);

CREATE TABLE by_rev (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev))
);

CREATE TABLE by_tid (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev, tid))
);
```
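Purely as an illustration of the mechanism described above (hypothetical domain, title, revision, render UUID, and content), an Option 1a write to `current` might look like the following: the write-time is pinned to a constant, and the value is the concatenation of the big-endian revision, the type-1 UUID, and the content, so that Cassandra's lexical comparison of values on identical timestamps yields the required precedence.
```lang=sql
-- Sketch only. Value layout: rev 12345 = 0x00003039 (32-bit big-endian),
-- then tid 5bfe2cd0-56f5-11e7-919e-0242ac110002 (16 bytes), then the content (0xcafebabe here).
-- The constant timestamp disables last-write-wins; ties are resolved by comparing the blobs.
INSERT INTO current ("_domain", title, value)
VALUES ('en.wikipedia.org', 'Foo', 0x000030395bfe2cd056f511e7919e0242ac110002cafebabe)
USING TIMESTAMP 0;
```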
### Option 1b
#### Strawman Cassandra schemas
```lang=sql
CREATE TABLE current (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title), rev)
);

-- Same as Option 1a above
CREATE TABLE by_rev (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev))
);

CREATE TABLE by_tid (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev, tid))
);
```
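Option 1b is not described in prose here; based on the schema above, the difference from Option 1a appears to be that `rev` and `tid` are stored as regular columns, with `rev` as a clustering key, instead of being encoded into the value. Purely as a hedged illustration (hypothetical values; the timestamp arithmetic follows the edit-time-plus-revision scheme from the Algorithm section), a write and a latest-render read might look like:
```lang=sql
-- Sketch only. Write timestamp = If-Modified-Since in microseconds (here 1497952800000000)
-- plus the revision number (12345), per the sub-second race-condition guard described above.
INSERT INTO current ("_domain", title, rev, tid, value)
VALUES ('en.wikipedia.org', 'Foo', 12345, 5bfe2cd0-56f5-11e7-919e-0242ac110002, 0xcafebabe)
USING TIMESTAMP 1497952800012345;

-- The latest render of the latest revision is the highest rev within the partition.
SELECT rev, tid, value FROM current
 WHERE "_domain" = 'en.wikipedia.org' AND title = 'Foo'
 ORDER BY rev DESC LIMIT 1;
```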
## Option 2: Retention policies using application-level TTLs
!!TODO!!
----
### Performance considerations
#### The `data` table
This table amounts to simple key-value storage (there are no range queries in the conventional sense); a growing set of keys is perpetually updated (the values overwritten) at wildly varying frequencies. It will suffer from one of the problems described in {T144431}: a significant percentage of the key set will tend to work its way across the entire set of SSTables, making the all-critical SSTables/read metric equivalent to the total number of SSTables, and making the reclamation of overwritten data problematic. TL;DR: this is a workload that is at odds with log-structured storage.
However, given the relatively small size and slow growth of the live data, it seems likely that this use-case can be made tractable. One way would be to accept high compaction write amplification, and potentially a higher-than-ideal, but bounded, tombstone GC. For example, LCS configured with very large table sizes to limit the number of quiescent levels to 2 or 3 would bound SSTables/read and create a sustainable (if high) droppable-tombstone rate, as sketched below.
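As a purely illustrative sketch of that idea (the table name `data` and the 2 GB SSTable target are assumptions, not values from this task):
```lang=sql
-- Sketch only: LCS with an unusually large SSTable size, so the data set fits in
-- two or three levels, bounding SSTables/read at the cost of compaction write amplification.
ALTER TABLE data
  WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'sstable_size_in_mb': '2048'
  };
```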
Another alternative: Upgrading to Cassandra >= 3.2 would make [[ https://issues.apache.org/jira/browse/CASSANDRA-6696 | CASSANDRA-6696 ]] available to us. By dividing an instance's dataset over many compactors, locality could be improved regardless of compaction strategy.
More thorough testing (using less contrived workloads/data) in a more production-like environment is needed.
#### The `ttl` table
This table uses a clustering key to create a many-to-one relationship between `timeuuid`s (aka renders) and the partition key. Queries that supply a predicate for the `timeuuid` will be efficient regardless of the number of tombstones, but queries for the latest `timeuuid` have the potential to cause problems (similar to those of the existing k-r-v model). That is to say, for any given value size, there will exist a rate of re-renders (of a single revision) high enough that the live plus tombstoned results will be prohibitively large.
That said, for this to become a problem, the re-render rate would need to be high enough to accumulate a critical mass of tombstones within a period bounded by `time_window`, and/or the TTL period plus the time-to-GC. As this table is write-once and TTL-only, and will make use of TWCS, tombstone GC should be fairly deterministic. It seems likely that we can find a configuration safe from any real-world re-render rate we might see (a hypothetical sketch follows).
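Since the `ttl` table is not defined elsewhere in this task, the following schema is purely a hypothetical illustration of the shape described above: a clustered `timeuuid` per render under a partition key, TTL-only writes, and TWCS with a window sized so that expired data compacts away deterministically (all names, the one-day window, and the 24-hour default TTL are assumptions):
```lang=sql
-- Hypothetical sketch only; not a schema from this task.
CREATE TABLE ttl (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    PRIMARY KEY (("_domain", title, rev), tid)
) WITH default_time_to_live = 86400
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
  };
```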
See [[ https://docs.google.com/document/d/1qd8XilG5Jt0TRm5mMEokCG6d0_DkyReCi85KKlh-i8c/edit#heading=h.m7ioe5euvcac | this document ]].
----
## See also
- {T156209}