(WARNING) WIP; please stand by...
As a sub-task of {T120171}, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner.
## Option 1: Table-per-query
This approach materializes results as distinct tables, one per supported query.
### Queries
- The most current render of the most current revision (table: `current`)
- The most current render of a specific revision (table: `by_rev`)
- A specific render of a specific revision (table: `by_tid`)
NOTE: The `by_rev` and `by_tid` tables exist only to support VE concurrency; they need to retain superseded versions only for a reasonable period after those versions have been replaced.
### Algorithm
Data in the `current` table must be durable, but the contents of `by_rev` and `by_tid` can be ephemeral (and should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in `current` has been superseded by something more recent. There are two ways of accomplishing this: a) copying the values on a read from `current`, or b) copying them on update, just before replacing a value in `current`. Neither strategy is ideal.
Copy-on-read is problematic even for non-VE use-cases because of the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy //must// be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity and latency. Copy-on-update over-commits by default, copying from `current` for every new render regardless of the probability it will be edited, but it happens asynchronously without impacting user requests, and can be done reliably. This proposal therefore uses the //copy-on-update// approach:
# Value is read from `current` table
# Value is copied to `by_tid` table
# Value is copied to `by_rev` table
# If steps 2 and 3 succeed, the new value is written to `current`
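The copy-on-update sequence above can be sketched with an in-memory model (the dictionaries and names below are illustrative stand-ins for the Cassandra tables, not actual RESTBase code):

```lang=python
# Illustrative in-memory model of the copy-on-update sequence.
# The three dicts stand in for the Cassandra tables described above.
current = {}   # (domain, title) -> (rev, tid, content)
by_rev = {}    # (domain, title, rev) -> (tid, content)
by_tid = {}    # (domain, title, rev, tid) -> content

def update(domain, title, rev, tid, content):
    key = (domain, title)
    old = current.get(key)  # step 1: read the soon-to-be-superseded value
    if old is not None:
        old_rev, old_tid, old_content = old
        # Steps 2-3: copy the superseded value to the ephemeral tables
        # (in Cassandra these writes would carry a TTL).
        by_tid[(domain, title, old_rev, old_tid)] = old_content
        by_rev[(domain, title, old_rev)] = (old_tid, old_content)
    # Step 4: only once the copies succeed is `current` replaced.
    current[key] = (rev, tid, content)

update("en.wikipedia.org", "Foo", 1, "tid-a", "<html>v1</html>")
update("en.wikipedia.org", "Foo", 2, "tid-b", "<html>v2</html>")
```

After the second call, `current` holds only revision 2, while revision 1 survives (temporarily) in `by_rev` and `by_tid`.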
### Option 1a
Precedence is first by revision, then by render; the `current` table must always return the latest render for the latest revision, even in the face of out-of-order writes. This presents a challenge for a table modeled as strictly key-value, since Cassandra is //last write wins//. As a workaround, this option proposes using a constant write-time, effectively disabling the database's built-in conflict resolution. Since Cassandra falls back to a lexical comparison of values when it encounters identical timestamps, a binary value encoded first with the revision, and then with a type-1 UUID, is used to satisfy the precedence requirements.
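A minimal sketch of the idea (`encode_value` is a hypothetical helper, not the actual implementation). Because the revision occupies the leading bytes in big-endian form, a plain byte-wise comparison (which is what Cassandra falls back to on identical write timestamps) orders values by revision first, with the tid bytes acting as a deterministic tiebreak:

```lang=python
import struct
import uuid

def encode_value(rev: int, tid: uuid.UUID, content: bytes) -> bytes:
    # rev as 32-bit big-endian, then the 16 tid bytes, then the content.
    # Big-endian ensures byte-wise (lexical) order matches numeric order.
    return struct.pack(">I", rev) + tid.bytes + content

older = encode_value(41, uuid.uuid1(), b"older render")
newer = encode_value(42, uuid.uuid1(), b"newer render")

# However the replicas receive these, resolving the conflict lexically
# (keeping the greater of the two blobs) always retains the higher rev.
assert max(older, newer) == newer
```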
#### Strawman Cassandra schemas
```lang=sql
-- value is binary encoded: rev (as 32-bit big-endian), tid (as 128-bit type-1 UUID), and content
CREATE TABLE current (
    "_domain" text,
    title text,
    value blob,
    PRIMARY KEY (("_domain", title))
);

CREATE TABLE by_rev (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev))
);

CREATE TABLE by_tid (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev, tid))
);
```
### Option 1b
This option is identical to 1a above, with the exception of how the `current` table is implemented.
#### Strawman Cassandra schemas
```lang=sql
CREATE TABLE current (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title), rev)
-- store revisions in descending order, so the first row is the most current
) WITH CLUSTERING ORDER BY (rev DESC);

-- Same as Option 1a above
CREATE TABLE by_rev (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev))
);

CREATE TABLE by_tid (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev, tid))
);
```
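For illustration, reading the most current value under 1b amounts to selecting the highest `rev` within the partition (e.g. `SELECT ... LIMIT 1` with descending clustering order on `rev`). A rough in-memory model of that read path (hypothetical names, not actual code):

```lang=python
# Hypothetical in-memory model of Option 1b's `current` table:
# one partition per (domain, title), rows keyed by rev.
from collections import defaultdict

current_1b = defaultdict(dict)  # (domain, title) -> {rev: (tid, content)}

def write(domain, title, rev, tid, content):
    current_1b[(domain, title)][rev] = (tid, content)

def read_latest(domain, title):
    # Equivalent to: SELECT rev, tid, value FROM current
    #                WHERE "_domain" = ? AND title = ?
    #                ORDER BY rev DESC LIMIT 1
    rows = current_1b[(domain, title)]
    rev = max(rows)
    tid, content = rows[rev]
    return rev, tid, content

# Out-of-order writes: rev 2 arrives before rev 1, yet the read
# still returns the highest revision.
write("en.wikipedia.org", "Foo", 2, "tid-b", "v2")
write("en.wikipedia.org", "Foo", 1, "tid-a", "v1")
assert read_latest("en.wikipedia.org", "Foo")[0] == 2
```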
## Option 2: Retention policies using application-level TTLs
!!TODO!!
----
### Performance considerations
#### The `data` table
This table amounts to simple key-value storage (there are no range queries in the conventional sense); a growing set of keys is perpetually updated (the values overwritten), at wildly varying frequencies. It will suffer from one of the problems described in {T144431}: a significant percentage of the key set will tend to work its way across the entire set of SSTables, making the all-critical SSTables-per-read metric equivalent to the total number of SSTables, and making the reclamation of overwritten data problematic. TL;DR: this is a workload that is at odds with log-structured storage.
However, given the relatively small size and slow growth of the live data, it seems likely that this use-case can be made tractable. One way would be to accept high compaction write amplification, and a potentially higher-than-ideal, but bounded, tombstone GC. For example, LCS configured with very large table sizes to limit the number of quiescent levels to 2 or 3 would bound SSTables-per-read and create a sustainable (if high) droppable-tombstone rate.
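To put rough numbers behind the claim that very large LCS table sizes bound the level count: with a fan-out of 10, the number of quiescent levels grows with the base-10 log of the SSTable count, and a key present in every level costs at most that many SSTables per read. The figures below are illustrative, not measured:

```lang=python
import math

def lcs_levels(dataset_bytes, sstable_bytes, fanout=10):
    # Rough approximation: each LCS level holds ~fanout times the
    # previous one, so levels ~= log_fanout(total SSTable count).
    sstables = max(1, math.ceil(dataset_bytes / sstable_bytes))
    return max(1, math.ceil(math.log(sstables, fanout)))

GiB = 1 << 30
dataset = 1024 * GiB  # hypothetical 1 TiB per instance

# Default-ish 160 MiB tables vs. very large 50 GiB tables:
default_levels = lcs_levels(dataset, 160 * (1 << 20))  # -> 4 levels
large_levels = lcs_levels(dataset, 50 * GiB)           # -> 2 levels
assert large_levels <= 3 and large_levels < default_levels
```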
Another alternative: upgrading to Cassandra >= 3.2 would make [[ https://issues.apache.org/jira/browse/CASSANDRA-6696 | CASSANDRA-6696 ]] available to us. By dividing an instance's dataset over many compactors, locality could be improved regardless of compaction strategy.
More thorough testing (using less contrived workloads/data), in a more production-like environment, is needed.
#### The `ttl` table
This table uses a clustering key to create a many-to-one relationship between `timeuuid`s (i.e. renders) and the partition key. Queries that supply a predicate on the `timeuuid` will be efficient regardless of the number of tombstones, but queries for the latest `timeuuid` have the potential to cause problems (similar to those of the existing k-r-v model). That is to say: for any given value size, there will exist a rate of re-renders (of a single revision) high enough that the live + tombstoned results become prohibitively large.
That said, for this to be a problem, the re-render rate would need to be high enough to accumulate a critical mass of tombstones within a period bounded by `time_window`, and/or the TTL period plus the time-to-GC. As this table is write-once, TTL-only, and will make use of TWCS, tombstone GC should be fairly deterministic. It seems likely that we can find a configuration safe from any real-world re-render rate we might see.
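As a rough sanity check (the rates and windows below are invented for illustration): the number of live-plus-tombstoned cells a latest-`timeuuid` query must scan is approximately the re-render rate multiplied by the window during which superseded renders have not yet been GCed:

```lang=python
def cells_to_scan(rerenders_per_day, ttl_days, gc_days):
    # Superseded renders linger for roughly the TTL plus the time-to-GC
    # before their tombstones become purgeable.
    return rerenders_per_day * (ttl_days + gc_days)

# Ten re-renders/day with a 1-day TTL and ~2 days to GC leaves ~30
# cells in the partition: harmless.
typical = cells_to_scan(10, 1, 2)

# Only a pathological rate pushes the scan into problem territory.
pathological = cells_to_scan(100_000, 1, 2)
assert typical == 30 and pathological == 300_000
```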
See [[ https://docs.google.com/document/d/1qd8XilG5Jt0TRm5mMEokCG6d0_DkyReCi85KKlh-i8c/edit#heading=h.m7ioe5euvcac | this document ]].
----
## See also
- {T156209}