(WARNING) WIP; Please stand by...
As a sub-task of T120171, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner.
## Option 1: Table-per-query
This approach materializes views of results using distinct tables, each corresponding to a query.
### Queries
- The most current render of the most current revision (table: `current`)
- The most current render of a specific revision (table: `by_rev`)
- A specific render of a specific revision (table: `by_tid`)
### Algorithm
The latest content is always written to the `current` table without a TTL, overwriting any previous value. To protect against race conditions, the Cassandra write timestamp is set to the time of the edit, as conveyed in the `If-Modified-Since` header. Since the `If-Modified-Since` timestamp has only one-second resolution, we numerically add the revision number to the (microsecond-resolution) write timestamp to guard against sub-second race conditions.
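For illustration, a minimal CQL sketch of such a write against the `current` table sketched under Option 1a below (all values are made up; CQL requires the composed timestamp as a literal, i.e. `edit_time_us + rev` is computed by the caller):

```lang=sql
-- Assuming a (hypothetical) edit at 2016-06-28T00:00:00Z (1467072000 seconds,
-- i.e. 1467072000000000 microseconds) producing revision 680123:
-- write timestamp = 1467072000000000 + 680123.
INSERT INTO current ("_domain", title, value)
VALUES ('en.wikipedia.org', 'Foobar', 0xcafebabe)
USING TIMESTAMP 1467072000680123;
```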
When an update arrives and a new render becomes the latest render, the following procedure is applied:
- The current latest content is copied to the `by_rev` and `by_tid` tables, where it is written with a TTL.
- Additionally, the new value is written to the `current` table, overwriting the previous one.
This ensures that any ongoing edits that were using the previous content of the `current` table will succeed, because the content they depend on remains stored for the duration of the assigned TTL.
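A rough CQL sketch of the procedure (illustrative values throughout; the 86400-second TTL is an assumption, not a decided retention period):

```lang=sql
-- 1) Demote the outgoing latest render into the TTL'd tables.
INSERT INTO by_rev ("_domain", title, rev, tid, value)
VALUES ('en.wikipedia.org', 'Foobar', 680123,
        d2177dd0-eaa2-11de-a572-001b779c76e3, 0xcafebabe)
USING TTL 86400;

INSERT INTO by_tid ("_domain", title, rev, tid, value)
VALUES ('en.wikipedia.org', 'Foobar', 680123,
        d2177dd0-eaa2-11de-a572-001b779c76e3, 0xcafebabe)
USING TTL 86400;

-- 2) Overwrite `current` with the new render, without a TTL
--    (write-timestamped as described above).
INSERT INTO current ("_domain", title, value)
VALUES ('en.wikipedia.org', 'Foobar', 0xdeadbeef)
USING TIMESTAMP 1467072060680124;
```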
If an edit is made to an older revision, we check the `by_tid` table to ensure we have that revision (renewing its TTL if necessary). If the older revision is not in storage, it is generated by Parsoid and stored in the `by_tid` table.
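For the older-revision path, the check-and-renew might look something like this (a sketch only; the read-then-write is not atomic, and the TTL is again an assumed value):

```lang=sql
-- Is the specific render already stored?
SELECT tid FROM by_tid
 WHERE "_domain" = 'en.wikipedia.org' AND title = 'Foobar'
   AND rev = 600000 AND tid = d2177dd0-eaa2-11de-a572-001b779c76e3;

-- If present, renew the TTL by re-writing the row; if absent, store the
-- freshly generated Parsoid render the same way.
INSERT INTO by_tid ("_domain", title, rev, tid, value)
VALUES ('en.wikipedia.org', 'Foobar', 600000,
        d2177dd0-eaa2-11de-a572-001b779c76e3, 0xcafebabe)
USING TTL 86400;
```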
### Option 1a
#### Strawman Cassandra schemas
```lang=sql
-- The most current render of the most current revision. value is binary
-- encoded: rev (as 32-bit big-endian), tid (as a 128-bit type-1 UUID), and
-- the content.
CREATE TABLE current (
    "_domain" text,
    title text,
    value blob,
    PRIMARY KEY (("_domain", title))
);

-- The most current render of a specific revision.
CREATE TABLE by_rev (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev))
);

-- A specific render of a specific revision.
CREATE TABLE by_tid (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev, tid))
);
```
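With these schemas, each of the three queries above maps to a single-partition read, e.g.:

```lang=sql
-- The most current render of the most current revision:
SELECT value FROM current
 WHERE "_domain" = 'en.wikipedia.org' AND title = 'Foobar';

-- The most current render of a specific revision:
SELECT tid, value FROM by_rev
 WHERE "_domain" = 'en.wikipedia.org' AND title = 'Foobar' AND rev = 680123;

-- A specific render of a specific revision:
SELECT value FROM by_tid
 WHERE "_domain" = 'en.wikipedia.org' AND title = 'Foobar'
   AND rev = 680123 AND tid = d2177dd0-eaa2-11de-a572-001b779c76e3;
```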
### Option 1b
#### Strawman Cassandra schemas
```lang=sql
CREATE TABLE current (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title), rev)
);

-- Same as Option 1a above
CREATE TABLE by_rev (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev))
);

CREATE TABLE by_tid (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev, tid))
);
```
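The practical difference from Option 1a is the clustered `current` table: the latest render is read with a reversed slice over the `rev` clustering key, and (while still retained) a specific revision can be served from the same table. A sketch:

```lang=sql
-- The most current render of the most current revision (reversed slice):
SELECT rev, tid, value FROM current
 WHERE "_domain" = 'en.wikipedia.org' AND title = 'Foobar'
 ORDER BY rev DESC LIMIT 1;

-- The render of a specific revision, if still present in `current`:
SELECT tid, value FROM current
 WHERE "_domain" = 'en.wikipedia.org' AND title = 'Foobar' AND rev = 680123;
```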
## Option 2: Retention policies using application-level TTLs
!!TODO!!
----
### Performance considerations
#### The `data` table
This table amounts to simple key-value storage (there are no range queries in the conventional sense); a growing set of keys is perpetually updated (the values overwritten) at wildly varying frequencies. It will suffer from one of the problems described in {T144431}: a significant percentage of the key set will tend to work its way across the entire set of SSTables, making the all-critical SSTables/read metric equivalent to the total number of SSTables, and making the reclamation of overwritten data problematic. TL;DR: this is a workload that is at odds with log-structured storage.
However, given the relatively small size and slow growth of the live data, it seems likely that this use-case can be made tractable. One way would be to accept high compaction write amplification, and potentially a higher-than-ideal, but bounded, tombstone GC. For example, LCS configured with very large SSTable sizes to limit the number of quiescent levels to 2 or 3 would bound SSTables/read and create a sustainable (if high) droppable-tombstone rate.
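As an illustrative sketch only (the 2048MB size is an arbitrary placeholder; real sizes would need testing):

```lang=sql
-- LCS with an oversized sstable_size_in_mb, so the data fits in 2-3 levels,
-- bounding SSTables/read at the cost of compaction write amplification.
ALTER TABLE data WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'sstable_size_in_mb': '2048'
};
```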
Another alternative: upgrading to Cassandra >= 3.2 would make [[ https://issues.apache.org/jira/browse/CASSANDRA-6696 | CASSANDRA-6696 ]] available to us. By dividing an instance's dataset over many compactors, locality could be improved regardless of compaction strategy.
More thorough testing (using less contrived workloads/data) in a more production-like environment is needed.
#### The `ttl` table
This table uses a clustering key to create a many-to-one relationship between a `timeuuid` (aka a render) and the partition key. Queries that supply a predicate for the `timeuuid` will be efficient regardless of the number of tombstones, but queries for the latest `timeuuid` have the potential to cause problems (similar to those of the existing k-r-v model). That is to say, for any given value size, there will exist a rate of re-renders (of a single revision) high enough that the live + tombstoned results will be prohibitively large.
That said, for this to be a problem, the re-render rate would need to be high enough to accumulate a critical mass of tombstones within a period bound by `time_window`, and/or the TTL period + the time-to-GC. As this table is write-once, TTL-only, and will make use of TWCS, tombstone GC should be fairly deterministic. It seems likely that we can find a configuration safe from any Real World re-render rate we might see.
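As a sketch (the 1-day window is a placeholder, not a tested value):

```lang=sql
-- TWCS with 1-day windows; with write-once, TTL-only data, SSTables within a
-- window expire together, keeping tombstone GC fairly deterministic.
ALTER TABLE ttl WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
};
```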
See [[ https://docs.google.com/document/d/1qd8XilG5Jt0TRm5mMEokCG6d0_DkyReCi85KKlh-i8c/edit#heading=h.m7ioe5euvcac | this document ]].
----
## See also
- {T156209}