As a sub-task of T120171, this task discusses steps towards storing only current revisions, in a reliable, low-maintenance and low-latency manner.
## Option 1: Avoid tombstones with separate current revision & ttl tables
### Table schemas
```lang=sql
CREATE TABLE data (
    "_domain" text,
    title text,
    revision int,
    tid timeuuid,
    html text,
    data_parsoid text,
    section_offsets text,
    PRIMARY KEY ("_domain", title)
);

CREATE TABLE ttl (
    "_domain" text,
    time_window bigint,
    title text,
    revision int,
    tid timeuuid,
    html text,
    data_parsoid text,
    section_offsets text,
    PRIMARY KEY (("_domain", time_window, title, revision), tid)
);
```
### Algorithm
The latest content is always overwritten in the `data` table without a TTL. To protect against race conditions, the Cassandra write timestamp is set to the time of the edit, as determined by the `If-Modified-Since` header. Since the `If-Modified-Since` timestamp has a one-second resolution, we can numerically add the revision number to the write time to guard against sub-second race conditions.
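For illustration, a minimal sketch of such a write, with made-up values; since `USING TIMESTAMP` takes a literal, the edit-time-plus-revision sum would be computed client-side:

```lang=sql
-- Overwrite the current render. Write timestamp = edit time in µs
-- (1484000000000000) + revision (123456), computed by the client.
INSERT INTO data ("_domain", title, revision, tid,
                  html, data_parsoid, section_offsets)
VALUES ('en.wikipedia.org', 'Foobar', 123456, now(),
        '<p>new content</p>', '{}', '{}')
USING TIMESTAMP 1484000000123456;
```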
When an update arrives and a new render becomes the latest render, the following procedure is applied:
- Current latest content is copied to the `ttl` table with a TTL of 24 hours.
- The new render (revision) is written to the `data` table, overwriting the previous one.
This ensures that any ongoing edits that were using the previous content of the `data` table will succeed, because the content they depend on remains stored for another 24 hours. A sketch of both steps follows.
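As a rough illustration (all literal values, including the epoch-day `time_window` and the `timeuuid`, are made up for the example):

```lang=sql
-- Step 1: preserve the render being replaced, keeping it readable for
-- in-flight edits for 24 hours (86400s). Column values are the ones just
-- read from the `data` table; deriving time_window from the epoch day is
-- an assumption about how windows would be computed.
INSERT INTO ttl ("_domain", time_window, title, revision, tid,
                 html, data_parsoid, section_offsets)
VALUES ('en.wikipedia.org', 17167, 'Foobar', 123455,
        8b0a1c10-d5e2-11e6-9d46-8d7fd0cba2fa, '<p>old content</p>', '{}', '{}')
USING TTL 86400;

-- Step 2: overwrite the latest render in the `data` table (no TTL),
-- with the write timestamp derived as described above.
INSERT INTO data ("_domain", title, revision, tid,
                  html, data_parsoid, section_offsets)
VALUES ('en.wikipedia.org', 'Foobar', 123456, now(),
        '<p>new content</p>', '{}', '{}')
USING TIMESTAMP 1484000000123456;
```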
If the edit is made to an older revision, we check whether that revision is in the `ttl` table and potentially renew its TTL. If the older revision is not in storage, it is generated by Parsoid and stored in the `ttl` table, as sketched below.
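Since Cassandra TTLs are per-cell, "renewing" a TTL means reading the row back and re-writing it with a fresh TTL. A sketch with made-up values (this assumes the `time_window` of the original write can be re-derived, e.g. from the revision's render time):

```lang=sql
-- Look up the older revision's render.
SELECT tid, html, data_parsoid, section_offsets
FROM ttl
WHERE "_domain" = 'en.wikipedia.org'
  AND time_window = 17167
  AND title = 'Foobar'
  AND revision = 123455;

-- If present, write it back with a fresh 24-hour TTL; if not, re-render
-- with Parsoid and store the result the same way.
INSERT INTO ttl ("_domain", time_window, title, revision, tid,
                 html, data_parsoid, section_offsets)
VALUES ('en.wikipedia.org', 17167, 'Foobar', 123455,
        8b0a1c10-d5e2-11e6-9d46-8d7fd0cba2fa, '<p>old content</p>', '{}', '{}')
USING TTL 86400;
```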
### Implementation considerations
This could effectively be implemented as a revision-retention-policy in the scope of the `restbase-mod-table-cassandra` module. The policy could have 3 modes of operation:
- If `grace_ttl=0`, it works as key-value storage, always overwriting with the newer content. We can just create the `data` table and avoid creating the checkout table.
- If the policy is TTL-only, we just create the checkout (`ttl`) table.
- In mixed mode, we create both tables.
If all use-cases for a revision-retention-policy fit into these 3 options, we can completely remove the revision retention policy we have right now.
### Open questions
- Should HTML and data-parsoid be stored together or in separate tables? What are the performance implications of this? What is the complexity overhead of separating them?
- Should we just set the TTL globally on the `ttl` table?
### Performance considerations
#### The `data` table
This table amounts to simple key-value storage (there are no range queries in the conventional sense). A growing set of keys is perpetually updated (the values overwritten), at wildly varying frequencies. It will suffer from one of the problems described in {T144431}, namely that a significant percentage of the key set will have a tendency to work its way across the entire set of SSTables, making the all-critical SSTables-per-read metric equivalent to the total number of SSTables, and making the reclamation of overwritten data problematic. TL;DR: this is a workload that is at odds with log-structured storage.
However, given the relatively small size and slow growth of the live data, it seems likely that this use-case can be made tractable. One way would be to accept high compaction write amplification, and potentially a higher-than-ideal, but bounded, tombstone GC. For example, LCS configured with very large table sizes to limit the number of quiescent levels to 2 or 3 would bound SSTables-per-read and create a sustainable (if high) droppable-tombstone rate.
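Such a configuration might look as follows (illustrative only; the `sstable_size_in_mb` value is an assumption, not a tuned number):

```lang=sql
-- LCS with a very large per-sstable size target, so that the data set
-- fits within 2-3 levels, bounding SSTables-per-read.
ALTER TABLE data
WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'sstable_size_in_mb': 2048
};
```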
Another alternative: upgrading to Cassandra >= 3.2 would make [[ https://issues.apache.org/jira/browse/CASSANDRA-6696 | CASSANDRA-6696 ]] available to us. By dividing an instance's dataset over many compactors, locality could be improved regardless of compaction strategy.
More thorough testing (using less contrived workloads/data), in a more production-like environment is needed.
#### The `ttl` table
This table uses a clustering key to create a many-to-one relationship between `timeuuid`s (aka renders) and the partition key. Queries that supply a predicate for the `timeuuid` will be efficient regardless of the number of tombstones, but queries for the latest `timeuuid` have the potential to cause problems (similar to those of the existing k-r-v model). That is to say, for any given value size, there will exist a rate of re-renders (of a single revision) high enough that the live + tombstoned results become prohibitively large. Both query shapes are sketched below.
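For concreteness (values made up):

```lang=sql
-- Efficient regardless of tombstones: a point query for a known render.
SELECT html FROM ttl
WHERE "_domain" = 'en.wikipedia.org' AND time_window = 17167
  AND title = 'Foobar' AND revision = 123455
  AND tid = 8b0a1c10-d5e2-11e6-9d46-8d7fd0cba2fa;

-- Potentially problematic: selecting the latest render (no tid
-- predicate) may have to filter through whatever live and tombstoned
-- renders of this revision the partition has accumulated.
SELECT html FROM ttl
WHERE "_domain" = 'en.wikipedia.org' AND time_window = 17167
  AND title = 'Foobar' AND revision = 123455
ORDER BY tid DESC LIMIT 1;
```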
That said, for this to be a problem, the re-render rate would need to be high enough to accumulate a critical mass of tombstones within a period bounded by `time_window`, and/or the TTL period plus the time-to-GC. As this table is write-once and TTL-only, and will make use of TWCS, tombstone GC should be fairly deterministic. It seems likely that we can find a configuration safe from any real-world re-render rate we might see.
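The table properties this implies might look as follows (the window size and global TTL are assumptions; a global `default_time_to_live` also ties into the open question above):

```lang=sql
-- Write-once, TTL-only data compacted with TWCS, so that whole SSTables
-- expire together and tombstone GC stays reasonably deterministic.
ALTER TABLE ttl
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 24
}
AND default_time_to_live = 86400;
```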
NOTE: This is something else to consider when deciding whether to store all 3 values in the same table: the result size for each render would be larger, so fewer results would suffice to cause a problem, lowering the re-render rate we need to be concerned about.
See [this document](https://docs.google.com/document/d/1qd8XilG5Jt0TRm5mMEokCG6d0_DkyReCi85KKlh-i8c/edit#heading=h.m7ioe5euvcac).
## See also
- {T156209}