(WARNING) WIP; Please stand-by...
As a sub-task of T120171, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner.
## Option 1: Table-per-query
This approach materializes views of results using distinct tables, each corresponding to a query.
### Queries
- The most current render of the most current revision (table: `current`)
- The most current render of a specific revision (table: `by_rev`)
- A specific render of a specific revision (table: `by_tid`)
### Algorithm
Data in the `current` table must be durable, but the contents of `by_rev` and `by_tid` can be ephemeral (should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in `current` has been superseded by something more recent. There are two ways of accomplishing this, either by a) copying the values on a read from `current`, or b) copying them on update, prior to replacing a value in `current`. Neither of these strategies are ideal.
For example, with non-VE use-cases, copy-on-read is problematic due to the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy //must// be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity, and latency. Copy-on-update over-commits by default, copying from `current` for every new render, regardless of the probability it will be edited, but happens asynchronously without impacting user requests, and can be done reliably. This proposal uses the //copy-on-update// approach.
```lang=python, name=Update logic pseudo-code
# Read current latest value from 'current' table
current = read_current()
# Copy the values using the correct sequence
for i in (section_offsets, data-parsoid, html):
for j in (by_tid, by_rev, current):
store_current(i, j, current)
# On success, write the new current to 'current'
store_new_current(latest)
```
### Option 1a
Precedence is first by revision, then by render; The current table must always return the latest render for the latest revision, even in the face of out-of-order writes. This presents a challenge for a table modeled as strictly key-value, since Cassandra is //last write wins//. As a work around, this option proposes to use a constant for write-time, effectively disabling the database's in-built conflict resolution. Since Cassandra falls back to a lexical comparison of values when encountering identical timestamps, a binary value encoded first with the revision, and then with a type-1 UUID is used to satisfy precedence requirements.
```lang=sql, name=Strawman Cassandra schema
-- value is binary encoded; rev (as 32-bit big-endian), tid (as 128-bit type-1 UUID), and content
CREATE TABLE current (
"_domain" text,
title text,
value blob,
PRIMARY KEY ("_domain", title)
);
CREATE TABLE by_rev (
"_domain" text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY (("_domain", title, rev))
);
CREATE TABLE by_tid (
"_domain" text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY (("_domain", title, rev, tid))
);
```
#### Issues/Drawbacks
- Breaks `DELETE` semantics (w/o timestamps, tombstones do not have precedence)
- Defeats a read optimization that can eliminate SSTables from reads (relies on timestamps)
- Defeats compaction optimization meant to eliminate overlaps for tombstone GC (relies on timestamps)
- An abuse of the "tie-breaker mechanism"
- Lexical value comparison only meant as a fall-back for a rare occurrence (accidentally identical timestamps)
- Lexical comparison not a part of the contract, could change in the future w/o warning (has changed w/o warning)
- Cassandra semantics explicitly //last write wins//; Violation of best-practice, and isolating
### Option 1b
Identical to the 1a proposal above, with the exception of how the `current` table is implemented; In this approach, `current` is modeled as "wide rows", utilizing a revision-based clustering key. For any given `rev`, re-renders result in the `tid` and `value` attributes being overwritten each time. To prevent unbounded grow of revisions, range deletes are batched with the `INSERT`.
```lang=sql, name=Strawman Cassandra schema
CREATE TABLE current (
"_domain" text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY (("_domain", title), rev)
);
-- Same as Option 1a above
CREATE TABLE by_rev (
"_domain" text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY (("_domain", title, rev))
);
CREATE TABLE by_tid (
"_domain" text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY (("_domain", title, rev, tid))
);
```
```lang=sql, name=Example: Batched INSERT+DELETE
BEGIN BATCH
INSERT INTO current ("_domain", title, rev, tid, value) VALUES (?, ?, 10000, ?, ?);
DELETE FROM current WHERE "_domain" = ? AND title = ? AND rev < 10000;
APPLY BATCH
```
NOTE: Range deletes with inequality matches (LT, GT, LTE, GTE) require Cassandra >= 3.0
NOTE: It is only necessary to batch a corresponding `DELETE` when creating a new revision; Re-renders can be preformed with an unbatched/standalone `INSERT`
NOTE: Batching `DELETE`s can be done probabilistically; Allowing a small number of revisions to accumulate to limit write amplification is likely a worthwhile optimization.
#### Issues/Drawbacks
- Creates a hard dependency on [[ https://phabricator.wikimedia.org/T160570 | Cassandra 3.x ]]
## Option 2: Retention policies using application-level TTLs
This approach uses a schema identical to that of the current storage model, one that utilizes wide rows to model a one-to-many relationship between a title and its revisions, and a one-to-many relationship between each revision and its corresponding renders. It differs only in how it approaches retention.
Since renders are keyed on a type-1 UUID, maintaining a single current render, and (at least) 24 hours worth of past renders, is as simple as batching a range delete on new renders using a `tid` predicate 24 hours less than the one being inserted.
Limiting renders is more challenging, since the revision is a monotonically increasing integer, without any temporal context.
```lang=sql, name=Strawman Cassandra schema
CREATE TABLE current (
"_domain" text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY (("_domain", title), rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC);
CREATE TABLE timeline (
domain text,
ts timestamp,
rev int,
ts timestamp,PRIMARY KEY(domain, ts)
) WITH CLUSTERING ORDER BY (rev DESC)
PRIMARY KEY(domain, rev)AND compaction = {'class': 'TimeWindowCompactionStrategy'}
); AND default_time_to_live = 2592000; -- 30 days(?)
```
____
## See also
- {T156209}