As a sub-task of T120171, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner.
## Option 1: Table-per-query
This approach materializes views of the results in distinct tables, one table per query (a routing sketch follows the list below).
### Queries
- The most current render of the most current revision (table: `current`)
- The most current render of a specific revision (table: `by_rev`)
- A specific render of a specific revision (table: `by_tid`)
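To make the query-to-table mapping concrete, here is a minimal sketch of how a read might be routed; the helper name and argument names are assumptions, not part of the proposal:
```lang=python, name=Read routing sketch (hypothetical)
def table_for_read(rev=None, tid=None):
    # A specific render of a specific revision
    if tid is not None:
        return 'by_tid'
    # The most current render of a specific revision
    if rev is not None:
        return 'by_rev'
    # The most current render of the most current revision
    return 'current'
```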
### Algorithm
Data in the `current` table must be durable, but the contents of `by_rev` and `by_tid` can be ephemeral (and should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in `current` has been superseded by something more recent. There are two ways of accomplishing this: either a) copying the values on a read from `current`, or b) copying them on update, prior to replacing a value in `current`. Neither strategy is ideal.
For example, with non-VE use-cases, copy-on-read is problematic due to the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy //must// be done in-line to ensure the values are there for the forthcoming save, introducing additional transactional complexity and latency. Copy-on-update over-commits by default, copying from `current` for every new render regardless of the probability it will be edited, but it happens asynchronously without impacting user requests, and can be done reliably. This proposal therefore uses the //copy-on-update// approach.
```lang=python, name=Update logic pseudo-code
# Pseudo-code storage interface: read(key, table) -> value; write(value, key, table)

# Read the current latest values from the 'current' table
latest_html = read('html', 'current')
latest_data_parsoid = read('data-parsoid', 'current')
latest_section_offsets = read('section_offsets', 'current')

# For each content type: copy both the current latest and the new/proposed
# latest values to the TTL-managed tables.  Writing the *new* value there as
# well closes the race described in the note below.
for key, latest, new_latest in (
        ('section_offsets', latest_section_offsets, new_latest_section_offsets),
        ('data-parsoid', latest_data_parsoid, new_latest_data_parsoid),
        ('html', latest_html, new_latest_html)):
    write(new_latest, key, 'by_tid')
    write(new_latest, key, 'by_rev')
    write(latest, key, 'by_tid')
    write(latest, key, 'by_rev')
    # Write the new value to the 'current' table last
    write(new_latest, key, 'current')
```
NOTE: There exists the possibility of a race whereby a VE read occurs between two concurrent updates, resulting in lost meta-data needed for the save. The pseudo-code above guards against this by writing the //new// value to the TTL-managed tables alongside copies of the current one.
### Option 1a
Precedence is first by revision, then by render; the `current` table must always return the latest render of the latest revision, even in the face of out-of-order writes. This presents a challenge for a table modeled as strictly key-value, since Cassandra is //last write wins//. As a workaround, this option proposes to use a constant for the write-time, effectively disabling the database's built-in conflict resolution. Since Cassandra falls back to a lexical comparison of values when it encounters identical timestamps, a binary value encoded first with the revision, then with a type-1 UUID, and finally the content itself is used to satisfy the precedence requirements.
```lang=sql, name=Strawman Cassandra schema
-- value is binary encoded; rev (as 32-bit big-endian), tid (as 128-bit type-1 UUID), and content
CREATE TABLE current (
"_domain" text,
title text,
value blob,
PRIMARY KEY ("_domain", title)
);
CREATE TABLE by_rev (
"_domain" text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY (("_domain", title, rev))
);
CREATE TABLE by_tid (
"_domain" text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY (("_domain", title, rev, tid))
);
```
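For illustration, a minimal sketch of the value encoding described above (e.g. with every write issued using a fixed `USING TIMESTAMP`); the helper names are hypothetical, and the exact layout would be an implementation detail:
```lang=python, name=Value encoding sketch (hypothetical)
import struct
import uuid

def encode(rev, tid, content):
    # 32-bit big-endian revision, then the 16 raw bytes of the type-1 UUID,
    # then the content; a byte-wise (lexical) comparison of two encoded
    # values is therefore dominated first by the revision, then by the tid
    # bytes, as the precedence requirement demands.
    return struct.pack('>I', rev) + tid.bytes + content

def decode(value):
    rev, = struct.unpack('>I', value[:4])
    return rev, uuid.UUID(bytes=value[4:20]), value[20:]
```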
#### Issues/Drawbacks
- Breaks `DELETE` semantics (w/o timestamps, tombstones do not have precedence)
- Defeats a read optimization that can exclude SSTables from reads (relies on timestamps)
- Defeats a compaction optimization meant to eliminate overlaps for tombstone GC (relies on timestamps)
- An abuse of the "tie-breaker mechanism"
  - Lexical value comparison is only meant as a fall-back for a rare occurrence (accidentally identical timestamps)
  - Lexical comparison is not a part of the contract, and could change in the future w/o warning (it has changed w/o warning before)
- Cassandra semantics are explicitly //last write wins//; a violation of best practice, and isolating
### Option 1b
Identical to the 1a proposal above, with the exception of how the `current` table is implemented: in this approach, `current` is modeled as "wide rows", utilizing a revision-based clustering key. For any given `rev`, re-renders result in the `tid` and `value` attributes being overwritten each time. To prevent unbounded growth in the number of stored revisions, range deletes are batched with the `INSERT`.
```lang=sql, name=Strawman Cassandra schema
CREATE TABLE current (
"_domain" text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY (("_domain", title), rev)
);
-- Same as Option 1a above
CREATE TABLE by_rev (
"_domain" text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY (("_domain", title, rev))
);
CREATE TABLE by_tid (
"_domain" text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY (("_domain", title, rev, tid))
);
```
```lang=sql, name=Example: Batched INSERT+DELETE
BEGIN BATCH
INSERT INTO current ("_domain", title, rev, tid, value) VALUES (?, ?, 10000, ?, ?);
DELETE FROM current WHERE "_domain" = ? AND title = ? AND rev < 10000;
APPLY BATCH;
```
NOTE: Range deletes with inequality matches (LT, GT, LTE, GTE) require Cassandra >= 3.0
NOTE: It is only necessary to batch a corresponding `DELETE` when creating a new revision; re-renders can be performed with an unbatched/standalone `INSERT`
NOTE: Batching `DELETE`s can be done probabilistically; allowing a small number of revisions to accumulate to limit write amplification is likely a worthwhile optimization (see the sketch below)
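A sketch of the probabilistic batching described in the note above, using the Python driver; the keyspace name, helper name, and probability value are assumptions:
```lang=python, name=Probabilistic DELETE batching sketch (hypothetical)
import random

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

# Hypothetical tuning knob: trades partition size for write amplification
DELETE_PROBABILITY = 0.1

session = Cluster().connect('strawman')  # assumed keyspace
insert = session.prepare(
    'INSERT INTO current ("_domain", title, rev, tid, value) VALUES (?, ?, ?, ?, ?)')
delete = session.prepare(
    'DELETE FROM current WHERE "_domain" = ? AND title = ? AND rev < ?')

def store_new_revision(domain, title, rev, tid, value):
    if random.random() < DELETE_PROBABILITY:
        # Occasionally piggy-back the range delete on the INSERT, allowing
        # a small number of superseded revisions to accumulate in between.
        batch = BatchStatement()
        batch.add(insert, (domain, title, rev, tid, value))
        batch.add(delete, (domain, title, rev))
        session.execute(batch)
    else:
        session.execute(insert, (domain, title, rev, tid, value))
```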
#### Issues/Drawbacks
- Creates a hard dependency on [[ https://phabricator.wikimedia.org/T160570 | Cassandra 3.x ]]
## Option 2: Retention policies using application-level TTLs
This approach uses a schema identical to that of the current storage model: wide rows model a one-to-many relationship between a title and its revisions, and between each revision and its corresponding renders. It differs only in its approach to retention.
Since renders are keyed on a type-1 UUID, maintaining a single current render plus (at least) 24 hours' worth of past renders is as simple as batching a range delete with each new render, using a `tid` predicate 24 hours earlier than the one being inserted (see the sketch below).
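For example, the `tid` bound for such a delete could be synthesized from a wall-clock timestamp using the Python driver's `cassandra.util` helpers; the constant and helper name are assumptions:
```lang=python, name=Render-retention tid predicate sketch (hypothetical)
import time

from cassandra.util import min_uuid_from_time

RENDER_RETENTION_SECONDS = 24 * 3600  # (at least) 24 hours of past renders

def render_cutoff_tid(now=None):
    # Smallest possible type-1 UUID for (now - 24h); renders with a tid at
    # or below this value fall outside the retention window, and are safe
    # to delete (usable as the bound in 'tid <= ?').
    now = time.time() if now is None else now
    return min_uuid_from_time(now - RENDER_RETENTION_SECONDS)
```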
Limiting //revisions// is more challenging, since a revision is a monotonically increasing integer without temporal context. As a result, an additional table is needed to establish this relationship, mapping timestamps to the corresponding revisions. Records in the `timeline` table are keyed by domain (on the assumption that MediaWiki sharding would never be more granular than this). Updates can be performed probabilistically, if necessary, and TTLs can be applied to prevent unbounded partition growth.
```lang=sql, name=Strawman Cassandra schema
CREATE TABLE current (
"_domain" text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY (("_domain", title), rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC);
CREATE TABLE timeline (
domain text,
ts timestamp,
rev int,
PRIMARY KEY(domain, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
AND compaction = {'class': 'TimeWindowCompactionStrategy'}
AND default_time_to_live = 2592000; -- 30 days(?)
```
```lang=sql, name=Sample queries
-- Find a revision that corresponds to a timestamp outside of the retention period
SELECT ts, rev FROM timeline WHERE domain = ? AND ts < ? LIMIT 1;
-- When storing a new revision, a DELETE to clean up older revisions can be batched with the INSERT, (the
-- 'rev' predicate comes from the SELECT above).
BEGIN BATCH
INSERT INTO current ("_domain", title, rev, tid, value) VALUES (?, ?, ?, ?, ?);
DELETE FROM current WHERE "_domain" = ? AND title = ? AND rev <= ?;
INSERT INTO timeline (domain, ts, rev) VALUES (?, ?, ?);
APPLY BATCH;
-- When storing a new render (extant revision), a DELETE to clean up older renders can be batched with the
-- INSERT, (the 'tid' predicate is synthesized using a timestamp derived from the one being inserted, minus
-- the duration of the retention period).
BEGIN BATCH
INSERT INTO current ("_domain", title, rev, tid, value) VALUES (?, ?, ?, ?, ?);
DELETE FROM current WHERE "_domain" = ? AND title = ? AND rev = ? AND tid <= ?;
APPLY BATCH;
-- Latest render of the latest revision
SELECT rev, tid, value FROM current WHERE "_domain" = ? AND title = ? LIMIT 1;
-- Latest render for a specific revision
SELECT tid, value FROM current WHERE "_domain" = ? AND title = ? AND rev = ? LIMIT 1;
-- Specific render
SELECT value FROM current WHERE "_domain" = ? AND title = ? AND rev = ? AND tid = ?;
```
NOTE: `DELETE`s of revisions or renders can be applied probabilistically
NOTE: Timeline `INSERT`s can be applied probabilistically
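Putting the pieces together, a sketch of the new-revision write path under Option 2; the prepared statements correspond to the sample queries above, while the keyspace name, helper names, and probability values are assumptions:
```lang=python, name=Option 2 new-revision sketch (hypothetical)
import random
from datetime import datetime, timedelta

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

REVISION_RETENTION = timedelta(days=30)  # matches the 30-day TTL above
CLEANUP_PROBABILITY = 0.1                # hypothetical
TIMELINE_INSERT_PROBABILITY = 0.01       # hypothetical

session = Cluster().connect('strawman')  # assumed keyspace
find_cutoff = session.prepare(
    'SELECT ts, rev FROM timeline WHERE domain = ? AND ts < ? LIMIT 1')
insert_current = session.prepare(
    'INSERT INTO current ("_domain", title, rev, tid, value) VALUES (?, ?, ?, ?, ?)')
delete_revisions = session.prepare(
    'DELETE FROM current WHERE "_domain" = ? AND title = ? AND rev <= ?')
insert_timeline = session.prepare(
    'INSERT INTO timeline (domain, ts, rev) VALUES (?, ?, ?)')

def store_new_revision(domain, title, rev, tid, value):
    batch = BatchStatement()
    batch.add(insert_current, (domain, title, rev, tid, value))
    if random.random() < CLEANUP_PROBABILITY:
        # Find a revision older than the retention window, and batch a
        # range delete to clean up everything at or below it.
        row = session.execute(
            find_cutoff, (domain, datetime.utcnow() - REVISION_RETENTION)).one()
        if row is not None:
            batch.add(delete_revisions, (domain, title, row.rev))
    if random.random() < TIMELINE_INSERT_PROBABILITY:
        # Probabilistically record a timestamp-to-revision mapping
        batch.add(insert_timeline, (domain, datetime.utcnow(), rev))
    session.execute(batch)
```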
### Properties of `timeline` table
| wiki | edit frequency | rows @ 30d retention | rows @ 10d retention |
|------|------------------|-----------------|----------------|
| mkwikimedia | 1.3/day (0.000015046/s) | 39 | 13 |
| enwiki | 186497/day (2.16/s) | 5594910 | 1864970 |
- The number of domains is on the order of 100s
- Most domains have an edit frequency on the order of 100s per day
- 4 see an edit frequency on the order of 10s of K per day
- 2 see an edit frequency on the order of 100s of K per day
- Distribution of data across this table's partitions will be poor (skewed toward the largest wikis), but does it matter?
NOTE: Source: P5383 (and P5390)
____
## See also
- {T156209}