Change Details

As a sub-task of T120171, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner. ## Option 1: Table-per-query This approach materializes views of results using distinct tables, each corresponding to a query. ### Queries - The most current render of the most current revision (table: `current`) - The most current render of a specific revision (table: `by_rev`) - A specific render of a specific revision (table: `by_tid`) ### Algorithm Data in the `current` table must be durable, but the contents of `by_rev` and `by_tid` can be ephemeral (should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in `current` has been superseded by something more recent. There are two ways of accomplishing this, either by a) copying the values on a read from `current`, or b) copying them on update, prior to replacing a value in `current`. Neither of these strategies are ideal. For example, with non-VE use-cases, copy-on-read is problematic due to the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy //must// be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity, and latency. Copy-on-update over-commits by default, copying from `current` for every new render, regardless of the probability it will be edited, but happens asynchronously without impacting user requests, and can be done reliably. This proposal uses the //copy-on-update// approach. ```lang=python, name=Update logic pseudo-code # Read current latest value from 'current' table current = read_current() # Copy the values using the correct sequence for i in (section_offsets, data-parsoid, html): for j in (by_tid, by_rev, current): store_current(i, j, current) # On success, write the new current to 'current' store_new_current(latest) ``` This algorithm suffers from race conditions on concurrent updates and VE edits. Here's an example: update 1 reads, update 2 reads, update 1 writes to all tables, overwrites the latest content, VE comes and takes the latest content, update 2 finishes and overwrites everything. In this scenario the content that VE got is never stashed, so the edit will be broken. With introducing separate tables for data-parsoid the complexity and the number of scenarios where the race is possible only grows. ### Option 1a Precedence is first by revision, then by render; The current table must always return the latest render for the latest revision, even in the face of out-of-order writes. This presents a challenge for a table modeled as strictly key-value, since Cassandra is //last write wins//. As a work around, this option proposes to use a constant for write-time, effectively disabling the database's in-built conflict resolution. Since Cassandra falls back to a lexical comparison of values when encountering identical timestamps, a binary value encoded first with the revision, and then with a type-1 UUID is used to satisfy precedence requirements. ```lang=sql, name=Strawman Cassandra schema -- value is binary encoded; rev (as 32-bit big-endian), tid (as 128-bit type-1 UUID), and content CREATE TABLE current ( "_domain" text, title text, value blob, PRIMARY KEY ("_domain", title) ); CREATE TABLE by_rev ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev)) ); CREATE TABLE by_tid ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev, tid)) ); ``` #### Issues/Drawbacks - Breaks `DELETE` semantics (w/o timestamps, tombstones do not have precedence) - Defeats a read optimization that can eliminate SSTables from reads (relies on timestamps) - Defeats compaction optimization meant to eliminate overlaps for tombstone GC (relies on timestamps) - An abuse of the "tie-breaker mechanism" - Lexical value comparison only meant as a fall-back for a rare occurrence (accidentally identical timestamps) - Lexical comparison not a part of the contract, could change in the future w/o warning (has changed w/o warning) - Cassandra semantics explicitly //last write wins//; Violation of best-practice, and isolating ### Option 1b Identical to the 1a proposal above, with the exception of how the `current` table is implemented; In this approach, `current` is modeled as "wide rows", utilizing a revision-based clustering key. For any given `rev`, re-renders result in the `tid` and `value` attributes being overwritten each time. To prevent unbounded grow of revisions, range deletes are batched with the `INSERT`. ```lang=sql, name=Strawman Cassandra schema CREATE TABLE current ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title), rev) ); -- Same as Option 1a above CREATE TABLE by_rev ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev)) ); CREATE TABLE by_tid ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev, tid)) ); ``` ```lang=sql, name=Example: Batched INSERT+DELETE BEGIN BATCH INSERT INTO current ("_domain", title, rev, tid, value) VALUES (?, ?, 10000, ?, ?); DELETE FROM current WHERE "_domain" = ? AND title = ? AND rev < 10000; APPLY BATCH ``` NOTE: Range deletes with inequality matches (LT, GT, LTE, GTE) require Cassandra >= 3.0 NOTE: It is only necessary to batch a corresponding `DELETE` when creating a new revision; Re-renders can be preformed with an unbatched/standalone `INSERT` NOTE: Batching `DELETE`s can be done probabilistically; Allowing a small number of revisions to accumulate to limit write amplification is likely a worthwhile optimization. #### Issues/Drawbacks - Creates a hard dependency on [[ https://phabricator.wikimedia.org/T160570 | Cassandra 3.x ]] ## Option 2: Retention policies using application-level TTLs This approach uses a schema identical to that of the current storage model, one that utilizes wide rows to model a one-to-many relationship between a title and its revisions, and a one-to-many relationship between each revision and its corresponding renders. It differs only in how it approaches retention. Since renders are keyed on a type-1 UUID, maintaining a single current render, and (at least) 24 hours worth of past renders, is as simple as batching a range delete on new renders using a `tid` predicate 24 hours less than the one being inserted. Limiting renders is more challenging, since the revision is a monotonically increasing integer, without temporal context. As a result, an additional table is needed to establish this relationship, mapping timestamps to corresponding revisions. Records in the `timeline` table are keyed by domain (on the assumption that mediawiki sharding would never be more granular than this). Updates can be performed probabilistically, if necessary. TTLs can be applied to prevent unbounded partition growth. ```lang=sql, name=Strawman Cassandra schema CREATE TABLE current ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title), rev, tid) ) WITH CLUSTERING ORDER BY (rev DESC, tid DESC); CREATE TABLE timeline ( domain text, ts timestamp, rev int, PRIMARY KEY(domain, ts) ) WITH CLUSTERING ORDER BY (rev DESC) AND compaction = {'class': 'TimeWindowCompactionStrategy'} AND default_time_to_live = 2592000; -- 30 days(?) ``` ```lang=python, name=Pseudo code # Look for a revision that corresponds with a timestamp at least 86400 seconds in the past target_rev = get_target_rev(domain, timestamp_for(tid) - 86400000) # If a target revision was found, range delete to cull past revisions older-than retention period if target_rev: batch( store_new_current(domain, title, rev, tid, value), delete_older_than(domain, title, target_rev), ) update_timeline(domain, timestamp_for(tid)) # If a target revision was NOT found, skip the range delete for this iteration else: store_new_current(domain, title, rev, tid, value) update_timeline(domain, timestamp_for(tid)) ``` ____ ## See also - {T156209}

As a sub-task of T120171, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner. ## Option 1: Table-per-query This approach materializes views of results using distinct tables, each corresponding to a query. ### Queries - The most current render of the most current revision (table: `current`) - The most current render of a specific revision (table: `by_rev`) - A specific render of a specific revision (table: `by_tid`) ### Algorithm Data in the `current` table must be durable, but the contents of `by_rev` and `by_tid` can be ephemeral (should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in `current` has been superseded by something more recent. There are two ways of accomplishing this, either by a) copying the values on a read from `current`, or b) copying them on update, prior to replacing a value in `current`. Neither of these strategies are ideal. For example, with non-VE use-cases, copy-on-read is problematic due to the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy //must// be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity, and latency. Copy-on-update over-commits by default, copying from `current` for every new render, regardless of the probability it will be edited, but happens asynchronously without impacting user requests, and can be done reliably. This proposal uses the //copy-on-update// approach. ```lang=python, name=Update logic pseudo-code # Read current latest value from 'current' table current = read_current() # Copy the values using the correct sequence for i in (section_offsets, data-parsoid, html): for j in (by_tid, by_rev, current): store_current(i, j, current) # On success, write the new current to 'current' store_new_current(latest) ``` This algorithm suffers from race conditions on concurrent updates and VE edits. Here's an example: update 1 reads, update 2 reads, update 1 writes to all tables, overwrites the latest content, VE comes and takes the latest content, update 2 finishes and overwrites everything. In this scenario the content that VE got is never stashed, so the edit will be broken. With introducing separate tables for data-parsoid the complexity and the number of scenarios where the race is possible only grows. ### Option 1a Precedence is first by revision, then by render; The current table must always return the latest render for the latest revision, even in the face of out-of-order writes. This presents a challenge for a table modeled as strictly key-value, since Cassandra is //last write wins//. As a work around, this option proposes to use a constant for write-time, effectively disabling the database's in-built conflict resolution. Since Cassandra falls back to a lexical comparison of values when encountering identical timestamps, a binary value encoded first with the revision, and then with a type-1 UUID is used to satisfy precedence requirements. ```lang=sql, name=Strawman Cassandra schema -- value is binary encoded; rev (as 32-bit big-endian), tid (as 128-bit type-1 UUID), and content CREATE TABLE current ( "_domain" text, title text, value blob, PRIMARY KEY ("_domain", title) ); CREATE TABLE by_rev ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev)) ); CREATE TABLE by_tid ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev, tid)) ); ``` #### Issues/Drawbacks - Breaks `DELETE` semantics (w/o timestamps, tombstones do not have precedence) - Defeats a read optimization that can eliminate SSTables from reads (relies on timestamps) - Defeats compaction optimization meant to eliminate overlaps for tombstone GC (relies on timestamps) - An abuse of the "tie-breaker mechanism" - Lexical value comparison only meant as a fall-back for a rare occurrence (accidentally identical timestamps) - Lexical comparison not a part of the contract, could change in the future w/o warning (has changed w/o warning) - Cassandra semantics explicitly //last write wins//; Violation of best-practice, and isolating ### Option 1b Identical to the 1a proposal above, with the exception of how the `current` table is implemented; In this approach, `current` is modeled as "wide rows", utilizing a revision-based clustering key. For any given `rev`, re-renders result in the `tid` and `value` attributes being overwritten each time. To prevent unbounded grow of revisions, range deletes are batched with the `INSERT`. ```lang=sql, name=Strawman Cassandra schema CREATE TABLE current ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title), rev) ); -- Same as Option 1a above CREATE TABLE by_rev ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev)) ); CREATE TABLE by_tid ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev, tid)) ); ``` ```lang=sql, name=Example: Batched INSERT+DELETE BEGIN BATCH INSERT INTO current ("_domain", title, rev, tid, value) VALUES (?, ?, 10000, ?, ?); DELETE FROM current WHERE "_domain" = ? AND title = ? AND rev < 10000; APPLY BATCH ``` NOTE: Range deletes with inequality matches (LT, GT, LTE, GTE) require Cassandra >= 3.0 NOTE: It is only necessary to batch a corresponding `DELETE` when creating a new revision; Re-renders can be preformed with an unbatched/standalone `INSERT` NOTE: Batching `DELETE`s can be done probabilistically; Allowing a small number of revisions to accumulate to limit write amplification is likely a worthwhile optimization. #### Issues/Drawbacks - Creates a hard dependency on [[ https://phabricator.wikimedia.org/T160570 | Cassandra 3.x ]] ## Option 2: Retention policies using application-level TTLs This approach uses a schema identical to that of the current storage model, one that utilizes wide rows to model a one-to-many relationship between a title and its revisions, and a one-to-many relationship between each revision and its corresponding renders. It differs only in how it approaches retention. Since renders are keyed on a type-1 UUID, maintaining a single current render, and (at least) 24 hours worth of past renders, is as simple as batching a range delete on new renders using a `tid` predicate 24 hours less than the one being inserted. Limiting renders is more challenging, since the revision is a monotonically increasing integer, without temporal context. As a result, an additional table is needed to establish this relationship, mapping timestamps to corresponding revisions. Records in the `timeline` table are keyed by domain (on the assumption that mediawiki sharding would never be more granular than this). Updates can be performed probabilistically, if necessary. TTLs can be applied to prevent unbounded partition growth. ```lang=sql, name=Strawman Cassandra schema CREATE TABLE current ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title), rev, tid) ) WITH CLUSTERING ORDER BY (rev DESC, tid DESC); CREATE TABLE timeline ( domain text, ts timestamp, rev int, PRIMARY KEY(domain, ts) ) WITH CLUSTERING ORDER BY (rev DESC) AND compaction = {'class': 'TimeWindowCompactionStrategy'} AND default_time_to_live = 2592000; -- 30 days(?) ``` ```lang=python, name=Pseudo code # Look for a revision that corresponds with a timestamp at least 86400 seconds in the past target_rev = get_target_rev(domain, timestamp_for(tid) - 86400000) # If a target revision was found, range delete to cull past revisions older-than retention period if target_rev: batch( store_new_current(domain, title, rev, tid, value), delete_older_than(domain, title, target_rev), ) update_timeline(domain, timestamp_for(tid)) # If a target revision was NOT found, skip the range delete for this iteration else: store_new_current(domain, title, rev, tid, value) update_timeline(domain, timestamp_for(tid)) ``` ```lang=sql, name=Sample queries -- Find a revision corresponding to a timestamp outside the retention period SELECT ts, rev FROM timeline WHERE domain = ? AND timestamp < ? LIMIT 1; -- Delete records with a revision preceding that returned above (one established to exist outside the retention period) DELETE FROM current WHERE "_domain" = ? AND title = ? AND rev <= ?; ``` ____ ## See also - {T156209}