Change Details

(WARNING) WIP; Please stand-by... As a sub-task of T120171, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner. ## Option 1: Table-per-query This approach materializes views of results using distinct tables, each corresponding to a query. ### Queries - The most current render of the most current revision (table: `current`) - The most current render of a specific revision (table: `by_rev`) - A specific render of a specific revision (table: `by_tid`) ### Algorithm Data in the `current` table must be durable, but the contents of `by_rev` and `by_tid` can be ephemeral (should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in `current` has been superseded by something more recent. There are two ways of accomplishing this, either by a) copying the values on a read from `current`, or b) copying them on update, prior to replacing a value in `current`. Neither of these strategies are ideal. For example, with non-VE use-cases, copy-on-read is problematic due to the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy //must// be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity, and latency. Copy-on-update over-commits by default, copying from `current` for every new render, regardless of the probability it will be edited, but happens asynchronously without impacting user requests, and can be done reliably. This proposal uses the //copy-on-update// approach. ```lang=python, name=Update logic pseudo-code # Read current latest value from 'current' table current = read_current() # Copy the values using the correct sequence for i in (section_offsets, data-parsoid, html): for j in (by_tid, by_rev, current): store_current(i, j, current) # On success, write the new current to 'current' store_new_current(latest) ``` ### Option 1a Precedence is first by revision, then by render; The current table must always return the latest render for the latest revision, even in the face of out-of-order writes. This presents a challenge for a table modeled as strictly key-value, since Cassandra is //last write wins//. As a work around, this option proposes to use a constant for write-time, effectively disabling the database's in-built conflict resolution. Since Cassandra falls back to a lexical comparison of values when encountering identical timestamps, a binary value encoded first with the revision, and then with a type-1 UUID is used to satisfy precedence requirements. ```lang=sql, name=Strawman Cassandra schema -- value is binary encoded; rev (as 32-bit big-endian), tid (as 128-bit type-1 UUID), and content CREATE TABLE current ( "_domain" text, title text, value blob, PRIMARY KEY ("_domain", title) ); CREATE TABLE by_rev ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev)) ); CREATE TABLE by_tid ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev, tid)) ); ``` #### Issues/Drawbacks - Breaks `DELETE` semantics (w/o timestamps, tombstones do not have precedence) - Defeats a read optimization that can eliminate SSTables from reads (relies on timestamps) - Defeats compaction optimization meant to eliminate overlaps for tombstone GC (relies on timestamps) - An abuse of the "tie-breaker mechanism" - Lexical value comparison only meant as a fall-back for a rare occurrence (accidentally identical timestamps) - Lexical comparison not a part of the contract, could change in the future w/o warning (has changed w/o warning) - Cassandra semantics explicitly //last write wins//; Violation of best-practice, and isolating ### Option 1b Identical to the 1a proposal above, with the exception of how the `current` table is implemented; In this approach, `current` is modeled as "wide rows", utilizing a revision-based clustering key. For any given `rev`, re-renders result in the `tid` and `value` attributes being overwritten each time. To prevent unbounded grow of revisions, range deletes are batched with the `INSERT`. ```lang=sql, name=Strawman Cassandra schema CREATE TABLE current ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title), rev) ); -- Same as Option 1a above CREATE TABLE by_rev ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev)) ); CREATE TABLE by_tid ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev, tid)) ); ``` ```lang=sql, name=Example: Batched INSERT+DELETE BEGIN BATCH INSERT INTO current ("_domain", title, rev, tid, value) VALUES (?, ?, 10000, ?, ?); DELETE FROM current WHERE "_domain" = ? AND title = ? AND rev < 10000; APPLY BATCH ``` NOTE: Range deletes with inequality matches (LT, GT, LTE, GTE) require Cassandra >= 3.0 NOTE: It is only necessary to batch a corresponding `DELETE` when creating a new revision; Re-renders can be preformed with an unbatched/standalone `INSERT` NOTE: Batching `DELETE`s can be done probabilistically; Allowing a small number of revisions to accumulate to limit write amplification is likely a worthwhile optimization. #### Issues/Drawbacks - Creates a hard dependency on [[ https://phabricator.wikimedia.org/T160570 | Cassandra 3.x ]] ## Option 2: Retention policies using application-level TTLs This approach uses a schema identical to that of the current storage model, one that utilizes wide rows to model a one-to-many relationship between a title and its revisions, and a one-to-many relationship between each revision and its corresponding renders. It differs only in how it approaches retention. Since renders are keyed on a type-1 UUID, maintaining a single current render, and (at least) 24 hours worth of past renders, is as simple as batching a range delete on new renders using a `tid` predicate 24 hours less than the one being inserted. Limiting renders is more challenging, since the revision is a monotonically increasing integer, without any temporal context. ```lang=sql, name=Strawman Cassandra schema CREATE TABLE current ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title), rev, tid) ); CREATE TABLE timeline ( domain text, rev int, ts timestamp, PRIMARY KEY(domain, rev) ); ``` ____ ## See also - {T156209}

(WARNING) WIP; Please stand-by... As a sub-task of T120171, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner. ## Option 1: Table-per-query This approach materializes views of results using distinct tables, each corresponding to a query. ### Queries - The most current render of the most current revision (table: `current`) - The most current render of a specific revision (table: `by_rev`) - A specific render of a specific revision (table: `by_tid`) ### Algorithm Data in the `current` table must be durable, but the contents of `by_rev` and `by_tid` can be ephemeral (should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in `current` has been superseded by something more recent. There are two ways of accomplishing this, either by a) copying the values on a read from `current`, or b) copying them on update, prior to replacing a value in `current`. Neither of these strategies are ideal. For example, with non-VE use-cases, copy-on-read is problematic due to the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy //must// be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity, and latency. Copy-on-update over-commits by default, copying from `current` for every new render, regardless of the probability it will be edited, but happens asynchronously without impacting user requests, and can be done reliably. This proposal uses the //copy-on-update// approach. ```lang=python, name=Update logic pseudo-code # Read current latest value from 'current' table current = read_current() # Copy the values using the correct sequence for i in (section_offsets, data-parsoid, html): for j in (by_tid, by_rev, current): store_current(i, j, current) # On success, write the new current to 'current' store_new_current(latest) ``` ### Option 1a Precedence is first by revision, then by render; The current table must always return the latest render for the latest revision, even in the face of out-of-order writes. This presents a challenge for a table modeled as strictly key-value, since Cassandra is //last write wins//. As a work around, this option proposes to use a constant for write-time, effectively disabling the database's in-built conflict resolution. Since Cassandra falls back to a lexical comparison of values when encountering identical timestamps, a binary value encoded first with the revision, and then with a type-1 UUID is used to satisfy precedence requirements. ```lang=sql, name=Strawman Cassandra schema -- value is binary encoded; rev (as 32-bit big-endian), tid (as 128-bit type-1 UUID), and content CREATE TABLE current ( "_domain" text, title text, value blob, PRIMARY KEY ("_domain", title) ); CREATE TABLE by_rev ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev)) ); CREATE TABLE by_tid ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev, tid)) ); ``` #### Issues/Drawbacks - Breaks `DELETE` semantics (w/o timestamps, tombstones do not have precedence) - Defeats a read optimization that can eliminate SSTables from reads (relies on timestamps) - Defeats compaction optimization meant to eliminate overlaps for tombstone GC (relies on timestamps) - An abuse of the "tie-breaker mechanism" - Lexical value comparison only meant as a fall-back for a rare occurrence (accidentally identical timestamps) - Lexical comparison not a part of the contract, could change in the future w/o warning (has changed w/o warning) - Cassandra semantics explicitly //last write wins//; Violation of best-practice, and isolating ### Option 1b Identical to the 1a proposal above, with the exception of how the `current` table is implemented; In this approach, `current` is modeled as "wide rows", utilizing a revision-based clustering key. For any given `rev`, re-renders result in the `tid` and `value` attributes being overwritten each time. To prevent unbounded grow of revisions, range deletes are batched with the `INSERT`. ```lang=sql, name=Strawman Cassandra schema CREATE TABLE current ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title), rev) ); -- Same as Option 1a above CREATE TABLE by_rev ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev)) ); CREATE TABLE by_tid ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev, tid)) ); ``` ```lang=sql, name=Example: Batched INSERT+DELETE BEGIN BATCH INSERT INTO current ("_domain", title, rev, tid, value) VALUES (?, ?, 10000, ?, ?); DELETE FROM current WHERE "_domain" = ? AND title = ? AND rev < 10000; APPLY BATCH ``` NOTE: Range deletes with inequality matches (LT, GT, LTE, GTE) require Cassandra >= 3.0 NOTE: It is only necessary to batch a corresponding `DELETE` when creating a new revision; Re-renders can be preformed with an unbatched/standalone `INSERT` NOTE: Batching `DELETE`s can be done probabilistically; Allowing a small number of revisions to accumulate to limit write amplification is likely a worthwhile optimization. #### Issues/Drawbacks - Creates a hard dependency on [[ https://phabricator.wikimedia.org/T160570 | Cassandra 3.x ]] ## Option 2: Retention policies using application-level TTLs This approach uses a schema identical to that of the current storage model, one that utilizes wide rows to model a one-to-many relationship between a title and its revisions, and a one-to-many relationship between each revision and its corresponding renders. It differs only in how it approaches retention. Since renders are keyed on a type-1 UUID, maintaining a single current render, and (at least) 24 hours worth of past renders, is as simple as batching a range delete on new renders using a `tid` predicate 24 hours less than the one being inserted. Limiting renders is more challenging, since the revision is a monotonically increasing integer, without any temporal context. ```lang=sql, name=Strawman Cassandra schema CREATE TABLE current ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title), rev, tid) ) WITH CLUSTERING ORDER BY (rev DESC, tid DESC); CREATE TABLE timeline ( domain text, ts timestamp, rev int, PRIMARY KEY(domain, ts) ) WITH CLUSTERING ORDER BY (rev DESC) AND compaction = {'class': 'TimeWindowCompactionStrategy'} AND default_time_to_live = 2592000; -- 30 days(?) ``` ____ ## See also - {T156209}

(WARNING) WIP; Please stand-by... As a sub-task of T120171, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner. ## Option 1: Table-per-query This approach materializes views of results using distinct tables, each corresponding to a query. ### Queries - The most current render of the most current revision (table: `current`) - The most current render of a specific revision (table: `by_rev`) - A specific render of a specific revision (table: `by_tid`) ### Algorithm Data in the `current` table must be durable, but the contents of `by_rev` and `by_tid` can be ephemeral (should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in `current` has been superseded by something more recent. There are two ways of accomplishing this, either by a) copying the values on a read from `current`, or b) copying them on update, prior to replacing a value in `current`. Neither of these strategies are ideal. For example, with non-VE use-cases, copy-on-read is problematic due to the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy //must// be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity, and latency. Copy-on-update over-commits by default, copying from `current` for every new render, regardless of the probability it will be edited, but happens asynchronously without impacting user requests, and can be done reliably. This proposal uses the //copy-on-update// approach. ```lang=python, name=Update logic pseudo-code # Read current latest value from 'current' table current = read_current() # Copy the values using the correct sequence for i in (section_offsets, data-parsoid, html): for j in (by_tid, by_rev, current): store_current(i, j, current) # On success, write the new current to 'current' store_new_current(latest) ``` ### Option 1a Precedence is first by revision, then by render; The current table must always return the latest render for the latest revision, even in the face of out-of-order writes. This presents a challenge for a table modeled as strictly key-value, since Cassandra is //last write wins//. As a work around, this option proposes to use a constant for write-time, effectively disabling the database's in-built conflict resolution. Since Cassandra falls back to a lexical comparison of values when encountering identical timestamps, a binary value encoded first with the revision, and then with a type-1 UUID is used to satisfy precedence requirements. ```lang=sql, name=Strawman Cassandra schema -- value is binary encoded; rev (as 32-bit big-endian), tid (as 128-bit type-1 UUID), and content CREATE TABLE current ( "_domain" text, title text, value blob, PRIMARY KEY ("_domain", title) ); CREATE TABLE by_rev ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev)) ); CREATE TABLE by_tid ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev, tid)) ); ``` #### Issues/Drawbacks - Breaks `DELETE` semantics (w/o timestamps, tombstones do not have precedence) - Defeats a read optimization that can eliminate SSTables from reads (relies on timestamps) - Defeats compaction optimization meant to eliminate overlaps for tombstone GC (relies on timestamps) - An abuse of the "tie-breaker mechanism" - Lexical value comparison only meant as a fall-back for a rare occurrence (accidentally identical timestamps) - Lexical comparison not a part of the contract, could change in the future w/o warning (has changed w/o warning) - Cassandra semantics explicitly //last write wins//; Violation of best-practice, and isolating ### Option 1b Identical to the 1a proposal above, with the exception of how the `current` table is implemented; In this approach, `current` is modeled as "wide rows", utilizing a revision-based clustering key. For any given `rev`, re-renders result in the `tid` and `value` attributes being overwritten each time. To prevent unbounded grow of revisions, range deletes are batched with the `INSERT`. ```lang=sql, name=Strawman Cassandra schema CREATE TABLE current ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title), rev) ); -- Same as Option 1a above CREATE TABLE by_rev ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev)) ); CREATE TABLE by_tid ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title, rev, tid)) ); ``` ```lang=sql, name=Example: Batched INSERT+DELETE BEGIN BATCH INSERT INTO current ("_domain", title, rev, tid, value) VALUES (?, ?, 10000, ?, ?); DELETE FROM current WHERE "_domain" = ? AND title = ? AND rev < 10000; APPLY BATCH ``` NOTE: Range deletes with inequality matches (LT, GT, LTE, GTE) require Cassandra >= 3.0 NOTE: It is only necessary to batch a corresponding `DELETE` when creating a new revision; Re-renders can be preformed with an unbatched/standalone `INSERT` NOTE: Batching `DELETE`s can be done probabilistically; Allowing a small number of revisions to accumulate to limit write amplification is likely a worthwhile optimization. #### Issues/Drawbacks - Creates a hard dependency on [[ https://phabricator.wikimedia.org/T160570 | Cassandra 3.x ]] ## Option 2: Retention policies using application-level TTLs This approach uses a schema identical to that of the current storage model, one that utilizes wide rows to model a one-to-many relationship between a title and its revisions, and a one-to-many relationship between each revision and its corresponding renders. It differs only in how it approaches retention. Since renders are keyed on a type-1 UUID, maintaining a single current render, and (at least) 24 hours worth of past renders, is as simple as batching a range delete on new renders using a `tid` predicate 24 hours less than the one being inserted. Limiting renders is more challenging, since the revision is a monotonically increasing integer, without any temporal context. ```lang=sql, name=Strawman Cassandra schema CREATE TABLE current ( "_domain" text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY (("_domain", title), rev, tid) ) WITH CLUSTERING ORDER BY (rev DESC, tid DESC); CREATE TABLE timeline ( domain text, ts timestamp, rev int, ts timestamp,PRIMARY KEY(domain, ts) ) WITH CLUSTERING ORDER BY (rev DESC) PRIMARY KEY(domain, rev)AND compaction = {'class': 'TimeWindowCompactionStrategy'} ); AND default_time_to_live = 2592000; -- 30 days(?) ``` ____ ## See also - {T156209}