(WARNING) WIP; please stand by...
As a sub-task of {T120171}, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner.
## Option 1: Table-per-query
This approach materializes results as distinct tables, one per supported query.
### Queries
- The most current render of the most current revision (table: `current`)
- The most current render of a specific revision (table: `by_rev`)
- A specific render of a specific revision (table: `by_tid`)
NOTE: The `by_rev` and `by_tid` tables exist only to support VE concurrency; they need to retain superseded versions only for a reasonable period after those versions have been replaced.
### Algorithm
Data in the `current` table must be durable, but the contents of `by_rev` and `by_tid` can be ephemeral (and should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in `current` has been superseded by something more recent. There are two ways of accomplishing this: a) copying the values on a read from `current`, or b) copying them on update, just before replacing a value in `current`. Neither strategy is ideal.
Copy-on-read is problematic even for non-VE use-cases because of the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy //must// be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity and latency. Copy-on-update over-commits by default, copying from `current` for every new render regardless of the probability it will be edited, but it happens asynchronously without impacting user requests, and can be done reliably. This proposal therefore uses the //copy-on-update// approach:
# Value is read from `current` table
# Value is copied to `by_tid` table
# Value is copied to `by_rev` table
# If steps 2 and 3 succeed, the new value is written to `current`
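The copy-on-update sequence above can be sketched with an in-memory model (the dictionaries and names below are illustrative stand-ins for the Cassandra tables, not actual RESTBase code):

```lang=python
# Illustrative in-memory model of the copy-on-update sequence.
# The three dicts stand in for the Cassandra tables described above.
current = {}   # (domain, title) -> (rev, tid, content)
by_rev = {}    # (domain, title, rev) -> (tid, content)
by_tid = {}    # (domain, title, rev, tid) -> content

def update(domain, title, rev, tid, content):
    key = (domain, title)
    old = current.get(key)  # step 1: read the soon-to-be-superseded value
    if old is not None:
        old_rev, old_tid, old_content = old
        # Steps 2-3: copy the superseded value to the ephemeral tables
        # (in Cassandra these writes would carry a TTL).
        by_tid[(domain, title, old_rev, old_tid)] = old_content
        by_rev[(domain, title, old_rev)] = (old_tid, old_content)
    # Step 4: only once the copies succeed is `current` replaced.
    current[key] = (rev, tid, content)

update("en.wikipedia.org", "Foo", 1, "tid-a", "<html>v1</html>")
update("en.wikipedia.org", "Foo", 2, "tid-b", "<html>v2</html>")
```

After the second call, `current` holds only revision 2, while revision 1 survives (temporarily) in `by_rev` and `by_tid`.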
### Option 1a
Precedence is first by revision, then by render; the `current` table must always return the latest render for the latest revision, even in the face of out-of-order writes. This presents a challenge for a table modeled as strictly key-value, since Cassandra is //last write wins//. As a workaround, this option proposes using a constant write-time, effectively disabling the database's built-in conflict resolution. Since Cassandra falls back to a lexical comparison of values when it encounters identical timestamps, a binary value encoded first with the revision, and then with a type-1 UUID, is used to satisfy the precedence requirements.
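A minimal sketch of the idea (`encode_value` is a hypothetical helper, not the actual implementation). Because the revision occupies the leading bytes in big-endian form, a plain byte-wise comparison (which is what Cassandra falls back to on identical write timestamps) orders values by revision first, with the tid bytes acting as a deterministic tiebreak:

```lang=python
import struct
import uuid

def encode_value(rev: int, tid: uuid.UUID, content: bytes) -> bytes:
    # rev as 32-bit big-endian, then the 16 tid bytes, then the content.
    # Big-endian ensures byte-wise (lexical) order matches numeric order.
    return struct.pack(">I", rev) + tid.bytes + content

older = encode_value(41, uuid.uuid1(), b"older render")
newer = encode_value(42, uuid.uuid1(), b"newer render")

# However the replicas receive these, resolving the conflict lexically
# (keeping the greater of the two blobs) always retains the higher rev.
assert max(older, newer) == newer
```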
#### Strawman Cassandra schemas
```lang=sql
-- value is binary encoded: rev (as 32-bit big-endian), tid (as 128-bit type-1 UUID), and content
CREATE TABLE current (
    "_domain" text,
    title text,
    value blob,
    PRIMARY KEY (("_domain", title))
);

CREATE TABLE by_rev (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev))
);

CREATE TABLE by_tid (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev, tid))
);
```
### Option 1b
This option is identical to 1a above, with the exception of how the `current` table is implemented.
#### Strawman Cassandra schemas
```lang=sql
CREATE TABLE current (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title), rev)
-- store revisions in descending order, so the first row is the most current
) WITH CLUSTERING ORDER BY (rev DESC);

-- Same as Option 1a above
CREATE TABLE by_rev (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev))
);

CREATE TABLE by_tid (
    "_domain" text,
    title text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", title, rev, tid))
);
```
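For illustration, reading the most current value under 1b amounts to selecting the highest `rev` within the partition (e.g. `SELECT ... LIMIT 1` with descending clustering order on `rev`). A rough in-memory model of that read path (hypothetical names, not actual code):

```lang=python
# Hypothetical in-memory model of Option 1b's `current` table:
# one partition per (domain, title), rows keyed by rev.
from collections import defaultdict

current_1b = defaultdict(dict)  # (domain, title) -> {rev: (tid, content)}

def write(domain, title, rev, tid, content):
    current_1b[(domain, title)][rev] = (tid, content)

def read_latest(domain, title):
    # Equivalent to: SELECT rev, tid, value FROM current
    #                WHERE "_domain" = ? AND title = ?
    #                ORDER BY rev DESC LIMIT 1
    rows = current_1b[(domain, title)]
    rev = max(rows)
    tid, content = rows[rev]
    return rev, tid, content

# Out-of-order writes: rev 2 arrives before rev 1, yet the read
# still returns the highest revision.
write("en.wikipedia.org", "Foo", 2, "tid-b", "v2")
write("en.wikipedia.org", "Foo", 1, "tid-a", "v1")
assert read_latest("en.wikipedia.org", "Foo")[0] == 2
```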
## Option 2: Retention policies using application-level TTLs
!!TODO!!
----
### Performance considerations
#### The `data` table
This table amounts to simple key-value storage (there are no range queries in the conventional sense); a growing set of keys is perpetually updated (the values overwritten), at wildly varying frequencies. It will suffer from one of the problems described in {T144431}: a significant percentage of the key set will tend to work its way across the entire set of SSTables, making the all-critical SSTables-per-read metric equivalent to the total number of SSTables, and making the reclamation of overwritten data problematic. TL;DR: this is a workload that is at odds with log-structured storage.
However, given the relatively small size and slow growth of the live data, it seems likely that this use-case can be made tractable. One way would be to accept high compaction write amplification, and a potentially higher-than-ideal, but bounded, tombstone GC. For example, LCS configured with very large table sizes to limit the number of quiescent levels to 2 or 3 would bound SSTables-per-read and create a sustainable (if high) droppable-tombstone rate.
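To put rough numbers behind the claim that very large LCS table sizes bound the level count: with a fan-out of 10, the number of quiescent levels grows with the base-10 log of the SSTable count, and a key present in every level costs at most that many SSTables per read. The figures below are illustrative, not measured:

```lang=python
import math

def lcs_levels(dataset_bytes, sstable_bytes, fanout=10):
    # Rough approximation: each LCS level holds ~fanout times the
    # previous one, so levels ~= log_fanout(total SSTable count).
    sstables = max(1, math.ceil(dataset_bytes / sstable_bytes))
    return max(1, math.ceil(math.log(sstables, fanout)))

GiB = 1 << 30
dataset = 1024 * GiB  # hypothetical 1 TiB per instance

# Default-ish 160 MiB tables vs. very large 50 GiB tables:
default_levels = lcs_levels(dataset, 160 * (1 << 20))  # -> 4 levels
large_levels = lcs_levels(dataset, 50 * GiB)           # -> 2 levels
assert large_levels <= 3 and large_levels < default_levels
```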
Another alternative: upgrading to Cassandra >= 3.2 would make [[ https://issues.apache.org/jira/browse/CASSANDRA-6696 | CASSANDRA-6696 ]] available to us. By dividing an instance's dataset over many compactors, locality could be improved regardless of compaction strategy.
More thorough testing (using less contrived workloads/data), in a more production-like environment, is needed.
#### The `ttl` table
This table uses a clustering key to create a many-to-one relationship between `timeuuid`s (i.e. renders) and the partition key. Queries that supply a predicate on the `timeuuid` will be efficient regardless of the number of tombstones, but queries for the latest `timeuuid` have the potential to cause problems (similar to those of the existing k-r-v model). That is to say: for any given value size, there will exist a rate of re-renders (of a single revision) high enough that the live + tombstoned results become prohibitively large.
That said, for this to be a problem, the re-render rate would need to be high enough to accumulate a critical mass of tombstones within a period bounded by `time_window`, and/or the TTL period plus the time-to-GC. As this table is write-once, TTL-only, and will make use of TWCS, tombstone GC should be fairly deterministic. It seems likely that we can find a configuration safe from any real-world re-render rate we might see.
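As a rough sanity check (the rates and windows below are invented for illustration): the number of live-plus-tombstoned cells a latest-`timeuuid` query must scan is approximately the re-render rate multiplied by the window during which superseded renders have not yet been GCed:

```lang=python
def cells_to_scan(rerenders_per_day, ttl_days, gc_days):
    # Superseded renders linger for roughly the TTL plus the time-to-GC
    # before their tombstones become purgeable.
    return rerenders_per_day * (ttl_days + gc_days)

# Ten re-renders/day with a 1-day TTL and ~2 days to GC leaves ~30
# cells in the partition: harmless.
typical = cells_to_scan(10, 1, 2)

# Only a pathological rate pushes the scan into problem territory.
pathological = cells_to_scan(100_000, 1, 2)
assert typical == 30 and pathological == 300_000
```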
See [[ https://docs.google.com/document/d/1qd8XilG5Jt0TRm5mMEokCG6d0_DkyReCi85KKlh-i8c/edit#heading=h.m7ioe5euvcac | this document ]].
----
## See also
- {T156209}