As a sub-task of T120171, this task discusses steps towards storing only current revisions, in a reliable, low-maintenance and low-latency manner.
## Option 1: Avoid tombstones with separate current revision & ttl tables
### Table schemas
```lang=sql
CREATE TABLE data (
    "_domain" text,
    title text,
    revision int,
    tid timeuuid,
    html text,
    data_parsoid text,
    section_offsets text,
    PRIMARY KEY ("_domain", title)
);

CREATE TABLE ttl (
    "_domain" text,
    time_window bigint,
    title text,
    revision int,
    tid timeuuid,
    html text,
    data_parsoid text,
    section_offsets text,
    PRIMARY KEY (("_domain", time_window, title, revision), tid)
);
```
### Algorithm
The latest content is always overwritten in the `data` table without a TTL. To protect against race conditions, the Cassandra write timestamp is set to the time of the edit, as determined by the `If-Modified-Since` header. Since the `If-Modified-Since` timestamp has a one-second resolution, we can numerically add the revision number to the write time to guard against sub-second race conditions.
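For illustration, a minimal sketch of such a write, with made-up values; since `USING TIMESTAMP` takes a literal, the edit-time-plus-revision sum would be computed client-side:

```lang=sql
-- Overwrite the current render. Write timestamp = edit time in µs
-- (1484000000000000) + revision (123456), computed by the client.
INSERT INTO data ("_domain", title, revision, tid,
                  html, data_parsoid, section_offsets)
VALUES ('en.wikipedia.org', 'Foobar', 123456, now(),
        '<p>new content</p>', '{}', '{}')
USING TIMESTAMP 1484000000123456;
```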
When an update arrives and a new render becomes the latest render, the following procedure is applied:
- Current latest content is copied to the `ttl` table with a TTL of 24 hours.
- The new render (revision) is written to the `data` table, overwriting the previous one.
This ensures that any ongoing edits that were using the previous content of the `data` table will succeed, because the content they depend on remains stored for another 24 hours. A sketch of both steps follows.
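As a rough illustration (all literal values, including the epoch-day `time_window` and the `timeuuid`, are made up for the example):

```lang=sql
-- Step 1: preserve the render being replaced, keeping it readable for
-- in-flight edits for 24 hours (86400s). Column values are the ones just
-- read from the `data` table; deriving time_window from the epoch day is
-- an assumption about how windows would be computed.
INSERT INTO ttl ("_domain", time_window, title, revision, tid,
                 html, data_parsoid, section_offsets)
VALUES ('en.wikipedia.org', 17167, 'Foobar', 123455,
        8b0a1c10-d5e2-11e6-9d46-8d7fd0cba2fa, '<p>old content</p>', '{}', '{}')
USING TTL 86400;

-- Step 2: overwrite the latest render in the `data` table (no TTL),
-- with the write timestamp derived as described above.
INSERT INTO data ("_domain", title, revision, tid,
                  html, data_parsoid, section_offsets)
VALUES ('en.wikipedia.org', 'Foobar', 123456, now(),
        '<p>new content</p>', '{}', '{}')
USING TIMESTAMP 1484000000123456;
```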
If the edit is made to an older revision, we check whether that revision is in the `ttl` table and potentially renew its TTL. If the older revision is not in storage, it is generated by Parsoid and stored in the `ttl` table, as sketched below.
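Since Cassandra TTLs are per-cell, "renewing" a TTL means reading the row back and re-writing it with a fresh TTL. A sketch with made-up values (this assumes the `time_window` of the original write can be re-derived, e.g. from the revision's render time):

```lang=sql
-- Look up the older revision's render.
SELECT tid, html, data_parsoid, section_offsets
FROM ttl
WHERE "_domain" = 'en.wikipedia.org'
  AND time_window = 17167
  AND title = 'Foobar'
  AND revision = 123455;

-- If present, write it back with a fresh 24-hour TTL; if not, re-render
-- with Parsoid and store the result the same way.
INSERT INTO ttl ("_domain", time_window, title, revision, tid,
                 html, data_parsoid, section_offsets)
VALUES ('en.wikipedia.org', 17167, 'Foobar', 123455,
        8b0a1c10-d5e2-11e6-9d46-8d7fd0cba2fa, '<p>old content</p>', '{}', '{}')
USING TTL 86400;
```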
### Implementation considerations
This could effectively be implemented as a revision-retention-policy in the scope of the `restbase-mod-table-cassandra` module. The policy could have 3 modes of operation:
- If `grace_ttl=0`, it works as key-value storage, always overwriting with the newer content. We can just create the `data` table and avoid creating the checkout table.
- If the policy is TTL-only, we just create the checkout (`ttl`) table.
- In mixed mode, we create both tables.
If all use-cases for a revision-retention-policy fit into these 3 options, we can completely remove the revision retention policy we have right now.
### Open questions
- Should HTML and data-parsoid be stored together or in separate tables? What are the performance implications of this? What is the complexity overhead of separating them?
- Should we just set the TTL globally on the `ttl` table?
### Performance considerations
#### The `data` table
This table amounts to simple key-value storage (there are no range queries in the conventional sense). A growing set of keys is perpetually updated (the values overwritten), at wildly varying frequencies. It will suffer from one of the problems described in {T144431}, namely that a significant percentage of the key set will have a tendency to work its way across the entire set of SSTables, making the all-critical SSTables-per-read metric equivalent to the total number of SSTables, and making the reclamation of overwritten data problematic. TL;DR: this is a workload that is at odds with log-structured storage.
However, given the relatively small size and slow growth of the live data, it seems likely that this use-case can be made tractable. One way would be to accept high compaction write amplification, and potentially a higher-than-ideal, but bounded, tombstone GC. For example, LCS configured with very large table sizes to limit the number of quiescent levels to 2 or 3 would bound SSTables-per-read and create a sustainable (if high) droppable-tombstone rate.
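Such a configuration might look as follows (illustrative only; the `sstable_size_in_mb` value is an assumption, not a tuned number):

```lang=sql
-- LCS with a very large per-sstable size target, so that the data set
-- fits within 2-3 levels, bounding SSTables-per-read.
ALTER TABLE data
WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'sstable_size_in_mb': 2048
};
```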
Another alternative: upgrading to Cassandra >= 3.2 would make [[ https://issues.apache.org/jira/browse/CASSANDRA-6696 | CASSANDRA-6696 ]] available to us. By dividing an instance's dataset over many compactors, locality could be improved regardless of compaction strategy.
More thorough testing (using less contrived workloads/data), in a more production-like environment is needed.
#### The `ttl` table
This table uses a clustering key to create a many-to-one relationship between `timeuuid`s (aka renders) and the partition key. Queries that supply a predicate for the `timeuuid` will be efficient regardless of the number of tombstones, but queries for the latest `timeuuid` have the potential to cause problems (similar to those of the existing k-r-v model). That is to say, for any given value size, there will exist a rate of re-renders (of a single revision) high enough that the live + tombstoned results become prohibitively large. Both query shapes are sketched below.
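For concreteness (values made up):

```lang=sql
-- Efficient regardless of tombstones: a point query for a known render.
SELECT html FROM ttl
WHERE "_domain" = 'en.wikipedia.org' AND time_window = 17167
  AND title = 'Foobar' AND revision = 123455
  AND tid = 8b0a1c10-d5e2-11e6-9d46-8d7fd0cba2fa;

-- Potentially problematic: selecting the latest render (no tid
-- predicate) may have to filter through whatever live and tombstoned
-- renders of this revision the partition has accumulated.
SELECT html FROM ttl
WHERE "_domain" = 'en.wikipedia.org' AND time_window = 17167
  AND title = 'Foobar' AND revision = 123455
ORDER BY tid DESC LIMIT 1;
```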
That said, for this to be a problem, the re-render rate would need to be high enough to accumulate a critical mass of tombstones within a period bounded by `time_window`, and/or the TTL period plus the time-to-GC. As this table is write-once and TTL-only, and will make use of TWCS, tombstone GC should be fairly deterministic. It seems likely that we can find a configuration safe from any real-world re-render rate we might see.
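The table properties this implies might look as follows (the window size and global TTL are assumptions; a global `default_time_to_live` also ties into the open question above):

```lang=sql
-- Write-once, TTL-only data compacted with TWCS, so that whole SSTables
-- expire together and tombstone GC stays reasonably deterministic.
ALTER TABLE ttl
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 24
}
AND default_time_to_live = 86400;
```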
NOTE: This is something else to consider when deciding whether to store all 3 values in the same table: the result size for each render would be larger, so fewer results would suffice to cause a problem, lowering the re-render rate we need to be concerned about.
See [this document](https://docs.google.com/document/d/1qd8XilG5Jt0TRm5mMEokCG6d0_DkyReCi85KKlh-i8c/edit#heading=h.m7ioe5euvcac).
## See also
- {T156209}