Change Details

Beyond the explicitly stated objectives for RESTBase has been a mandate to store (more or less) complete history of transformations in order to enable future alternative use-cases. This has created a number of challenges that have so far been met with efforts such as: * Cluster expansions ([[ https://phabricator.wikimedia.org/T93790 | one executed ]], [[ https://phabricator.wikimedia.org/T139961 | one planned ]]) * Multi-instance * [[ https://phabricator.wikimedia.org/T125904 | Compression ]] * [[ https://phabricator.wikimedia.org/T140008 | Manual culling ]] * Automatic culling (aka revision retention policies) == Cassandra 101 == Cassandra is fundamentally a key-value store; Keys dictate placement in the cluster, or distribution, and are typically referred to as //partitions//. Values are sorted arrays of columns, that are indexed, and can be queried (with some limitations). ``` [ key ]( 1 )( 2 ) ... ( N ) ``` Additionally, there are abstractions (closely coupled to CQL), that allow you to map a tabular structure onto what would otherwise be a flat sorted array of columns. Consider the following example: ```lang=sql -- Table to store tracks by-artist and album CREATE TABLE music ( artist text, album text, track text, PRIMARY KEY(artist, album, track) ); ``` The schema above has 3 attributes, `artist`, `album`, and `track`. `artist` is the first element of the `PRIMARY KEY`, which makes it the partition key, and each subsequent attribute establishes a many-to-one relationship to the value preceding it (so there can be many `album`s for each `artist`, and many `track`s for each `album`). `INSERT`ing and `SELECT`ing data works just as you would expect... ``` cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'As I Am'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'This Dying Soul'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Endless Sacrifice'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Honor Thy Father'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Vacant'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Stream of Consciousness'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'In the Name of God'); cqlsh:db> SELECT * FROM music WHERE artist = 'Dream Theater'; artist | album | track ---------------+------------------+------------------------- Dream Theater | Train of Thought | As I Am Dream Theater | Train of Thought | Endless Sacrifice Dream Theater | Train of Thought | Honor Thy Father Dream Theater | Train of Thought | In the Name of God Dream Theater | Train of Thought | Stream of Consciousness Dream Theater | Train of Thought | This Dying Soul Dream Theater | Train of Thought | Vacant (7 rows) cqlsh:db> ``` ...but it's important to realize that under the covers, it is still an array of column values bound to a single key/partition, something like: ``` [ Dream Theater ]( Train of Thought,As I Am )( Train of Thought,Endless Sacrifice )( ... )( Train of Thought,Vacant ) ``` === On disk === Cassandra uses a [[ https://en.wikipedia.org/wiki/Log-structured_merge-tree | log-structure merge-tree ]]. As values are written, data is periodically flushed to disk into immutable files (log-structured), and the data for a given partition can (and will be) persisted across multiple files. Reads collate results from applicable files on disk (merge-tree), and compaction operates asynchronously to combine the files to keep merge complexity bounded. Since storage is log-structured, updates do not happen in place, a new record is appended with a more recent timestamp, and the merge-on-read reconciles what is current. Deletes are a special-case update, they insert a special marker, or //tombstone// to indicate any preceding values should not be considered in results. Additionally, a TTL property can be associated with a record to indicate its time-to-live, after which it becomes de facto tombstoned. In each of these cases, it falls to compaction to garbage-collect this obsolete/tombstoned data. == RESTBase 101 == Now consider a (simplified) version of RESTBase's key-rev-value data-model: ```lang=sql CREATE TABLE data ( key text, rev int, tid timeuuid, value blob, PRIMARY KEY (key, rev, tid) ); ``` In the schema above, `key` is the document title, `rev` the monotonically increasing revision (comes from mediawiki), `tid` a time-based UUID generated at the time of save, and `value` is the content being stored (parsoid-generated html, for example). An insertion happens for each new mediawiki revision, but also each time the content needs to be re-rendered (for example if a template edit would result in different parsoid output). === Revision retention policies === In an effort to somewhat limit the growth of partitions over time, we implemented what we call //revision retention policies//. Each time a new write completes, an async read is performed to query for some number of historical entries, a policy is applied to the results, and values which are excluded by policy are updated to include a TTL. == The Problems == Both the contrived data model above for storing audio tracks, and the RESTBase one, could be described as "unbounded". They are unbounded in the sense that a partition will continue to grow for as long as artists release new tracks, or for as many times as a given document is edited (and/or its content rerendered). This isn't necessary a bad thing if the practical limits on the growth of data are reasonable (i.e. how many tracks could a single artist release in their lifetime? 1,000s? 10,000s?). In the case of RESTBase, the values themselves are large (ranging anywhere from 10s of K, to several MB in size), and the number of values per-partition can grow very rapidly. Even after numerous efforts at culling data, we have seen partitions exceeding 30G(!) in size. (NOTE) The closest thing to an Official recommendation on partition size, is the default at which Cassandra issues a warning, which is 100MB. This may be somewhat on the conservative side of things; 100s of MB is probably OK, (GBs is not). The oft-cited issues with wide partitions (and we've experienced many of these): * Heap pressure on compaction * Heap pressure on query * High(er) query latency * Repairs inefficiencies (hashes calculated by-partition) * Streaming inefficiencies (streaming occurs by-partition) === Tombstone GC === As mentioned earlier, RESTBase employs revision retention policies to limit the number of revisions that accumulate over time. However, these create problems of their own. As you may recall, Cassandra garbage-collects tombstoned data during compaction. One issue with this process is the so-called //[[ http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html#compactions |overlapping tables problem]]//. These compactions typically only involve a subset of the on-disk files, and tombstoned data can only be GC'd when the tables under consideration represent the whole picture; You cannot drop otherwise expired data if there is a possibility that older (preceding) values exist in other files (tables //not// under consideration of the current compaction). Files that are not a part of a compaction can be ruled out as overlapping if they do not contain a partition, or if the maximum droppable tombstone in a file is greater. Since partitions in RESTBase run quite wide, the former is of little help in eliminating overlap (partitions tend to span all tables), and out-of-order writes generated by read-repair defeat the latter. This leaves us with partitions which remain wide due to uncollectable tombstoned data distributed throughout the partition. ---- I believe that, for all intents and purposes, RESTBase's use of Cassandra could be considered an anti-pattern, and I see 3 (broad) possible courses of action: # Rethink RESTBase's data model and access # Invest effort into solutions to make Cassandra better accommodate RESTBase's data model/access # Identify alternatives to Cassandra (NOTE) For either of #2 or #3 above, I believe it is imperative that we explicitly establish up front //how wide is wide enough//, or if it is truly our objective to be able to store partitions of unlimited width.

Beyond the explicitly stated objectives for RESTBase has been a mandate to store (more or less) complete history of transformations in order to enable future alternative use-cases. This has created a number of challenges that have so far been met with efforts such as: * Cluster expansions ([[ https://phabricator.wikimedia.org/T93790 | many executed ]], [[ https://phabricator.wikimedia.org/T139961 | one planned ]]) * Multi-instance * [[ https://phabricator.wikimedia.org/T125904 | Compression ]] * [[ https://phabricator.wikimedia.org/T140008 | Manual culling ]] * Automatic culling (aka revision retention policies) == Cassandra 101 == Cassandra is fundamentally a key-value store; Keys dictate placement in the cluster, or distribution, and are typically referred to as //partitions//. Values are sorted arrays of columns, that are indexed, and can be queried (with some limitations). ``` [ key ]( 1 )( 2 ) ... ( N ) ``` Additionally, there are abstractions (closely coupled to CQL), that allow you to map a tabular structure onto what would otherwise be a flat sorted array of columns. Consider the following example: ```lang=sql -- Table to store tracks by-artist and album CREATE TABLE music ( artist text, album text, track text, PRIMARY KEY(artist, album, track) ); ``` The schema above has 3 attributes, `artist`, `album`, and `track`. `artist` is the first element of the `PRIMARY KEY`, which makes it the partition key, and each subsequent attribute establishes a many-to-one relationship to the value preceding it (so there can be many `album`s for each `artist`, and many `track`s for each `album`). `INSERT`ing and `SELECT`ing data works just as you would expect... ``` cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'As I Am'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'This Dying Soul'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Endless Sacrifice'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Honor Thy Father'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Vacant'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Stream of Consciousness'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'In the Name of God'); cqlsh:db> SELECT * FROM music WHERE artist = 'Dream Theater'; artist | album | track ---------------+------------------+------------------------- Dream Theater | Train of Thought | As I Am Dream Theater | Train of Thought | Endless Sacrifice Dream Theater | Train of Thought | Honor Thy Father Dream Theater | Train of Thought | In the Name of God Dream Theater | Train of Thought | Stream of Consciousness Dream Theater | Train of Thought | This Dying Soul Dream Theater | Train of Thought | Vacant (7 rows) cqlsh:db> ``` ...but it's important to realize that under the covers, it is still an array of column values bound to a single key/partition, something like: ``` [ Dream Theater ]( Train of Thought,As I Am )( Train of Thought,Endless Sacrifice )( ... )( Train of Thought,Vacant ) ``` === On disk === Cassandra uses a [[ https://en.wikipedia.org/wiki/Log-structured_merge-tree | log-structure merge-tree ]]. As values are written, data is periodically flushed to disk into immutable files (log-structured), and the data for a given partition can (and will be) persisted across multiple files. Reads collate results from applicable files on disk (merge-tree), and compaction operates asynchronously to combine the files to keep merge complexity bounded. Since storage is log-structured, updates do not happen in place, a new record is appended with a more recent timestamp, and the merge-on-read reconciles what is current. Deletes are a special-case update, they insert a special marker, or //tombstone// to indicate any preceding values should not be considered in results. Additionally, a TTL property can be associated with a record to indicate its time-to-live, after which it becomes de facto tombstoned. In each of these cases, it falls to compaction to garbage-collect this obsolete/tombstoned data. == RESTBase 101 == Now consider a (simplified) version of RESTBase's key-rev-value data-model: ```lang=sql CREATE TABLE data ( key text, rev int, tid timeuuid, value blob, PRIMARY KEY (key, rev, tid) ); ``` In the schema above, `key` is the document title, `rev` the monotonically increasing revision (comes from mediawiki), `tid` a time-based UUID generated at the time of save, and `value` is the content being stored (parsoid-generated html, for example). An insertion happens for each new mediawiki revision, but also each time the content needs to be re-rendered (for example if a template edit would result in different parsoid output). === Revision retention policies === In an effort to somewhat limit the growth of partitions over time, we implemented what we call //revision retention policies//. Each time a new write completes, an async read is performed to query for some number of historical entries, a policy is applied to the results, and values which are excluded by policy are updated to include a TTL. == The Problems == Both the contrived data model above for storing audio tracks, and the RESTBase one, could be described as "unbounded". They are unbounded in the sense that a partition will continue to grow for as long as artists release new tracks, or for as many times as a given document is edited (and/or its content rerendered). This isn't necessary a bad thing if the practical limits on the growth of data are reasonable (i.e. how many tracks could a single artist release in their lifetime? 1,000s? 10,000s?). In the case of RESTBase, the values themselves are large (ranging anywhere from 10s of K, to several MB in size), and the number of values per-partition can grow very rapidly. Even after numerous efforts at culling data, we have seen partitions exceeding 30G(!) in size. (NOTE) The closest thing to an Official recommendation on partition size, is the default at which Cassandra issues a warning, which is 100MB. This may be somewhat on the conservative side of things; 100s of MB is probably OK, (GBs is not). The oft-cited issues with wide partitions (and we've experienced many of these): * Heap pressure on compaction * Heap pressure on query * High(er) query latency * Repairs inefficiencies (hashes calculated by-partition) * Streaming inefficiencies (streaming occurs by-partition) === Tombstone GC === As mentioned earlier, RESTBase employs revision retention policies to limit the number of revisions that accumulate over time. However, these create problems of their own. As you may recall, Cassandra garbage-collects tombstoned data during compaction. One issue with this process is the so-called //[[ http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html#compactions |overlapping tables problem]]//. These compactions typically only involve a subset of the on-disk files, and tombstoned data can only be GC'd when the tables under consideration represent the whole picture; You cannot drop otherwise expired data if there is a possibility that older (preceding) values exist in other files (tables //not// under consideration of the current compaction). Files that are not a part of a compaction can be ruled out as overlapping if they do not contain a partition, or if the maximum droppable tombstone in a file is greater. Since partitions in RESTBase run quite wide, the former is of little help in eliminating overlap (partitions tend to span all tables), and out-of-order writes generated by read-repair defeat the latter. This leaves us with partitions which remain wide due to uncollectable tombstoned data distributed throughout the partition. ---- I believe that, for all intents and purposes, RESTBase's use of Cassandra could be considered an anti-pattern, and I see 3 (broad) possible courses of action: # Rethink RESTBase's data model and access # Invest effort into solutions to make Cassandra better accommodate RESTBase's data model/access # Identify alternatives to Cassandra (NOTE) For either of #2 or #3 above, I believe it is imperative that we explicitly establish up front //how wide is wide enough//, or if it is truly our objective to be able to store partitions of unlimited width.

Beyond the explicitly stated objectives for RESTBase has been a mandate to store (more or less) complete history of transformations in order to enable future alternative use-cases. This has created a number of challenges that have so far been met with efforts such as: * Cluster expansions ([[ https://phabricator.wikimedia.org/T93790 | onemany executed ]], [[ https://phabricator.wikimedia.org/T139961 | one planned ]]) * Multi-instance * [[ https://phabricator.wikimedia.org/T125904 | Compression ]] * [[ https://phabricator.wikimedia.org/T140008 | Manual culling ]] * Automatic culling (aka revision retention policies) == Cassandra 101 == Cassandra is fundamentally a key-value store; Keys dictate placement in the cluster, or distribution, and are typically referred to as //partitions//. Values are sorted arrays of columns, that are indexed, and can be queried (with some limitations). ``` [ key ]( 1 )( 2 ) ... ( N ) ``` Additionally, there are abstractions (closely coupled to CQL), that allow you to map a tabular structure onto what would otherwise be a flat sorted array of columns. Consider the following example: ```lang=sql -- Table to store tracks by-artist and album CREATE TABLE music ( artist text, album text, track text, PRIMARY KEY(artist, album, track) ); ``` The schema above has 3 attributes, `artist`, `album`, and `track`. `artist` is the first element of the `PRIMARY KEY`, which makes it the partition key, and each subsequent attribute establishes a many-to-one relationship to the value preceding it (so there can be many `album`s for each `artist`, and many `track`s for each `album`). `INSERT`ing and `SELECT`ing data works just as you would expect... ``` cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'As I Am'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'This Dying Soul'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Endless Sacrifice'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Honor Thy Father'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Vacant'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'Stream of Consciousness'); cqlsh:db> INSERT INTO music (artist, album, track) VALUES ('Dream Theater', 'Train of Thought', 'In the Name of God'); cqlsh:db> SELECT * FROM music WHERE artist = 'Dream Theater'; artist | album | track ---------------+------------------+------------------------- Dream Theater | Train of Thought | As I Am Dream Theater | Train of Thought | Endless Sacrifice Dream Theater | Train of Thought | Honor Thy Father Dream Theater | Train of Thought | In the Name of God Dream Theater | Train of Thought | Stream of Consciousness Dream Theater | Train of Thought | This Dying Soul Dream Theater | Train of Thought | Vacant (7 rows) cqlsh:db> ``` ...but it's important to realize that under the covers, it is still an array of column values bound to a single key/partition, something like: ``` [ Dream Theater ]( Train of Thought,As I Am )( Train of Thought,Endless Sacrifice )( ... )( Train of Thought,Vacant ) ``` === On disk === Cassandra uses a [[ https://en.wikipedia.org/wiki/Log-structured_merge-tree | log-structure merge-tree ]]. As values are written, data is periodically flushed to disk into immutable files (log-structured), and the data for a given partition can (and will be) persisted across multiple files. Reads collate results from applicable files on disk (merge-tree), and compaction operates asynchronously to combine the files to keep merge complexity bounded. Since storage is log-structured, updates do not happen in place, a new record is appended with a more recent timestamp, and the merge-on-read reconciles what is current. Deletes are a special-case update, they insert a special marker, or //tombstone// to indicate any preceding values should not be considered in results. Additionally, a TTL property can be associated with a record to indicate its time-to-live, after which it becomes de facto tombstoned. In each of these cases, it falls to compaction to garbage-collect this obsolete/tombstoned data. == RESTBase 101 == Now consider a (simplified) version of RESTBase's key-rev-value data-model: ```lang=sql CREATE TABLE data ( key text, rev int, tid timeuuid, value blob, PRIMARY KEY (key, rev, tid) ); ``` In the schema above, `key` is the document title, `rev` the monotonically increasing revision (comes from mediawiki), `tid` a time-based UUID generated at the time of save, and `value` is the content being stored (parsoid-generated html, for example). An insertion happens for each new mediawiki revision, but also each time the content needs to be re-rendered (for example if a template edit would result in different parsoid output). === Revision retention policies === In an effort to somewhat limit the growth of partitions over time, we implemented what we call //revision retention policies//. Each time a new write completes, an async read is performed to query for some number of historical entries, a policy is applied to the results, and values which are excluded by policy are updated to include a TTL. == The Problems == Both the contrived data model above for storing audio tracks, and the RESTBase one, could be described as "unbounded". They are unbounded in the sense that a partition will continue to grow for as long as artists release new tracks, or for as many times as a given document is edited (and/or its content rerendered). This isn't necessary a bad thing if the practical limits on the growth of data are reasonable (i.e. how many tracks could a single artist release in their lifetime? 1,000s? 10,000s?). In the case of RESTBase, the values themselves are large (ranging anywhere from 10s of K, to several MB in size), and the number of values per-partition can grow very rapidly. Even after numerous efforts at culling data, we have seen partitions exceeding 30G(!) in size. (NOTE) The closest thing to an Official recommendation on partition size, is the default at which Cassandra issues a warning, which is 100MB. This may be somewhat on the conservative side of things; 100s of MB is probably OK, (GBs is not). The oft-cited issues with wide partitions (and we've experienced many of these): * Heap pressure on compaction * Heap pressure on query * High(er) query latency * Repairs inefficiencies (hashes calculated by-partition) * Streaming inefficiencies (streaming occurs by-partition) === Tombstone GC === As mentioned earlier, RESTBase employs revision retention policies to limit the number of revisions that accumulate over time. However, these create problems of their own. As you may recall, Cassandra garbage-collects tombstoned data during compaction. One issue with this process is the so-called //[[ http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html#compactions |overlapping tables problem]]//. These compactions typically only involve a subset of the on-disk files, and tombstoned data can only be GC'd when the tables under consideration represent the whole picture; You cannot drop otherwise expired data if there is a possibility that older (preceding) values exist in other files (tables //not// under consideration of the current compaction). Files that are not a part of a compaction can be ruled out as overlapping if they do not contain a partition, or if the maximum droppable tombstone in a file is greater. Since partitions in RESTBase run quite wide, the former is of little help in eliminating overlap (partitions tend to span all tables), and out-of-order writes generated by read-repair defeat the latter. This leaves us with partitions which remain wide due to uncollectable tombstoned data distributed throughout the partition. ---- I believe that, for all intents and purposes, RESTBase's use of Cassandra could be considered an anti-pattern, and I see 3 (broad) possible courses of action: # Rethink RESTBase's data model and access # Invest effort into solutions to make Cassandra better accommodate RESTBase's data model/access # Identify alternatives to Cassandra (NOTE) For either of #2 or #3 above, I believe it is imperative that we explicitly establish up front //how wide is wide enough//, or if it is truly our objective to be able to store partitions of unlimited width.