The dependency graph forms a directed acyclic graph (DAG), which we need to persist. Because of the large number of dependencies we need to manage (up to ~10 million uses for a single template, for example), efficient incremental paging through graph edges is critical.
The dependency graph tends to be fairly shallow: most relationships are currently modeled as a single level (link tables) without any recursion. We expect the depth to increase slightly in the future, but to remain very moderate overall. Native support for recursive queries would therefore be useful, but is not central to the dependency tracking system.
In our largest projects, link tables are also starting to get very large. Since most queries are fixed to a title, a sharded key-value storage solution that distributes the graph across many nodes can be a good fit.
Given these requirements, maintaining the dependency graph as a simple adjacency list is attractive. A distributed database like Cassandra scales adjacency lists well across many nodes, and can benefit from existing replication setups. The main disadvantage is that transitive graph queries require client-side iteration. However, given the need for control over how large dependency graphs are processed incrementally, this might not be such a bad thing. Twitter seems to use a similar approach based on [Flock](https://github.com/twitter/flockdb), listing transitive queries as an explicit non-goal of their graph storage system.
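To illustrate the client-side iteration this implies, here is a minimal sketch of a breadth-first transitive expansion with incremental paging. The `fetchChildren(parent, token)` pager is a hypothetical stand-in for a storage query returning one page of direct dependents plus a continuation token; the `limit` guard reflects the need to process very large graphs incrementally rather than all at once.

```javascript
// Sketch: client-side transitive expansion over an adjacency list.
// `fetchChildren(parent, token)` is a hypothetical pager returning
// { items: [...], next: tokenOrNull } for one page of direct dependents.
async function transitiveDependents(root, fetchChildren, limit = 10000) {
    const seen = new Set([root]);
    const queue = [root];
    const result = [];
    while (queue.length > 0 && result.length < limit) {
        const parent = queue.shift();
        let token = null;
        do {
            // Page through direct dependents incrementally instead of
            // loading millions of edges in one round trip.
            const page = await fetchChildren(parent, token);
            for (const child of page.items) {
                if (!seen.has(child)) {
                    seen.add(child);
                    result.push(child);
                    queue.push(child);
                }
            }
            token = page.next;
        } while (token && result.length < limit);
    }
    return result;
}
```

Bounding the expansion per request (and resuming from a cursor) keeps a single hot template with millions of uses from being expanded in one pass.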
## Design sketch
Considerations:
- **on edit**, we are looking for all dependent items
    - need `priority` for push/pull decision; process updates in priority order
    - need local (source) fragment ids (or some selector blob) to determine match
    - need destination fragment ids for processing
- **on pull-update**:
    - check whether anything in the page needs updating (`needs_update != null`)
    - if so, quickly find all items that need to be pull-updated (`child_selector`)
    - order processing by `priority` asc (low prio / pull first)
- **on dependency update**:
    - need an efficient and reliable way to select all previous dependencies in order to diff & update (by url & type)
    - possibly update fragment ids
    - update coalescing:
        - model all dependencies between two pages for one event type as one edge (selectors as separate field)
        - if entire page was re-rendered from scratch at time X, then fragments are also up to date with X
Possible table schema, using [RESTBase table schema syntax](https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md):
```
{
    table: 'dependencies',
    attributes: {
        child: 'string',
        type: 'int',
        priority: 'int',
        parent: 'string',
        parent_selector: 'json',
        child_selector: 'json',
        needs_update: 'timestamp',
        last_updated: 'timestamp',
        created: 'timestamp',
    },
    index: [
        { type: 'hash', attribute: 'child' },
        { type: 'hash', attribute: 'type' },
        { type: 'range', attribute: 'priority', order: 'asc' }, // pull first
        { type: 'range', attribute: 'parent' },
    ],
    secondaryIndexes: {
        by_parent: [
            { type: 'hash', attribute: 'parent' },
            { type: 'hash', attribute: 'type' },
            { type: 'range', attribute: 'priority', order: 'desc' }, // push first
            { type: 'range', attribute: 'child' },
            { type: 'proj', attribute: 'parent_selector' }, // projection / denormalization for efficient filtering
            { type: 'proj', attribute: 'last_updated' },
        ]
    }
}
```
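To make the two main access patterns concrete, here is a sketch of the edit-time and pull-time lookups as filters over edge rows. The in-memory array is a stand-in for queries against the primary index and the `by_parent` secondary index; field names match the schema above, but the functions themselves are illustrative, not part of the table API.

```javascript
// onEdit: find all dependents of a changed page, as the by_parent
// secondary index would return them: highest priority first (push first).
function onEditQuery(edges, parent, type) {
    return edges
        .filter(e => e.parent === parent && e.type === type)
        .sort((a, b) => b.priority - a.priority);
}

// onPullUpdate: find pending updates for a requested page via the
// primary index: lowest priority first (pull first), and only edges
// with a pending needs_update marker.
function onPullQuery(edges, child, type) {
    return edges
        .filter(e => e.child === child && e.type === type
            && e.needs_update !== null)
        .sort((a, b) => a.priority - b.priority);
}
```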
- `type`: creation / deletion (red links) or modification (most jobs)
- `priority` establishes push / pull via a threshold
- `needs_update` & `last_updated` timestamps support hybrid change propagation
- could use numeric ids instead of urls for space savings; however, it's not clear that this would be worth the extra IO / queries considering compression in C*
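As a sketch of the hybrid propagation decision on a priority threshold: high-priority edges are pushed eagerly, low-priority ones are only marked and picked up lazily on the next pull. The threshold value and function names are illustrative assumptions, not part of the schema.

```javascript
// PUSH_THRESHOLD is an illustrative value, not defined in the schema.
const PUSH_THRESHOLD = 100;

function propagationMode(edge) {
    return edge.priority >= PUSH_THRESHOLD ? 'push' : 'pull';
}

// On a parent change, either update the dependent right away (push)
// or set needs_update so the next read triggers a pull-update.
function onParentChange(edge, now) {
    if (propagationMode(edge) === 'push') {
        // Eager: re-render immediately and record completion.
        return { ...edge, last_updated: now, needs_update: null };
    }
    // Lazy: mark pending; the `needs_update != null` check on pull
    // (see the design sketch above) picks this edge up later.
    return { ...edge, needs_update: now };
}
```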