Our dependency graph forms a directed acyclic graph (DAG), which we need to persist.
Generally, the dependency graph tends to be fairly shallow. Currently, most relationships are modeled as a single level (link tables) without any recursion. While we expect some increase in the depth of the graph as we move to finer-grained content composition, the overall depth should still remain very limited. While potentially useful, native support for recursive queries is therefore not central for the dependency tracking system.
Because of the large number of dependencies we need to manage (up to ~10 million uses for a single template, for example), efficient incremental paging through graph edges is critical. In comparison, recursive graph query functionality is of fairly low importance. In our largest projects, link tables are also starting to get very large. Since most queries are fixed to a title, a sharded key-value storage solution that distributes the graph across many nodes can be a good fit.
Given these requirements, maintaining the dependency graph as a simple adjacency list could be attractive. Using a distributed database like Cassandra, adjacency lists can scale well across many nodes, and can benefit from existing replication setups. Their main disadvantage is the lack of direct support for transitive graph queries, which instead require client-side iteration. However, given the need for control over how large dependency graphs are processed incrementally, this might not be such a bad thing. Twitter seems to use a similar approach, originally based on [Flock](https://github.com/twitter/flockdb), and lists transitive queries as an explicit non-goal of its graph storage system.
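To illustrate, a transitive query can still be layered on top of an adjacency list with iterative, paged fetches on the client side. A minimal sketch, assuming a hypothetical paged `getDependents(parent, { limit, next })` accessor returning `{ items, next }` (not an existing API):

```
// Sketch only: compute all transitive dependents by client-side iteration
// over the adjacency list, fetching one page of edges at a time.
// `getDependents` is a hypothetical paged accessor, not an existing API.
async function transitiveDependents(title, getDependents, pageSize = 500) {
    const seen = new Set([title]);
    const queue = [title];
    while (queue.length) {
        const parent = queue.shift();
        let next = null;
        do {
            const page = await getDependents(parent, { limit: pageSize, next });
            for (const edge of page.items) {
                if (!seen.has(edge.child)) {
                    seen.add(edge.child);
                    queue.push(edge.child); // descend one level deeper
                }
            }
            next = page.next; // continue paging through this parent's edges
        } while (next);
    }
    seen.delete(title);
    return seen;
}
```

Keeping the iteration in the client also gives us the incremental control over batch sizes and scheduling mentioned above.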
## Design sketch
Considerations:
- **on edit**, we are looking for all dependent items (see the sketch after this list)
    - need `priority` for push/pull decision; process updates in priority order
    - need local (source) fragment ids (or some selector blob) to determine a match
    - need destination fragment ids for processing
- **on pull**:
    - check whether anything in the page needs updating (`needs_update != null`)
    - if so, quickly find all items that need to be pull-updated (`child_selector`)
    - order processing by `priority` asc (low prio / pull first)
- **on dependency update**:
    - need an efficient and reliable way to select all previous dependencies in order to diff & update (by url & type)
    - possibly update fragment ids
- **update coalescing**:
    - model all dependencies between two pages for one event type as one edge (selectors as a separate field)
    - if the entire page was re-rendered from scratch at time X, then its fragments are also up to date with X
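A hedged sketch of the on-edit fan-out under the hybrid push/pull model. `listDependents` (paged, ordered by priority desc, as in the `by_parent` index below), `pushUpdate`, and `markNeedsUpdate` are hypothetical helpers, and `PUSH_THRESHOLD` is an assumed tuning constant:

```
// Assumed threshold: dependents at or above it are updated eagerly (push),
// the rest are deferred until the next read (pull).
const PUSH_THRESHOLD = 100;

async function onEdit(parent, type, now, { listDependents, pushUpdate, markNeedsUpdate }) {
    let next = null;
    do {
        const page = await listDependents(parent, type, { next });
        for (const dep of page.items) {
            if (dep.priority >= PUSH_THRESHOLD) {
                // High priority: update eagerly (push), using the selectors
                // to limit re-processing to the affected fragments.
                await pushUpdate(dep.child, dep.child_selector);
            } else {
                // Low priority: defer (pull); setting needs_update lets the
                // next read of dep.child trigger the update.
                await markNeedsUpdate(dep.child, now);
            }
        }
        next = page.next; // page through large fan-outs incrementally
    } while (next);
}
```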
Possible table schema, using [RESTBase table schema syntax](https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md):
```
{
    table: 'dependencies',
    attributes: {
        child: 'string',            // depending page / item
        type: 'int',                // creation / deletion / modification
        priority: 'int',            // push vs. pull decision & processing order
        parent: 'string',           // page / item being depended on
        parent_selector: 'json',    // source fragment selector(s)
        child_selector: 'json',     // destination fragment selector(s)
        needs_update: 'timestamp',  // set while a pull update is pending
        last_updated: 'timestamp',
        created: 'timestamp',
    },
    index: [
        { type: 'hash', attribute: 'child' },
        { type: 'hash', attribute: 'type' },
        { type: 'range', attribute: 'priority', order: 'asc' }, // pull first
        { type: 'range', attribute: 'parent' },
    ],
    secondaryIndexes: {
        by_parent: [
            { type: 'hash', attribute: 'parent' },
            { type: 'hash', attribute: 'type' },
            { type: 'range', attribute: 'priority', order: 'desc' }, // push first
            { type: 'range', attribute: 'child' },
            { type: 'proj', attribute: 'parent_selector' }, // projection / denormalization for efficient filtering
            { type: 'proj', attribute: 'last_updated' },
        ]
    }
}
```
- `type`: creation / deletion (red links) or modification (most jobs)
- `priority` establishes push / pull via a threshold
- `needs_update` & `last_updated` timestamps support hybrid change propagation
- could use numeric ids for space savings instead of urls; however, it's not clear that this would be worth the extra IO / queries, considering compression in C*
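To make these access patterns concrete, the two main read paths against this schema could look roughly like the following query objects. The shapes loosely follow the RESTBase table storage query format, but the specific operators and options shown (`ne`, `order`) are assumptions rather than confirmed API:

```
// Pull path: check whether anything on a page needs updating, lowest
// priority first. The `ne` operator and `order` option are illustrative
// assumptions.
const pullQuery = {
    table: 'dependencies',
    attributes: {
        child: 'Foo',                // example title
        needs_update: { ne: null },  // only pending items
    },
    order: { priority: 'asc' },      // low prio / pull first
};

// Push path: enumerate all dependents of an edited parent via the
// by_parent secondary index, highest priority first.
const pushQuery = {
    table: 'dependencies',
    index: 'by_parent',
    attributes: {
        parent: 'Template:Foo',      // example title
        type: 1,                     // e.g. modification
    },
    order: { priority: 'desc' },     // push first
    limit: 500,                      // page through large fan-outs
};
```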