Page MenuHomePhabricator

RFC: Dependency graph storage; sketch: adjacency list in DB
Closed, InvalidPublic

Description

Our dependency graph forms a directed acyclic graph (DAG), which we need to persist. Currently, most relationships are modeled as a single level (link tables) without any recursion. While we expect some increase in the depth of the graph as we are moving to finer-grained content composition, the overall depth should still remain very limited.

Because of the large number of dependencies we need to manage (up to ~10 million uses for a single template, for example), efficient incremental paging through graph edges is critical. In comparison, recursive graph query functionality is of fairly low importance. In our largest projects, link tables are also starting to get very large. Since most queries are fixed to a title, a sharded key-value storage solution that distributes the graph across many nodes can be a good fit.

Given these requirements, maintaining the dependency graph as a simple adjacency list could be attractive. Using a distributed database like Cassandra, adjacency lists can scale well by using many nodes, and can benefit from existing replication setups. Their main disadvantage is the absence of direct support for transitive graph queries without client-side iteration. However, given the need for control over how large dependency graphs are processed incrementally, this might not be such a bad thing. Twitter seems to be using a similar approach originally based on Flock, listing transitive queries as an explicit non-goal of their graph storage system.

Design sketch

Considerations:

  • on edit, we are looking for all dependent items
    • need priority for push/pull decision; process updates in priority order
    • need local (source) fragment ids (or some selector blob) to determine match
    • need destination fragment ids for processing
  • on pull:
    • check whether anything in the page needs updating (needs_update != null)
    • if so, quickly find all items that need to be pull-updated (child_selector)
    • order processing by priority asc (low prio / pull first)
  • on dependency update:
    • need an efficient and reliable way to select all previous dependencies in order to diff & update (by url & type)
    • possibly update fragment ids
  • update coalescing
    • model all dependencies between two pages for one event type as one edge (selectors as separate field)
    • if entire page was re-rendered from scratch at time X, then fragments are also up to date with X

Possible table schema, using RESTBase table schema syntax:

{
  table: 'dependencies',
  attributes: {
    child: 'string',
    type: 'int',
    priority: 'int',
    parent: 'string',
    parent_selector: 'json'
    child_selector: 'json',
    needs_update: 'timestamp',
    last_updated: 'timestamp',
   created: 'timestamp',
  },
  index: [
    { type: 'hash', attribute: 'child' },
    { type: 'hash', attribute: 'type' },
    { type: 'range', attribute: 'priority', order: 'asc' }, // pull first
    { type: 'range', attribute: 'parent' },
  ],
  secondaryIndexes: {
    by_parent: [
      { type: 'hash', attribute: 'parent' },
      { type: 'hash', attribute: 'type' },
      { type: 'range', attribute: 'priority', order: 'desc' }, // push first
      { type: 'range', attribute: 'child' },
      { type: 'proj', attribute: 'parent_selector' }, // projection / denormalization for efficient filtering
      { type: 'proj', attribute: 'last_updated' },
    ]
  }
}
  • type: creation / deletion (red links) or modification (most jobs)
  • priority establishes push / pull via threshold
  • needs-update & updated timestamps for hybrid change propagation
  • could use numeric ids for space savings instead of urls; however, it's not clear that it would be worth the extra IO / queries considering compression in C*

See also

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description. (Show Details)
GWicke added a project: Services.
GWicke subscribed.
GWicke renamed this task from Sketch for dependency graph storage: simple adjacency list in DB to Sketch for dependency graph storage: adjacency list in DB.Jul 14 2015, 9:50 PM
GWicke updated the task description. (Show Details)
GWicke set Security to None.

In T105975 we realized that we'll also need to propagate suppressions / revdeletions along the dependency graph. It also highlights a use case where dependencies are the natural way to determine whether a bit of content is still needed -- essentially a form of refcounting.

This might mean that we'll need to track dependencies for both current *and* old revisions.

GWicke renamed this task from Sketch for dependency graph storage: adjacency list in DB to RFC: Dependency graph storage; sketch: adjacency list in DB.Jul 23 2015, 6:00 PM
GWicke added a project: TechCom-RFC.

Phabricator evidently doesn't detect subscriber change collisions.

Please have a look what Wikibase is doing with the wbc_entity_usage table to see if this design would cover the same functionality.
Especially have a look at eu_aspect and note that it uses prefix matching.

RobLa-WMF mentioned this in Unknown Object (Event).May 11 2016, 12:09 AM

Here is a link to the entity_usage schema used by Wikidata. This table is created for each project, and records usage of wikidata items within that project. It already records relatively fine-grained aspect references.

Krinkle subscribed.

Closing old RFC that is not yet on to our 2020 process and does not appear to have an active owner. Feel free to re-open with our template or file a new one when that changes.

Jdforrester-WMF changed the task status from Declined to Invalid.Sep 16 2020, 8:16 PM
Jdforrester-WMF subscribed.

Not Declined.