RFC: Dependency graph storage; sketch: adjacency list in DB
Closed, Invalid · Public

Assigned To

None

Authored By

	• GWicke
	Jul 14 2015, 1:42 AM

Description

Our dependencies form a directed acyclic graph (DAG), which we need to persist. Currently, most relationships are modeled as a single level (link tables) without any recursion. While we expect the depth of the graph to increase somewhat as we move to finer-grained content composition, the overall depth should remain very limited.

Because of the large number of dependencies we need to manage (up to ~10 million uses for a single template, for example), efficient incremental paging through graph edges is critical. In comparison, recursive graph query functionality is of fairly low importance. In our largest projects, link tables are also starting to get very large. Since most queries are fixed to a title, a sharded key-value storage solution that distributes the graph across many nodes can be a good fit.

Given these requirements, maintaining the dependency graph as a simple adjacency list could be attractive. In a distributed database like Cassandra, adjacency lists scale well across many nodes and can benefit from existing replication setups. Their main disadvantage is that transitive graph queries are not directly supported and require client-side iteration. However, given the need for control over how large dependency graphs are processed incrementally, this might not be such a bad thing. Twitter seems to be using a similar approach, originally based on FlockDB, and lists transitive queries as an explicit non-goal of their graph storage system.
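
A minimal sketch of what client-side iteration over an adjacency list could look like, assuming an in-memory edge map in place of the database. The `edges` data, the `fetchDependents` helper and its numeric offset cursor are all hypothetical; a real store would hand back an opaque paging state instead.

```javascript
// Hypothetical in-memory adjacency list standing in for the sharded store.
// Keys are parents; values are the items that depend on them.
const edges = {
  'Template:Infobox': ['Page:A', 'Page:B', 'Page:C'],
  'Page:A': ['Summary:A'],
  'Page:B': ['Summary:B'],
};

// Return one page of direct dependents plus a cursor for the next page.
// (Illustrative helper, not a RESTBase or Cassandra API.)
function fetchDependents(parent, offset, limit) {
  const all = edges[parent] || [];
  const items = all.slice(offset, offset + limit);
  const next = offset + limit < all.length ? offset + limit : null;
  return { items, next };
}

// Breadth-first walk, one page at a time, driven entirely by the client.
function* transitiveDependents(root, pageSize) {
  const seen = new Set([root]);
  const queue = [root];
  while (queue.length > 0) {
    const parent = queue.shift();
    let offset = 0;
    while (offset !== null) {
      const page = fetchDependents(parent, offset, pageSize);
      for (const child of page.items) {
        if (!seen.has(child)) {
          seen.add(child);
          queue.push(child);
          yield child;
        }
      }
      offset = page.next;
    }
  }
}

const visited = [...transitiveDependents('Template:Infobox', 2)];
```

Because the walk is a generator, the caller decides how far and how fast to go, which is the kind of control over incremental processing that the paragraph above argues for.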

Design sketch

Considerations:

On edit, we are looking for all dependent items:
- need priority for push/pull decision; process updates in priority order
- need local (source) fragment ids (or some selector blob) to determine match
- need destination fragment ids for processing
On pull:
- check whether anything in the page needs updating (needs_update != null)
- if so, quickly find all items that need to be pull-updated (child_selector)
- order processing by priority asc (low prio / pull first)
On dependency update:
- need an efficient and reliable way to select all previous dependencies in order to diff & update (by url & type)
- possibly update fragment ids
Update coalescing:
- model all dependencies between two pages for one event type as one edge (selectors as separate field)
- if entire page was re-rendered from scratch at time X, then fragments are also up to date with X
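
The "on dependency update" step could be sketched roughly as follows, assuming that an edge is identified by its parent url and type (one coalesced edge per pair of pages and event type) and that selectors are plain JSON values. The function name, the sample data and the type value `1` are hypothetical.

```javascript
// Identify an edge by parent url and event type, so that all dependencies
// between two pages for one event type collapse into a single edge.
const edgeKey = (e) => `${e.parent}|${e.type}`;

// Compare the stored edges of one child with the edges from a fresh render.
function diffDependencies(previous, current) {
  const prev = new Map(previous.map((e) => [edgeKey(e), e]));
  const curr = new Map(current.map((e) => [edgeKey(e), e]));
  const added = current.filter((e) => !prev.has(edgeKey(e)));
  const removed = previous.filter((e) => !curr.has(edgeKey(e)));
  // Same edge, but the selector (fragment ids) changed: update in place.
  const changed = current.filter((e) => {
    const old = prev.get(edgeKey(e));
    return old !== undefined &&
      JSON.stringify(old.parent_selector) !== JSON.stringify(e.parent_selector);
  });
  return { added, removed, changed };
}

// Hypothetical data: type 1 stands for "modification".
const previous = [
  { parent: 'Template:Infobox', type: 1, parent_selector: ['f1'] },
  { parent: 'Template:Old', type: 1, parent_selector: ['f2'] },
];
const current = [
  { parent: 'Template:Infobox', type: 1, parent_selector: ['f1', 'f3'] },
  { parent: 'Template:New', type: 1, parent_selector: ['f4'] },
];
const diff = diffDependencies(previous, current);
```

Here `diff.added` holds the Template:New edge, `diff.removed` the Template:Old edge, and `diff.changed` the Template:Infobox edge whose fragment ids were updated.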

Possible table schema, using RESTBase table schema syntax:

{
  table: 'dependencies',
  attributes: {
    child: 'string',
    type: 'int',
    priority: 'int',
    parent: 'string',
    parent_selector: 'json',
    child_selector: 'json',
    needs_update: 'timestamp',
    last_updated: 'timestamp',
    created: 'timestamp',
  },
  index: [
    { type: 'hash', attribute: 'child' },
    { type: 'hash', attribute: 'type' },
    { type: 'range', attribute: 'priority', order: 'asc' }, // pull first
    { type: 'range', attribute: 'parent' },
  ],
  secondaryIndexes: {
    by_parent: [
      { type: 'hash', attribute: 'parent' },
      { type: 'hash', attribute: 'type' },
      { type: 'range', attribute: 'priority', order: 'desc' }, // push first
      { type: 'range', attribute: 'child' },
      { type: 'proj', attribute: 'parent_selector' }, // projection / denormalization for efficient filtering
      { type: 'proj', attribute: 'last_updated' },
    ]
  }
}

- type: creation / deletion (red links) or modification (most jobs)
- priority: establishes push / pull via a threshold
- needs_update & last_updated: timestamps for hybrid change propagation
- could use numeric ids instead of urls to save space; however, it's not clear that this would be worth the extra IO / queries, considering compression in Cassandra (C*)
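
One way to read the priority threshold and the two timestamps, as a rough sketch over in-memory rows shaped like the schema above. The threshold value, the rule that priorities at or above it are pushed, and both function names are assumptions, not part of the proposal.

```javascript
// Hypothetical threshold: edges at or above it are pushed, the rest are pulled.
const PUSH_THRESHOLD = 5;

// Plain string comparison, used as a tie-breaker after priority.
const cmp = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// On edit of `parent`: walk its dependents in by_parent order
// (priority descending, then child), pushing or marking for pull.
function propagateEdit(rows, parent, now) {
  const dependents = rows
    .filter((r) => r.parent === parent)
    .sort((a, b) => b.priority - a.priority || cmp(a.child, b.child));
  const pushed = [];
  for (const row of dependents) {
    if (row.priority >= PUSH_THRESHOLD) {
      row.last_updated = now; // stand-in for re-rendering the child right away
      row.needs_update = null;
      pushed.push(row.child);
    } else {
      row.needs_update = now; // defer: the child is refreshed when next requested
    }
  }
  return pushed;
}

// On pull of `child`: everything with needs_update set, lowest priority first.
function pendingPulls(rows, child) {
  return rows
    .filter((r) => r.child === child && r.needs_update !== null)
    .sort((a, b) => a.priority - b.priority || cmp(a.parent, b.parent));
}

const rows = [
  { child: 'Page:A', parent: 'Template:T', type: 1, priority: 9, needs_update: null, last_updated: 0 },
  { child: 'Page:B', parent: 'Template:T', type: 1, priority: 1, needs_update: null, last_updated: 0 },
  { child: 'Page:C', parent: 'Template:T', type: 1, priority: 7, needs_update: null, last_updated: 0 },
];
const pushed = propagateEdit(rows, 'Template:T', 1000);
```

With this data, Page:A and Page:C are pushed immediately, while Page:B only gets `needs_update` set and is picked up later by `pendingPulls(rows, 'Page:B')`.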

Related Objects

Mentioned In:
T253026: Introduce a centralized Dependency Tracking Service
T201004: Spec out dependency engine interface, data structure, and states
T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater
T185233: Modern Event Platform
T174993: Vandalism in "In the news" articles persisting in the app ?
T126641: [RFC] Devise plan for a cross-wiki watchlist back-end
T103429: Investigation: Parser save hook handler does master writes in GETs
T125865: Assign RFCs to ArchCom shepherds
T111819: Services team goals October - December 2015 (Q2 2015/16)
T105975: RFC: Generalize content-addressable POST request storage
T102306: Services team roadmap July - September 2015 (Q1 2015/16)
T102476: RFC: Requirements for change propagation
Mentioned Here:
T105975: RFC: Generalize content-addressable POST request storage
T102476: RFC: Requirements for change propagation

Event Timeline

• GWicke created this task. Jul 14 2015, 1:42 AM

• GWicke raised the priority of this task to Needs Triage.

• GWicke updated the task description.

• GWicke added a project: Services.

• GWicke subscribed.

Restricted Application added a subscriber: Aklapper. Jul 14 2015, 1:42 AM

• GWicke mentioned this in T102476: RFC: Requirements for change propagation. Jul 14 2015, 1:42 AM

• mobrovac subscribed. Jul 14 2015, 8:06 AM

Eevans subscribed. Jul 14 2015, 2:29 PM

• GWicke renamed this task from Sketch for dependency graph storage: simple adjacency list in DB to Sketch for dependency graph storage: adjacency list in DB. Jul 14 2015, 9:50 PM

• GWicke updated the task description.

• GWicke set Security to None.

• GWicke updated the task description. Jul 14 2015, 9:57 PM

• GWicke edited subscribers, added: daniel; removed: Aklapper. Jul 15 2015, 4:59 PM

• GWicke updated the task description. Jul 15 2015, 6:35 PM

• GWicke updated the task description. Jul 15 2015, 7:16 PM

• GWicke updated the task description. Jul 15 2015, 7:51 PM

• GWicke updated the task description. Jul 16 2015, 12:08 AM

• GWicke updated the task description. Jul 16 2015, 12:14 AM

• GWicke mentioned this in T102306: Services team roadmap July - September 2015 (Q1 2015/16). Jul 21 2015, 4:01 PM

• GWicke mentioned this in T105975: RFC: Generalize content-addressable POST request storage. Jul 23 2015, 5:16 PM

In T105975 we realized that we'll also need to propagate suppressions / revdeletions along the dependency graph. It also highlights a use case where dependencies are the natural way to determine whether a bit of content is still needed -- essentially a form of refcounting.

This might mean that we'll need to track dependencies for both current *and* old revisions.

• GWicke renamed this task from Sketch for dependency graph storage: adjacency list in DB to RFC: Dependency graph storage; sketch: adjacency list in DB. Jul 23 2015, 6:00 PM

• GWicke added a project: TechCom-RFC.

• Spage assigned this task to • GWicke. Jul 29 2015, 8:11 PM

• GWicke moved this task from P1: Define to Under discussion on the TechCom-RFC board. Jul 29 2015, 8:20 PM

jcrespo subscribed. Jul 29 2015, 10:00 PM

• Gilles subscribed. Sep 1 2015, 2:34 PM

• GWicke edited subscribers, added: aaron; removed: • Gilles. Sep 1 2015, 2:34 PM

Phabricator evidently doesn't detect subscriber change collisions.

• GWicke mentioned this in T111819: Services team goals October - December 2015 (Q2 2015/16). Sep 9 2015, 5:56 PM

Please have a look at what Wikibase is doing with the wbc_entity_usage table to see if this design would cover the same functionality.
Especially have a look at eu_aspect and note that it uses prefix matching.

Restricted Application added a subscriber: StudiesWorld. Nov 4 2015, 9:53 PM

JanZerebecki subscribed. Dec 23 2015, 1:39 PM

• mobrovac added subscribers: Joe, faidon. Jan 25 2016, 5:57 PM

hoo subscribed. Jan 26 2016, 7:41 PM

• RobLa-WMF mentioned this in T125865: Assign RFCs to ArchCom shepherds. Feb 10 2016, 8:15 PM

daniel mentioned this in T103429: Investigation: Parser save hook handler does master writes in GETs. Mar 16 2016, 7:19 PM

• GWicke removed • GWicke as the assignee of this task. Mar 23 2016, 9:12 PM

Danny_B added a project: Proposal. May 2 2016, 10:15 PM

• RobLa-WMF mentioned this in Unknown Object (Event). May 11 2016, 12:09 AM

Here is a link to the entity_usage schema used by Wikidata. This table is created for each project, and records usage of wikidata items within that project. It already records relatively fine-grained aspect references.

Scott_WUaS subscribed. May 18 2016, 9:27 PM

• RobLa-WMF moved this task from Under discussion to (unused) on the TechCom-RFC board. Jun 8 2016, 7:39 PM

daniel mentioned this in T126641: [RFC] Devise plan for a cross-wiki watchlist back-end. Jul 20 2016, 8:43 PM

• RobLa-WMF triaged this task as Medium priority. Jul 27 2016, 8:13 PM

• GWicke moved this task from Backlog to designing on the Services board. Jul 11 2017, 8:27 PM

• GWicke edited projects, added Services (designing); removed Services.

• mobrovac mentioned this in T174993: Vandalism in "In the news" articles persisting in the app ?. Sep 8 2017, 8:44 AM

Krinkle moved this task from (unused) to Under discussion on the TechCom-RFC board. Dec 22 2017, 12:46 AM

• mobrovac mentioned this in T185233: Modern Event Platform. Jul 5 2018, 10:45 AM

• mobrovac added a project: Platform Team Legacy (Designing). Dec 20 2018, 12:54 PM

Ladsgroup subscribed. Dec 23 2018, 3:50 PM

Gehel mentioned this in T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater. Mar 12 2019, 9:18 AM

Krinkle moved this task from Under discussion to Old on the TechCom-RFC board. Apr 3 2020, 11:34 PM

Tgr mentioned this in T201004: Spec out dependency engine interface, data structure, and states. Apr 28 2020, 3:46 PM

daniel mentioned this in T253026: Introduce a centralized Dependency Tracking Service. May 29 2020, 12:21 PM

Closing this old RFC, which is not yet on our 2020 process and does not appear to have an active owner. Feel free to re-open with our template or file a new one when that changes.

Not Declined.
