Page MenuHomePhabricator

Introduce a centralized Dependency Tracking Service
Open, Needs TriagePublic

Description

Problem: MediaWiki derives secondary representations from the primary content stored in the system, and caches these secondary representations. When the primary content changes, the caches need to be updated. In short, we have one of the two famously hard problems.

The traditional example is: The rendered version of a wiki page (HTML) needs to be updated not just when the pages content (wikitext) is edited, but also whenever any template or module or image used on the page changes; or a template used by a template used by a template. When this happens, the fully HTML page that is cached in the CDN layer must also be updated.

A more advanced example would be: A Wikipedia about a city uses information from Wikidata to show the name of the city's major. The page needs to be updated when the Mayor's name changes.

With increasing complexity and granularity of the content modeled by MediaWiki, and increasing flexibility in the re-use and re-combination of content, the number and complexity of such chains of dependencies increases. Inventing a new system for tracking dependencies, and updating content, for each use case, is not sustainable, and prone to problems especially when the different systems of dependencies start to interact.

Proposed solution: Create a centralized service for tracking dependencies. Any component that derives some kind of content from others should use this service to track the dependency, and be notified when content needs to be regenerated or discarded.

Scale: If used for all existing and anticipated use cases, the service should be able to scale to at least a billion nodes, ten billion edges, and a thousand updates per second. It should be designed for growing by another two orders of magnitude over the next ten years.

Persistence: the information maintained by the dependency service can be rebuilt and is self-healing, but losing all of it would cause stale data to be served for a long time, until all resources have been updated once.

Use Cases

  • update parser cache (or multiple caches of rendered content)
  • replace page_links, template_links, etc for triggering "backlink" updates
  • for purging cdn cache (should integrate with tagging/bucketing support in the cdn layer)
  • replace wikidata change dispatching
  • regenerate cached graphs, diagrams, and maps
  • regenerate cached page extracts ("summaries")
  • trigger recurring jobs
  • update derived data like resource modules, translation tables, etc
  • ...

System Draft

  • The service receives events of the type "resource updated". The event contains at least the URI of the resource, and a list of URIs of resources used (the dependencies). Additional information that may prove useful to have in the event may include a resource type, a timestamp, and a hash. Such events can also be triggered by directly calling an API on the service.
  • The service emits events of the type "resource needs refresh". The event contains at least the URI of the resource. Additional information that may prove useful to have in the event may include a resource type, a timestamp, and a hash. The service also offers an API for polling the status of a given resource.
  • The service stores the dependency graph a way that allows it to be queried in either direction: for any given resource, it can efficiently list all resources it directly depends on, and all resources that directly depend on it. No recursive traversal is needed.
  • The service operates as a state machine for each resource. At a given time, each resource is in one of the following states: clean, dirty, ready, and stub.
    • stub means the resource has been referenced as a dependency, but itself has never been updated (or it has since been deleted).
    • clean means the resource is up to date. That is, nothing it depends on has changed since the resource was last updated.
    • dirty means that the resource (directly) depends on something that has changed since the resources was last updated, and some resources it depends on are not clean (or stub).
    • ready means that the resource (directly) depends on something that has changed since the resources was last updated, and all resources it depends on are clean (or stub).
  • Each resource transitions between the states as follows:
    • When an update event is received for a resource, it is marked as clean if everything it depends on is clean, and older than the resource itself. Otherwise, it is marked as dirty (if not all dependencies are clean) or ready (if all dependencies are clean).
    • When an update event is received for a resource, anything that depends on it and is currently marked as clean is checked to see if it should transition into the dirty state (or directly to ready, see below).
    • When an update event is received for a resource, anything that depends on it and is currently marked as dirty is checked to see if it can now transition to the ready state.
    • When a resource goes into the ready state, a "resource needs refresh" event is emitted.
  • Optimization
    • When a resource has been in the dirty state for too long, it goes into ready even if not all its dependencies are clean. This is to prevent starvation.
    • When a resource has been in the ready state for too long, another "resource needs refresh" event is, based on the assumption that the first event was lost or updating failed for some reason. The number of retries should be limited, a randomized incremental back-off could be used to work around temporary outages.
    • An additional state ("pending") could be used to indicate that a client has committed to updating the resources, so that multiple clients that poll for resources in need of updating can avoid duplicating work.
    • create "ticker" resources that change at regular intervals, to trigger updates for things that need to be updated on a regular basis

Diagram:

(image on commons, image source on Google Docs)

See also:

Event Timeline

daniel created this task.May 18 2020, 2:47 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 18 2020, 2:47 PM
daniel updated the task description. (Show Details)May 18 2020, 3:57 PM
daniel updated the task description. (Show Details)May 21 2020, 10:10 PM
daniel updated the task description. (Show Details)May 29 2020, 12:21 PM
tstarling renamed this task from Introduce a centraliced Dependency Tracking Service to Introduce a centralized Dependency Tracking Service.Jun 4 2020, 9:58 PM

Quick brain dump of state transitions for every resource:

Rough API draft

Amire80 updated the task description. (Show Details)Aug 17 2020, 8:49 PM

@Amire80! You turned my scribblings into a pretty diagram! Thank you!

This would be very useful for everything.
And super useful for all things Wikidata, especially as we see increased usage and more structured data in a similar form, such as StructuredDataOnCommons mediainfo, and other potential up coming things such as collections etc.

@Amire80! You turned my scribblings into a pretty diagram! Thank you!

You're welcome. Whatever I can do to help make this happen :)

Should this task be a blocker of T41610 ?

And should T201526 be a blocker of this task?

Should this task be a blocker of T41610 ?
And should T201526 be a blocker of this task?

Both relations are reasonable, but creating the Dependency Engine is not the only option, and we are not yet committed to it. We'd need a relationship like "would enable" or "would benefit from" :)

daniel updated the task description. (Show Details)Aug 19 2020, 3:31 PM