
[Epic] Rework the WDQS updater as an event driven application
Closed, Resolved · Public

Description

The current merging strategy for applying updates requires sending all of the entity data on every update. The goal of this task is to design a new updater able to send the minimal number of triples to the RDF store to synchronize the graph with the state of wikibase.
Note: this is a very rough plan and many details will probably change as implementation-specific requirements pop up.

The proposed approach relies on a system able to do stateful computation over data streams (Flink).

updater_v2r2.png (diagram attached to the task, 79 KB)

Based on a set of source event streams populated by mediawiki and change propagation, the steps are as follows (a minimal sketch of steps 1-3 is given after this list):

  1. filter: filter events related to wikibase and its entities
  2. event time reordering: reorder the events and assemble them into a single partitioned stream
  3. rev state evaluation: determine what command needs to be applied to mutate the graph
    • this step requires holding state about previously seen revisions and other actions (e.g. visibility changes)
    • the output of this step is a simple event without any data, saying: do a diff between rev X and Y, fully delete entity QXYZ, ...
    • the initial state will be populated using the revisions present in the RDF dump
    • seen revisions (after a fresh import) will be easy to discard
  4. rdf diff generation: materialize the command, fetch the data from wikibase and send it over an RDF stream
    • it is likely that in some cases (e.g. a suppressed delete) the exact set of triples to be deleted will be unknown, and thus a special delete command will have to be applied by the backend
  5. rdf import: The components reading this stream will be very similar to the current updater: a process running locally on the wdqs nodes pushing data to blazegraph
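
As an illustration only, the sketch below shows how steps 1-3 could look as a Flink streaming job. All class names, field names and the filtering rule (RevisionCreateEvent, MutationCommand, matching entity ids starting with "Q", ...) are invented for the example and are not the actual implementation.

```java
// Minimal sketch of steps 1-3 as a Flink streaming job (illustrative only).
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class UpdaterPipelineSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In the real design the source would be the mediawiki/change-prop event streams;
        // a couple of hard-coded events stand in for them here.
        DataStream<RevisionCreateEvent> events = env.fromElements(
                new RevisionCreateEvent("Q42", 2L),
                new RevisionCreateEvent("Q42", 3L));

        events
                // 1. filter: keep only events about wikibase entities
                .filter(e -> e.entityId.startsWith("Q"))
                // 2. partition by entity so that per-entity state and ordering are possible
                .keyBy(e -> e.entityId)
                // 3. rev state evaluation: compare against the last seen revision and emit
                //    a data-less command ("do a diff between rev X and Y")
                .process(new RevStateEvaluation())
                .print();

        env.execute("wdqs-streaming-updater-sketch");
    }

    /** Holds the last seen revision per entity and emits diff commands. */
    public static class RevStateEvaluation
            extends KeyedProcessFunction<String, RevisionCreateEvent, MutationCommand> {
        private transient ValueState<Long> lastRevision;

        @Override
        public void open(Configuration parameters) {
            lastRevision = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("last-revision", Long.class));
        }

        @Override
        public void processElement(RevisionCreateEvent event, Context ctx,
                                   Collector<MutationCommand> out) throws Exception {
            Long previous = lastRevision.value();
            if (previous == null || event.revision > previous) {
                // The command carries no RDF data, only "diff from rev X to rev Y".
                out.collect(new MutationCommand(event.entityId, previous, event.revision));
                lastRevision.update(event.revision);
            }
            // Late or duplicate revisions are simply dropped here; the real design
            // reorders on event time (step 2) before this function runs.
        }
    }

    /** Illustrative input event: an entity was edited, producing a new revision. */
    public static class RevisionCreateEvent {
        public String entityId;
        public long revision;
        public RevisionCreateEvent() {}
        public RevisionCreateEvent(String entityId, long revision) {
            this.entityId = entityId;
            this.revision = revision;
        }
    }

    /** Illustrative output command: "diff entity X from revision A to revision B". */
    public static class MutationCommand {
        public String entityId;
        public Long fromRevision; // null would mean "import from scratch"
        public long toRevision;
        public MutationCommand() {}
        public MutationCommand(String entityId, Long fromRevision, long toRevision) {
            this.entityId = entityId;
            this.fromRevision = fromRevision;
            this.toRevision = toRevision;
        }
        @Override
        public String toString() {
            return "diff " + entityId + ": " + fromRevision + " -> " + toRevision;
        }
    }
}
```

Step 4 would then consume these commands, fetch the relevant revisions from wikibase, and emit the resulting added/removed triples on the RDF stream consumed in step 5.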

For the first iteration no cleanups will be performed: orphaned values & references will remain in the RDF store. This will be mitigated by more frequent reloads of the dump.
Since such a system is prone to deviation, frequent reloads will be important. Note that the state of step 3 is tightly coupled with its dump, so we will have to instantiate a new stream per imported dump. In other words, a wdqs system imported using dump Y will have to consume the RDF stream generated from an initial state based on that same dump, which means the RDF stream will be named against a particular dump instance.
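
To make that coupling concrete, here is one possible way to name the RDF stream after the dump instance it was bootstrapped from. The topic name format and the dump identifier are assumptions made for the example, not the actual convention.

```java
public class RdfStreamNaming {
    /** Builds the RDF stream (topic) name for a given dump instance, e.g. "20201005". */
    static String streamFor(String dumpInstance) {
        // Hypothetical naming scheme: one mutation stream per imported dump.
        return "rdf-streaming-updater.mutation." + dumpInstance;
    }

    public static void main(String[] args) {
        // A wdqs host loaded from dump 20201005 must consume exactly this stream;
        // hosts reloaded from a newer dump switch to the stream named after that dump.
        System.out.println(streamFor("20201005")); // rdf-streaming-updater.mutation.20201005
    }
}
```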

Note on event time reordering:

Note on state management:

Note on initial state:


Event Timeline

  • the output of this is a simple event without any data saying: do a diff between rev X and Y, fully delete entity QXYZ, ...

Is that supposed to be "data saving"?

The reason was not really data saving; the main reason is that we don't know the set of triples to delete. There is no way to ask Wikibase to output the RDF data of a deleted entity.

rdf diff generation: materialize the command and fetch the data from wikibase and send it over a RDF stream

Will that be an uncompressed RDF stream? When mentioning streams in any design it's always best to say compressed/uncompressed for accuracy. You might add a note on streams to mention that and any other nuances about them.

Currently it's uncompressed turtle (which is far from being the best format for such a use case, but it has been the historical format for data exchange in this stack). Note that, if I'm not mistaken, our kafka setup will compress the data transparently. The data model has been made in such a way that in the future we could use other formats such as RDF Thrift or anything else we think more appropriate.
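
For illustration, here is a minimal sketch of a producer publishing uncompressed turtle while letting the Kafka client compress each batch transparently. The broker address, topic name, sample triple and the choice of gzip are assumptions made for the example, not the actual setup.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompressedRdfProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // The payload stays plain turtle; the producer compresses each batch on the wire.
        // Compression can also be enforced broker-side via the topic's compression.type setting.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>(
                    "rdf-streaming-updater.mutation.20201005",   // hypothetical topic name
                    "Q42",
                    "<http://www.wikidata.org/entity/Q42> <http://schema.org/version> \"3\" ."));
        }
    }
}
```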

Gehel renamed this task from EPIC: Rework the WDQS updater as an event driven application to [Epic] Rework the WDQS updater as an event driven application. Oct 6 2020, 2:16 PM
Gehel claimed this task.

Completed!