Page MenuHomePhabricator

Adapt the WDQS Streaming Updater to update multiple WDQS subgraphs
Open, HighPublic

Description

In order to support updating the subgraphs defined in Wikidata:SPARQL_query_service/WDQS_graph_split the streaming updater must be adapted to produce the right mutations for a given subgraph.

General Idea

Right after fetching the entity content from Special:EntityData and before generating the diff a new component will be added to apply the set of rules defining the subgraph and will populate a stream per subgraph:

  • rdf-streaming-updater.mutations will remain and will contain all the mutations to update the existing setup
  • rdf-streaming-updater.mutations-main will be added and will contain mutations related to the main
  • rdf-streaming-updater.mutations-scholarly will be added and will contain mutations related to the scholarly

The consumer component on the other end of the stream might require very few (no?) modifications and might just work by configuring the kafka topic it should consume.

The ticket describes the general case to support an arbitrary number of subgraphs, there could be certainly a simpler way to handle this with the 2 subgraphs special case but it might error prone and less future proof so we should attempt to solve the general case. If difficulties are found while implementing it we can always reconsider and give up on the general case.

Stubs

Stubs are artificial triples (explained in WDQS_Split_Refinement#Add_triples_to_help_navigate_between_the_subgraphs) that will take the following form:

wd:Q42 wikibaseqs:subgraph wdsubgraph:main
  • wikibaseqs (with suggested IRI: http://wikiba.se/queryservice#) a new namespace to hold the vocabulary of things related to the wdqs codebase and will probably be hard-coded there, subgraph will be the first and only term required at the moment
  • wdsubgraph (with suggested IRI: https://query.wikidata.org/subgraph/) a new namespace to hold the IRIs identifying subgraphs in the scope wikidata query service. It will be setup in the prefixes.json config file.

Rules

The rules should be extremely simple to apply and will require only the data available locally after fetch the entity content. How the rules are expressed is up for discussion but could be a simple yaml file:

subgraphs:
  - name: scholarly
    subgraph_uri: wdsubgraph:scholarly
    stream: rdf-streaming-updater.mutations-scholarly
    default: block
    rules:
      - pass ?entity wdt:P31 wd:Q13442814
      - pass ?entity rdfs:type wikibase:Property
    stubs_source: true
  - name: main
    subgraph_uri: wdsubgraph:main
    stream: rdf-streaming-updater.mutations-main
    default: pass
    rules:
      - block ?entity wdt:P31 wd:Q13442814
    stubs_source: true
  - name: full
    stream: rdf-streaming-udpater.mutations
    default: pass
    stubs_source: false
  • rules are prefixed with pass or block telling what to do if evaluated to true
  • ?entity will be replaced by the entity URI being updated
  • [] means any rdf literal

The rules are applied in order and stop at the first match, if it's a pass the entity should enter this subraph if it's a block it should not pass. If no rule matches then entity uses the default setting to decide to either pass or block.
The stubs_source attribute will determine if this graph is OK to be linked from a stub triple when an entity is blocked from another subgraph.

Rules Outcome

Applying the rules will only answer the question: does this entity revision belong to this graph?
The actual set of mutations to apply may also vary depending of the type of MutationOperation to apply.

Diff

When a diff is required the rules will be applied from both the previous and next revision.

  • when an entity enters a subgraph:
    • Diff: to remove stubs
    • FullImport: the entity is fully imported
  • when an entity leaves a subgraph:
    • DeleteItem: the entity is fully delete
    • Diff: to add the stubs
  • when an entity stays in the subgraph
    • Diff: simple diff
  • when an entity stays outside the subgraph
    • Diff: special case for diffing stubs

FullImport

  • the entity matches a subgraph
    • normal FullImport
  • the entity does not match a subgraph
    • Diff to import the stubs

DeleteItem
No need to apply the rules: the DeleteItem should be propagated to all subgraphs (deletions are quite rare). Note that DeleteItem can be coming from a reconciliation, this should not change the outcome in such cases.

Reconcile

  • the entity matches a subgraph
    • normal Reconcile
  • the entity does not match a subgraph
    • DeleteItem to cleanup
    • Diff to import the stubs

Caveats: stubs are added/removed by abusing the Diff operation, while it does seem to work in principle it will, in the case it is combined with a FullImport or a Delete, stop the compression of patches on the consumer side. Since this situation happens only on edge cases: entering/leaving a graph and reconciliations we can assume that these are rare enough and do not deserve a special case.

Notes

Model Changes
Beside the changes required in the internal class model of the streaming-udpater-producer flink job there appears to be no need to do any changes to MutationEventData the data class used to represent the event between the producer and the consumer.
But for clarity we might still consider changing it to introduce a new field representing the subgraph IRI (optional). It is not strictly necessary since the events related to the different subgraphs should flow in different streams but could help clarify the target of an event just reading its content.

Lag propagation

Keeping stubs up to date will also have the effect of propagating lag information across all subgraphs. Even in the case of a Diff resulting in no changes in the subgraph members a diff to update the stubs (most likely resulting in a noop) will be emitted. In other words for any kind of input events to the streaming-updater-producer there should be at least one output event for each subgraphs stream.

AC:

  • The streaming-udpater-producer is able to produce multiple streams based on configuration file defining the nature of multiple subgraphs
  • The streaming-updater-consumer is able to consume one of these streams and update its underlying blazegraph db

Event Timeline

dcausse updated the task description. (Show Details)

Change #1018323 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] Add a basic model for defining sugraphs

https://gerrit.wikimedia.org/r/1018323

Gehel triaged this task as High priority.Apr 15 2024, 1:22 PM

Change #1018323 merged by jenkins-bot:

[wikidata/query/rdf@master] Add a basic model for defining subgraphs

https://gerrit.wikimedia.org/r/1018323

Change #1033386 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[wikidata/query/rdf@master] Introducing SubgraphRuleMatcher

https://gerrit.wikimedia.org/r/1033386