In order to support updating the subgraphs defined in [[https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split|Wikidata:SPARQL_query_service/WDQS_graph_split]] the streaming updater must be adapted to produce the right mutations for a given subgraph.
== General Idea ==
Right after fetching the entity content from `Special:EntityData` and before generating the diff a new component will be added to apply the set of rules defining the subgraph and will populate a stream per subgraph:
- `rdf-streaming-updater.mutations` will remain and will contain all the mutations to update the existing setup
- `rdf-streaming-updater.mutations-main` will be added and will contain mutations related to the `main`
- `rdf-streaming-updater.mutations-scholarly` will be added and will contain mutations related to the `scholarly`
The consumer component on the other end of the stream might require very few (no?) modifications and might just work by configuring the kafka topic it should consume.
The ticket describes the general case to support an arbitrary number of subgraphs, there could be certainly a simpler way to handle this with the 2 subgraphs special case but it might error prone and less future proof so we should attempt to solve the general case. If difficulties are found while implementing it we can always reconsider and give up on the general case.
== Stubs ==
//Stubs// are artificial triples (explained in [[https://www.wikidata.org/wiki/User:DCausse_(WMF)/WDQS_Split_Refinement#Add_triples_to_help_navigate_between_the_subgraphs|WDQS_Split_Refinement#Add_triples_to_help_navigate_between_the_subgraphs]]) that will take the following form:
```lang=ttl
wd:Q42 wikibaseqs:subgraph wdsubgraph:main
```
* `wikibaseqs` (with suggested IRI: `http://wikiba.se/queryservice#`) a new namespace to hold the vocabulary of things related to the wdqs codebase and will probably be hard-coded there, `subgraph` will be the first and only term required at the moment
* `wdsubgraph` (with suggested IRI: `https://query.wikidata.org/subgraph/`) a new namespace to hold the IRIs identifying subgraphs in the scope wikidata query service. It will be setup in the `prefixes.json` config file.
== Rules ==
The rules should be extremely simple to apply and will require only the data available locally after fetch the entity content. How the rules are expressed is up for discussion but could be a simple yaml file:
```lang=yaml
subgraphs:
- scholarly:
subgraph_uri: "https://query.wikidata.org/subgraph/scholarly"
stream: "rdf-streaming-updater.mutations-scholarly"
default: block
rules:
- "pass ?entity wdt:P31 wd:Q13442814"
- "pass ?entity rdf:type wikibase:Property"
stubs_source: true
- main:
subgraph_uri: "https://query.wikidata.org/subgraph/main"
stream: "rdf-streaming-updater.mutations-main"
default: pass
rules:
- "block ?entity wdt:P31 wd:Q13442814"
stubs_source: true
- full:
stream: "rdf-streaming-udpater.mutations"
default: pass
stubs_source: false
```
- rules are prefixed with `pass` or `block` telling what to do if evaluated to true
- `?entity` will be replaced by the entity URI being updated
- `[]` means any rdf literal
The rules are applied in order and stop at the first match, if it's a `pass` the entity should enter this subraph if it's a block it should not pass. If no rule matches then entity uses the `default` setting to decide to either `pass` or `block`.
The `stubs_source` attribute will determine if this graph is OK to be linked from a stub triple when an entity is blocked from another subgraph.
== Rules Outcome ==
Applying the rules will only answer the question: //does this entity revision belong to this graph?//
The actual set of mutations to apply may also vary depending of the type of [[https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/rdf/+/refs/heads/master/streaming-updater-producer/src/main/scala/org/wikidata/query/rdf/updater/MutationOperation.scala|MutationOperation]] to apply.
**Diff**
When a diff is required the rules will be applied from both the previous and next revision.
- when an entity enters a subgraph:
-- Diff: to remove stubs
-- FullImport: the entity is fully imported
- when an entity leaves a subgraph:
-- DeleteItem: the entity is fully delete
-- Diff: to add the stubs
- when an entity stays in the subgraph
-- Diff: simple diff
- when an entity stays outside the subgraph
-- Diff: special case for diffing stubs
**FullImport**
- the entity matches a subgraph
-- normal FullImport
- the entity does not match a subgraph
-- Diff to import the stubs
**DeleteItem**
No need to apply the rules: the DeleteItem should be propagated to all subgraphs (deletions are quite rare). Note that DeleteItem can be coming from a reconciliation, this should not change the outcome in such cases.
**Reconcile**
- the entity matches a subgraph
-- normal Reconcile
- the entity does not match a subgraph
-- DeleteItem to cleanup
-- Diff to import the stubs
Caveats: stubs are added/removed by //abusing// the Diff operation, while it does seem to work in principle it will, in the case it is combined with a FullImport or a Delete, [[https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/rdf/+/refs/heads/master/streaming-updater-consumer/src/main/java/org/wikidata/query/rdf/updater/consumer/PatchAccumulator.java#134|stop the compression]] of patches on the consumer side. Since this situation happens only on edge cases: entering/leaving a graph and reconciliations we can assume that these are rare enough and do not deserve a special case.
== Notes ==
**Model Changes**
Beside the changes required in the internal class model of the streaming-udpater-producer flink job there appears to be no need to do any changes to [[https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/rdf/+/refs/heads/master/streaming-updater-common/src/main/java/org/wikidata/query/rdf/updater/MutationEventData.java|MutationEventData]] the data class used to represent the event between the producer and the consumer.
But for clarity we might still consider changing it to introduce a new field representing the subgraph IRI (optional). It is not strictly necessary since the events related to the different subgraphs should flow in different streams but could help clarify the target of an event just reading its content.
**Lag propagation**
Keeping stubs up to date will also have the effect of propagating lag information across all subgraphs. Even in the case of a Diff resulting in no changes in the subgraph members a diff to update the stubs (most likely resulting in a noop) will be emitted. In other words for any kind of input events to the streaming-updater-producer there should be at least one output event for each subgraphs stream.
AC:
- The `streaming-udpater-producer` is able to produce multiple streams based on configuration file defining the nature of multiple subgraphs
- The `streaming-updater-consumer` is able to consume one of these streams and update its underlying blazegraph db