
Evaluate whether RDF Delta is a good idea to have in the backend
Open, Low, Public

Event Timeline

We should probably start with a problem we're trying to solve. What would that be for this task?

Good idea.

So, as a data consumer, I want to know which triples have changed between two Wikidata dumps.

As an enterprise company, I want to replicate Wikidata's triple store in-house, and therefore consume the RDF Delta to run queries on our own infrastructure.

Does that sound reasonable?
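
To make the first user story concrete, here is a minimal sketch of what "which triples changed between two dumps" could mean. It is only an illustration, not how RDF Delta works internally: it assumes both dumps are in N-Triples with one triple per line, that the same triple is serialized identically in both, and that everything fits in memory (which a full Wikidata dump will not). Blank nodes are not handled. The function names are made up for this example.

```python
def load_triples(lines):
    """Collect non-empty, non-comment N-Triples lines into a set."""
    triples = set()
    for raw in lines:
        line = raw.strip()
        if line and not line.startswith("#"):
            triples.add(line)
    return triples


def diff_dumps(old_lines, new_lines):
    """Return (added, deleted) triples between two dumps.

    Assumption: textual equality of a line means equality of the triple.
    """
    old = load_triples(old_lines)
    new = load_triples(new_lines)
    return new - old, old - new


if __name__ == "__main__":
    old_dump = ['<s> <p> "a" .', '<s> <p> "b" .']
    new_dump = ['<s> <p> "b" .', '<s> <p> "c" .']
    added, deleted = diff_dumps(old_dump, new_dump)
    print("added:", added)
    print("deleted:", deleted)
```

The point of a delta feed would be to publish the added and deleted sets directly, so consumers do not have to recompute them from two full dumps.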

Gehel triaged this task as Medium priority. Sep 2 2021, 1:23 PM
Gehel moved this task from Incoming to Scaling on the Wikidata-Query-Service board.

RDF-Delta can be used with Jena-Fuseki to replicate Jena DBs across servers for HA. I will be investigating it as part of the Blazegraph Alternatives analysis.

If I understand it correctly, RDF-Delta itself requires a replicable store for HA?

I do not believe so, but will investigate it further. You CAN also use ZooKeeper to create a high-availability RDF-Delta service.

@nguyenm9 - RDF Delta adds replication of updates across a number of stores (Apache Jena Fuseki).

@AWesterinen I suppose the concern would be this: if a machine were taken out of rotation for maintenance and there were lots of updates in the meantime, can ZooKeeper reliably store all of them? Again, if I understand RDF-Delta correctly, it journals all the updates so that a machine can replay that journal on demand. That journal needs to be stored in an HA and durable way.

ZooKeeper keeps the index; the HA storage is usually something with an S3 interface, used as a blob store.
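
A rough sketch of that split, in case it helps the discussion: a small ordered index (the part ZooKeeper would hold) and a separate blob store for the patch payloads (the part an S3-like service would hold). Everything here is hypothetical and in-memory; the class and method names are not RDF Delta's API, and plain Python containers stand in for both services.

```python
class PatchLog:
    """Toy patch log: a small ordered index plus a separate payload store."""

    def __init__(self):
        self.index = []  # ordered patch ids; stand-in for the ZooKeeper index
        self.blobs = {}  # patch id -> payload; stand-in for an S3-like blob store

    def append(self, patch_id, payload):
        """Store the payload first, then publish it in the index.

        Returns the new log version (1-based count of patches).
        """
        self.blobs[patch_id] = payload
        self.index.append(patch_id)
        return len(self.index)

    def since(self, version):
        """Return [(version, payload), ...] for all patches after `version`."""
        return [
            (v, self.blobs[patch_id])
            for v, patch_id in enumerate(self.index[version:], start=version + 1)
        ]


if __name__ == "__main__":
    log = PatchLog()
    log.append("p1", "first patch")
    log.append("p2", "second patch")
    log.append("p3", "third patch")
    # A replica that stopped at version 1 asks for everything it missed.
    print(log.since(1))
```

In this picture the index stays small no matter how many updates pile up during maintenance, while the durability question moves to the blob store.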

In addition:
As a Wikimedia sysop I want to keep a cluster of Jena Fuseki servers up to date, so I set them up to follow a master patch server and consume the RDF Delta ONLY from there.

(I am wildly guessing here. I have no idea whether RDF Delta has any value in-house, since we already have the new streaming updater, but it might scale better or provide benefits over what we have now.)

Or
As a Wikimedia sysop I have to take a machine out for maintenance and afterwards it is missing 3 days of updates, so I want to use RDF Delta from a patch server to bring it up to date without having to copy all the triples from scratch.

See https://afs.github.io/rdf-delta/ha-fuseki.html for details
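
The catch-up scenario could look roughly like the sketch below. It is loosely modeled on the add (`A`) and delete (`D`) rows of the RDF Patch format and ignores every other row type (headers, transaction markers). The replica is just a Python set of triple strings, and `catch_up` and its `(version, patch_text)` input are assumptions for illustration, not the RDF Delta client API.

```python
def apply_patch(triples, patch_text):
    """Apply 'A <triple>' and 'D <triple>' rows to a set of triple strings."""
    for raw in patch_text.splitlines():
        line = raw.strip()
        if line.startswith("A "):
            triples.add(line[2:].strip())
        elif line.startswith("D "):
            triples.discard(line[2:].strip())
        # Any other row type is skipped in this sketch.
    return triples


def catch_up(triples, replica_version, patches):
    """Replay missed patches in order and return the new replica version.

    `patches` is an iterable of (version, patch_text) pairs. A gap in the
    version sequence raises ValueError instead of being applied silently.
    """
    for version, patch_text in sorted(patches):
        if version <= replica_version:
            continue  # already applied before the machine went offline
        if version != replica_version + 1:
            raise ValueError(f"missing patch {replica_version + 1}")
        apply_patch(triples, patch_text)
        replica_version = version
    return replica_version


if __name__ == "__main__":
    replica = {'<s> <p> "a" .'}
    missed = [
        (2, 'A <s> <p> "b" .'),
        (3, 'D <s> <p> "a" .\nA <s> <p> "c" .'),
    ]
    print(catch_up(replica, 1, missed), replica)
```

The machine only needs the patches after its last applied version, which is what avoids reloading all the triples from scratch.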

This comment was removed by So9q.

Since updates are not part of the WDQS feature set (no SPARQL INSERT/DELETE), the questions are whether RDF-Delta 1) has value for users with local installs and/or 2) improves on the new streaming updater.

MPhamWMF lowered the priority of this task from Medium to Low. Mar 29 2022, 1:33 PM