
[Story] Decide the back-end re-implementation
Closed, Resolved (Public)

Description

Propose to the community a set of solutions to ensure the self-sustainability of the back-end.
The current implementation is in C++.

Possible alternatives, ordered by priority:

  1. PHP, which would enable the re-use of the Wikidata data model;
  2. Node.js;
  3. a WDQS instance, cf. T166501#3320547.

Event Timeline

Hjfocs triaged this task as High priority. Jun 6 2017, 1:57 PM

See T167025#3318577 for how to set up a local Wikidata instance with Vagrant.

It seems that the Wikidata Query Service (WDQS) could also be a good fit, for the following reasons:

  1. uses Blazegraph as the storage engine, cf. T166503;
  2. has facilities to load and upload datasets in the Wikidata RDF dump format;
  3. exposes APIs to access data via SPARQL (specifically useful for both the domain filter and the query text box, cf. T166512); see the query sketch below.
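For instance, the domain filter could be populated with the distinct classes present in the loaded dataset. A minimal sketch, assuming a local instance whose SPARQL endpoint for the wdq namespace sits at http://localhost:9999/bigdata/namespace/wdq/sparql (set up as described below); the property P31 and the endpoint path are illustrative:

# Hypothetical domain filter query: list the classes (values of wdt:P31)
# occurring in the loaded dataset, with their usage counts
curl -sG http://localhost:9999/bigdata/namespace/wdq/sparql \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?class (COUNT(?item) AS ?items)
WHERE { ?item wdt:P31 ?class }
GROUP BY ?class
ORDER BY DESC(?items)
LIMIT 50'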

Installation instructions, drawn from the WDQS user manual and the getting started guide:

  1. Download the latest packaged version from Maven Central and unzip it:*
wget -O wdqs.zip "http://search.maven.org/remotecontent?filepath=org/wikidata/query/rdf/service/0.2.4/service-0.2.4-dist.zip"
unzip -d wdqs wdqs.zip
cd wdqs
  2. Download the latest Wikidata gzipped RDF Turtle dump:
mkdir -p data/chunks
wget -O data/wikidata.ttl.gz https://dumps.wikimedia.org/wikidatawiki/entities/20170529/wikidata-20170529-all-BETA.ttl.gz
  3. Pre-process the dump:
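# Assumption: -l it keeps only Italian labels and descriptions, -s skips site links, cf. the getting started guide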
./munge.sh -f data/wikidata.ttl.gz -d data/chunks -l it -s
  4. Start Blazegraph (in the background):
./runBlazegraph.sh &
  5. Load a single data chunk (loading the whole Wikidata dump is computationally expensive):
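# Assumption: -n selects the target Blazegraph namespace (wdq), -d points to the chunk to load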
./loadRestAPI.sh -n wdq -d `pwd`/data/chunks/wikidump-000000001.ttl.gz
  6. Blazegraph is now ready for queries via its web query interface: http://localhost:9999/bigdata/#query

*N.B.: as of today, compiling from source fails because the blazegraph-2.1.5-SNAPSHOT dependencies are missing from the remote repositories.
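To check that the chunk was actually loaded, the SPARQL endpoint of the wdq namespace can be queried directly. A minimal sketch, assuming the default endpoint path:

# Count the triples loaded into the wdq namespace
curl -sG http://localhost:9999/bigdata/namespace/wdq/sparql \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }'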

If we use RDF, we could feed the primary sources list/filter sub-tool with truthy statements: once the sanity of a given RDF dataset has been checked via the data model validator, the response of the queried WDQS SPARQL endpoint can be serialized into an HTML table, as sketched below.
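A minimal sketch of that serialization step, assuming the local endpoint above honors Accept: text/csv for SELECT results; the property P31 and the naive CSV handling are purely illustrative:

# Fetch truthy statements (direct wdt: claims) as CSV and wrap them into an HTML table.
# Naive sketch: it handles neither quoted commas nor HTML escaping.
ENDPOINT=http://localhost:9999/bigdata/namespace/wdq/sparql
QUERY='
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?value WHERE { ?item wdt:P31 ?value } LIMIT 100'
curl -sG "$ENDPOINT" -H 'Accept: text/csv' --data-urlencode "query=$QUERY" \
| awk -F',' '
    NR == 1 { printf "<table>\n<tr>"; for (i = 1; i <= NF; i++) printf "<th>%s</th>", $i; print "</tr>"; next }
            { printf "<tr>"; for (i = 1; i <= NF; i++) printf "<td>%s</td>", $i; print "</tr>" }
    END     { print "</table>" }'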

On the other hand, the per-item tool would support full statements, i.e. statement nodes with qualifiers and references; see the query sketch below.
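For reference, full statements are reached through the p:/ps: paths of the Wikidata RDF model rather than the truthy wdt: shortcut. A sketch of the corresponding query pattern (P31 is again just an example):

# Full statement nodes, including their references, for an illustrative property (P31)
curl -sG http://localhost:9999/bigdata/namespace/wdq/sparql \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=
PREFIX p:    <http://www.wikidata.org/prop/>
PREFIX ps:   <http://www.wikidata.org/prop/statement/>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?item ?statement ?value ?reference
WHERE {
  ?item p:P31 ?statement .
  ?statement ps:P31 ?value .
  OPTIONAL { ?statement prov:wasDerivedFrom ?reference }
}
LIMIT 20'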

We still need to investigate which Wikibase data model implementation to use, mainly for the data model validator, cf. T167030.

We assume that the following objects are considered stable, as they are claimed to be subject to the Wikidata stable interface policy:

On the other hand, there is no guarantee for the following Wikibase data model implementations:

We assume, however, that they can at least cater for a subset of the data model, cf. the extensibility principle.

The proposed solution is a WDQS instance with RDF data model validation.
T167014 will include the full proposal.