
Generate RDF from JSON
Open, Stalled, Medium, Public

Description

Instead of generating RDF dumps from the database, have a maintenance script that reads a JSON dump and generates RDF output from it. This would allow us to generate consistent RDF dumps for various scopes, flavors, and formats, with consistent data. It is also likely to be faster than loading entities from the external storage database (depending on FS access details).
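For illustration only, here is a minimal sketch of the general idea: stream the entity-per-line JSON dump and emit triples for each entity. This is Python rather than the PHP maintenance script the task is about, and the mapping covers labels only as a placeholder for the full Wikibase RDF mapping (statements, sitelinks, qualifiers, references, etc.).

```python
# Minimal sketch (not the Wikibase implementation): stream a Wikidata-style
# JSON dump (one JSON entity per line, wrapped in "[" ... "]") and emit a
# few N-Triples per entity. The label-only mapping is a placeholder.
import gzip
import json
import sys

WD = "http://www.wikidata.org/entity/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

def entity_to_ntriples(entity):
    """Yield N-Triples lines for one entity document (labels only here)."""
    subject = f"<{WD}{entity['id']}>"
    for lang, label in entity.get("labels", {}).items():
        value = label["value"].replace("\\", "\\\\").replace('"', '\\"')
        yield f'{subject} <{RDFS_LABEL}> "{value}"@{lang} .'

def dump_to_rdf(dump_path, out=sys.stdout):
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue  # skip the array brackets around the dump
            entity = json.loads(line)
            for triple in entity_to_ntriples(entity):
                print(triple, file=out)

if __name__ == "__main__":
    dump_to_rdf(sys.argv[1])
```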

Event Timeline

daniel raised the priority of this task to Needs Triage.
daniel updated the task description. (Show Details)
daniel added a project: Wikidata.
daniel added a subscriber: daniel.
Lydia_Pintscher set Security to None.
JanZerebecki lowered the priority of this task from High to Medium. Jul 23 2015, 3:07 PM
Smalyshev changed the task status from Open to Stalled. Apr 4 2018, 8:31 PM

As far as I can see, nothing has happened here for a year. Moreover, the subtasks have also been dormant for a year, so it seems to be stalled. But if I'm wrong and something is happening here, please reclassify.

I think Wikidata-Toolkit could be used for that:
https://github.com/Wikidata/Wikidata-Toolkit/blob/master/wdtk-rdf/src/main/java/org/wikidata/wdtk/rdf/RdfSerializer.java
Obviously it would mean making sure that the RDF serialization it produces is consistent with what is currently being fed into WDQS.
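As a rough illustration of such a consistency check, the output of the two serializers for a single entity could be compared as RDF graphs, e.g. with rdflib. The file names below are hypothetical, and a real check would have to iterate over many entities rather than load a full dump into memory.

```python
# Sketch of comparing two RDF serializations of the same entity
# (e.g. Wikidata-Toolkit output vs. the Wikibase dumper output).
from rdflib import Graph
from rdflib.compare import isomorphic, graph_diff

g_toolkit = Graph().parse("Q42.toolkit.ttl", format="turtle")
g_wikibase = Graph().parse("Q42.wikibase.ttl", format="turtle")

if isomorphic(g_toolkit, g_wikibase):
    print("serializations are equivalent")
else:
    _, only_toolkit, only_wikibase = graph_diff(g_toolkit, g_wikibase)
    print("only in toolkit output:")
    print(only_toolkit.serialize(format="nt"))
    print("only in wikibase output:")
    print(only_wikibase.serialize(format="nt"))
```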

The analytics hadoop cluster could also be of use here: the task can easily take advantage of parallelization.
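A hedged sketch of what such a parallel conversion could look like with PySpark, assuming the dump is available on HDFS as one entity per line; the paths and the per-entity mapping are placeholders, not the actual Wikibase RDF mapping.

```python
# Rough sketch of parallelizing the JSON -> RDF conversion with Spark.
# Paths and the mapping function are placeholders; a real job would have to
# reproduce the full Wikibase RDF mapping (a drawback discussed later in
# this task: the mapping would then live in more than one codebase).
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-dump-to-rdf").getOrCreate()

def entity_to_ntriples(entity):
    """Placeholder mapping: label triples only, as in the earlier sketch."""
    subject = f"<http://www.wikidata.org/entity/{entity['id']}>"
    for lang, label in entity.get("labels", {}).items():
        value = label["value"].replace("\\", "\\\\").replace('"', '\\"')
        yield (f'{subject} <http://www.w3.org/2000/01/rdf-schema#label> '
               f'"{value}"@{lang} .')

lines = spark.sparkContext.textFile("hdfs:///path/to/wikidata-all.json")
triples = (lines
           .map(lambda l: l.strip().rstrip(","))
           .filter(lambda l: l not in ("[", "]", ""))
           .map(json.loads)
           .flatMap(entity_to_ntriples))
triples.saveAsTextFile("hdfs:///path/to/output.nt")
```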

Change 617153 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Experimental support for creating dumps from JSON dumps

https://gerrit.wikimedia.org/r/617153

Addshore added a subscriber: Addshore.

The analytics hadoop cluster could also be of use here: the task can easily take advantage of parallelization.

Indeed, and it already gets the JSON dumps loaded into it.
However, this would mean keeping the JSON -> RDF mapping in multiple places (both in PHP in Wikibase & elsewhere to interface with Hadoop).
Though that sounds like something we could deal with in some way?

Info: there is already a job in the cluster doing the TTL -> RDF conversion. The TTL dumps are imported weekly and converted to Blazegraph RDF once available.
The job is maintained by the Search Platform team (ping @dcausse :).

Indeed, the RDF data is available in the Hive table discovery.wikibase_rdf, but it is generated by reading the TTL dumps, so it might not help for this particular task.
Using Hadoop would indeed allow us to process the JSON efficiently, but it has the drawbacks already pointed out:

  • it requires maintaining the Wikibase -> RDF projection in multiple codebases (PHP Wikibase & Spark)
  • once created on the Hadoop cluster, the output would have to be pushed back to the labstore machine for public consumption, which might add extra delay
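For reference, a minimal way to look at the existing data in the discovery.wikibase_rdf table mentioned above, without assuming anything about its schema (only the table name is taken from this task; everything else is a generic Spark call):

```python
# Inspect the existing RDF table on the analytics cluster; the schema is not
# assumed here, so this only prints it and a row count.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-wikibase-rdf").getOrCreate()

rdf = spark.table("discovery.wikibase_rdf")
rdf.printSchema()
print(rdf.count())
```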