
Generate RDF from JSON
Open, Stalled, Medium, Public

Description

Instead of generating RDF dumps from the database, have a maintenance script that reads a JSON dump and generates RDF output from it. This would allow us to generate consistent RDF dumps for various scopes, flavors, and formats, with consistent data. It is also likely to be faster than loading entities from the external storage database (depending on filesystem access details).
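As a rough illustration only (the actual maintenance script would live in Wikibase and be written in PHP, reusing the existing RdfBuilder), here is a minimal Python sketch of the idea, assuming the standard Wikidata JSON dump layout: a gzipped JSON array with one entity object per line, each line ending with a comma. It only emits label triples; a real script would produce the full RDF mapping for the requested scope and flavor.

```
# Hypothetical sketch: read a Wikidata-style JSON dump and emit Turtle.
import gzip
import json
import sys

PREFIXES = (
    "@prefix wd: <http://www.wikidata.org/entity/> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
)

def dump_entities(path):
    """Yield one entity dict per line of the JSON dump."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue
            yield json.loads(line)

def entity_to_turtle(entity):
    """Emit a few illustrative triples (labels only) for one entity."""
    qid = entity["id"]
    lines = []
    for lang, label in entity.get("labels", {}).items():
        text = label["value"].replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'wd:{qid} rdfs:label "{text}"@{lang} .')
    return "\n".join(lines)

if __name__ == "__main__":
    sys.stdout.write(PREFIXES)
    for entity in dump_entities(sys.argv[1]):
        turtle = entity_to_turtle(entity)
        if turtle:
            print(turtle)
```

Because the dump has one entity per line, the same loop could be split across workers for parallel processing.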

Related Objects

Event Timeline

daniel raised the priority of this task from to Needs Triage.
daniel updated the task description. (Show Details)
daniel added a project: Wikidata.
daniel added a subscriber: daniel.
Lydia_Pintscher set Security to None.
JanZerebecki lowered the priority of this task from High to Medium. Jul 23 2015, 3:07 PM
Smalyshev changed the task status from Open to Stalled. Apr 4 2018, 8:31 PM

As far as I can see, nothing has happened here for a year. Moreover, the subtasks have also been dormant for a year, so this seems to be stalled. But if I'm wrong and something is happening here, please reclassify.

I think Wikidata-Toolkit could be used for that:
https://github.com/Wikidata/Wikidata-Toolkit/blob/master/wdtk-rdf/src/main/java/org/wikidata/wdtk/rdf/RdfSerializer.java
Obviously it would mean making sure the RDF serialization it produces is consistent with what is currently being fed into WDQS.

The analytics hadoop cluster could also be of use here: the task can easily take advantage of parallelization.

Change 617153 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Experimental support for creating dumps from JSON dumps

https://gerrit.wikimedia.org/r/617153

Addshore added a subscriber: Addshore.

> The analytics hadoop cluster could also be of use here: the task can easily take advantage of parallelization.

Indeed, and it already gets the JSON dumps loaded into it.
However, this would mean maintaining the JSON -> RDF mapping in multiple places (in PHP in Wikibase, and elsewhere to interface with Hadoop).
Though that sounds like something we could deal with in some way?

Info: there is already a job in the cluster doing the TTL -> RDF conversion. The TTL dumps are imported weekly and converted to Blazegraph RDF once available.
The job is maintained by the Search Platform team (ping @dcausse :).

Indeed, the RDF data is available in the Hive table discovery.wikibase_rdf, but it is generated by reading the TTL dumps, so it might not help for this particular task.
Using Hadoop would indeed allow us to process the JSON efficiently (see the sketch after the list below), but it has the drawbacks already pointed out:

  • it requires maintaining the Wikibase -> RDF projection in multiple codebases (PHP Wikibase and Spark)
  • once created on the Hadoop cluster, the output will have to be pushed back to the labstore machines for public consumption, which might add extra delay
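To make the parallelization point concrete, here is a minimal PySpark sketch, assuming the JSON dump is available on HDFS. The input path, output location, and the label-only projection are hypothetical; a real job would need to replicate Wikibase's full JSON -> RDF mapping (the first drawback above), and string escaping is omitted for brevity.

```
# Hypothetical PySpark sketch: map the JSON dump to N-Triples in parallel.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-dump-to-rdf").getOrCreate()

def to_triples(line):
    """Turn one dump line (one entity) into a list of N-Triples strings."""
    line = line.strip().rstrip(",")
    if line in ("[", "]", ""):
        return []
    entity = json.loads(line)
    qid = entity["id"]
    return [
        f'<http://www.wikidata.org/entity/{qid}> '
        f'<http://www.w3.org/2000/01/rdf-schema#label> '
        f'"{label["value"]}"@{lang} .'
        for lang, label in entity.get("labels", {}).items()
    ]

triples = (spark.sparkContext
           .textFile("hdfs:///path/to/wikidata/json_dump/latest")  # hypothetical input
           .flatMap(to_triples))
triples.saveAsTextFile("hdfs:///path/to/output/wikidata_rdf_nt")   # hypothetical output
```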

When T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth is ready, we could probably change some of the architecture and processes around dumping for Wikidata.org.

We would likely keep the existing scripts as they are for the Wikibase use cases, and may still want to create a script that generates TTL/RDF from a JSON dump.

For Wikidata.org we could move towards:
edit --> Kafka --> streaming job --> WMF API (get content) --> store on HDFS
and then generate the JSON, RDF, and TTL dumps from HDFS much faster and more consistently.
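A minimal sketch of what the streaming step might look like, assuming a Kafka revision-event topic and the Special:EntityData endpoint; the topic name, broker address, event field names, and output path are all assumptions, and a production job would batch requests, handle retries, and write to HDFS rather than a local file.

```
# Hypothetical sketch of "edit --> kafka --> streaming job --> API --> HDFS".
import json
import requests
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "eqiad.mediawiki.revision-create",           # assumed topic name
    bootstrap_servers="kafka.example.org:9092",  # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

with open("staging/wikidata_entities.jsonl", "a", encoding="utf-8") as out:
    for message in consumer:
        event = message.value
        # Field names below are assumed from the revision-create event schema.
        if event.get("database") != "wikidatawiki":
            continue
        title = event.get("page_title", "")
        if not title.startswith("Q"):            # crude item-namespace filter
            continue
        # Fetch the current entity JSON for the edited item.
        resp = requests.get(
            f"https://www.wikidata.org/wiki/Special:EntityData/{title}.json",
            timeout=30,
        )
        if resp.ok:
            # In production this would land on HDFS, not a local file.
            out.write(json.dumps(resp.json()) + "\n")
```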

This will likely tie into the ongoing Wikidata/Wikibase subsetting discussions too, as subsetting dumps from HDFS will be much easier than with the existing systems.
See T46581: Partial dumps etc.

But most of this probably lives in separate tickets.