Instead of generating RDF dumps from the database, have a maintenance script that reads a JSON dump and generates RDF output from it. This would allow us to generate consistent RDF dumps for various scopes, flavors, and formats, with consistent data. It is also likely to be faster than loading entities from the external storage database (depending on filesystem access details).
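For illustration, here is a minimal sketch of that core loop in Java (the actual maintenance script would be PHP inside Wikibase; the dump path and the mapping of only the English label are placeholder assumptions, not the real Wikibase RDF mapping):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Sketch only: stream a Wikidata JSON dump (a JSON array with one entity
// per line) and print one illustrative triple per entity. The real script
// would apply Wikibase's full RDF mapping and write proper output files.
public class JsonDumpToRdf {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0])),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.equals("[") || line.equals("]") || line.isEmpty()) {
                    continue; // skip the surrounding array syntax
                }
                if (line.endsWith(",")) {
                    line = line.substring(0, line.length() - 1);
                }
                JsonNode entity = mapper.readTree(line);
                String id = entity.path("id").asText();
                JsonNode label = entity.path("labels").path("en").path("value");
                if (!id.isEmpty() && !label.isMissingNode()) {
                    // Naive escaping; real N-Triples output needs full escaping.
                    System.out.printf(
                        "<http://www.wikidata.org/entity/%s> <http://www.w3.org/2000/01/rdf-schema#label> \"%s\"@en .%n",
                        id,
                        label.asText().replace("\\", "\\\\").replace("\"", "\\\""));
                }
            }
        }
    }
}
```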
Patch on mediawiki/extensions/Wikibase (master, +365 -39): Experimental support for creating dumps from JSON dumps
As far as I can see, nothing has happened here for a year, and the subtasks have been dormant for a year as well. So this seems to be stalled. But if I'm wrong and something is happening here, please reclassify.
I think Wikidata-Toolkit could be used for that.
Obviously, it would mean making sure that the RDF serialization it produces is consistent with what is currently being fed into WDQS.
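For reference, a rough sketch of driving Wikidata-Toolkit over a local JSON dump, based on its documented example code; the wdtk-rdf module also ships an RdfSerializer that implements the same processor interface (an assumption based on older examples, so the exact API may differ between releases):

```java
import org.wikidata.wdtk.datamodel.interfaces.EntityDocumentProcessor;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.datamodel.interfaces.PropertyDocument;
import org.wikidata.wdtk.dumpfiles.DumpProcessingController;
import org.wikidata.wdtk.dumpfiles.MwLocalDumpFile;

// Sketch: run an entity processor over a local Wikidata JSON dump with
// Wikidata-Toolkit. The dump path is a placeholder. The key question from
// this thread is whether the RDF this toolkit emits matches the Wikibase
// RDF mapping that WDQS expects.
public class WdtkDumpExample {
    public static void main(String[] args) {
        DumpProcessingController controller =
            new DumpProcessingController("wikidatawiki");
        controller.setOfflineMode(true); // do not fetch dumps from the web

        controller.registerEntityDocumentProcessor(
            new EntityDocumentProcessor() {
                @Override
                public void processItemDocument(ItemDocument item) {
                    // map the item to RDF here
                }

                @Override
                public void processPropertyDocument(PropertyDocument property) {
                    // map the property to RDF here
                }
            }, null, true);

        controller.processDump(
            new MwLocalDumpFile("dumps/wikidata-all.json.gz"));
    }
}
```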
Indeed, and the JSON dumps are already loaded into it.
However, this would mean keeping the JSON -> RDF mapping in multiple places (in PHP in Wikibase, and elsewhere to interface with Hadoop).
Though that sounds like something we could deal with in some way?
Indeed, the RDF data is available in the Hive table discovery.wikibase_rdf, but it is generated by reading the TTL dumps, so it might not help for this particular task.
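For completeness, a sketch of how that table can be read from Spark; the column names (subject, predicate, object), the date/wiki partition names, and the partition values are assumptions about the schema, not verified here:

```java
import org.apache.spark.sql.SparkSession;

// Sketch: peek at the existing RDF table. Column and partition names are
// assumptions about the discovery.wikibase_rdf schema.
public class WikibaseRdfPeek {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("wikibase-rdf-peek")
            .enableHiveSupport()
            .getOrCreate();

        spark.sql(
            "SELECT subject, predicate, object "
            + "FROM discovery.wikibase_rdf "
            + "WHERE date = '20240101' AND wiki = 'wikidata' " // placeholder partition
            + "LIMIT 10"
        ).show(false);

        spark.stop();
    }
}
```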
Using Hadoop would indeed allow us to process the JSON efficiently, but it has drawbacks, as already pointed out:
- it requires maintaining the Wikibase -> RDF projection in multiple codebases (PHP in Wikibase, and Spark); a minimal illustration is sketched after this list
- once created on the Hadoop cluster, the dumps would have to be pushed back to the labstore machines for public consumption, which might add extra delay
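To make the first drawback concrete, here is a hedged sketch of what even a one-label slice of the duplicated projection would look like as a Spark job (input/output paths are placeholders, and there is no proper N-Triples escaping):

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

// Sketch of the duplication problem: a Spark job re-implementing a one-label
// slice of the Wikibase JSON -> RDF projection. This logic already exists in
// Wikibase's PHP and would have to be kept in sync by hand.
public class SparkJsonDumpToRdf {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("json-dump-to-rdf")
            .getOrCreate();

        // The dump is a JSON array with one entity per line:
        // drop the brackets and trailing commas to get plain JSON objects.
        Dataset<String> jsonLines = spark.read().textFile(args[0])
            .filter((FilterFunction<String>) l -> l.startsWith("{"))
            .map((MapFunction<String, String>) l ->
                    l.endsWith(",") ? l.substring(0, l.length() - 1) : l,
                Encoders.STRING());

        Dataset<String> ntriples = jsonLines
            .selectExpr(
                "get_json_object(value, '$.id') AS id",
                "get_json_object(value, '$.labels.en.value') AS labelEn")
            .where("id IS NOT NULL AND labelEn IS NOT NULL")
            .selectExpr(
                "concat('<http://www.wikidata.org/entity/', id, '> "
                + "<http://www.w3.org/2000/01/rdf-schema#label> \"', "
                + "labelEn, '\"@en .')")
            .as(Encoders.STRING());

        // Output lands on HDFS; per the second point above, it would still
        // have to be shipped to the labstore machines for publication.
        ntriples.write().text(args[1]);
        spark.stop();
    }
}
```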