Instead of generating RDF dumps from the database, have a maintenance script that reads a JSON dump and generates RDF output from it. This would allow us to generate consistent RDF dumps for various scopes, flavors, and formats, with consistent data. It is also likely to be faster than loading entities from the external storage database (depending on filesystem access details).
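A minimal sketch of the idea in Python (the actual maintenance script would be PHP inside Wikibase and would reuse its full RDF mapping; the labels-only projection here is just a placeholder), assuming the standard Wikidata JSON dump layout of one entity object per line inside a single JSON array:

```python
import gzip
import json
import sys

def iter_entities(path):
    """Stream entities from a gzipped Wikidata-style JSON dump
    (a single JSON array with one entity object per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def entity_to_ntriples(entity):
    """Project an entity to N-Triples. Only labels are mapped here;
    a real script would reuse the full Wikibase RDF mapping
    (statements, sitelinks, etc.)."""
    subj = "<http://www.wikidata.org/entity/%s>" % entity["id"]
    for lang, label in entity.get("labels", {}).items():
        value = label["value"].replace("\\", "\\\\").replace('"', '\\"')
        yield '%s <http://www.w3.org/2000/01/rdf-schema#label> "%s"@%s .' % (
            subj, value, lang)

if __name__ == "__main__":
    for entity in iter_entities(sys.argv[1]):
        for triple in entity_to_ntriples(entity):
            print(triple)
```

Because the script only ever streams one entity at a time, memory use stays flat regardless of dump size, and different scopes/flavors are just different projection functions over the same input.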
Related patch: mediawiki/extensions/Wikibase (master, +365 -39): Experimental support for creating dumps from JSON dumps
Mentioned Here:
- T46581: Partial dumps
- T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth
As far as I can see, nothing has happened here for a year, and the subtasks have been dormant for a year as well. So it seems to be stalled. But if I'm wrong and something is happening here, please reclassify.
I think Wikidata-Toolkit could be used for that:
Obviously it would mean making sure that the RDF serialization it produces is consistent with what is currently being fed into WDQS.
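One way to check that consistency, sketched with rdflib in Python (the file names are hypothetical): serialize the same entity through both pipelines and diff the resulting graphs.

```python
from rdflib import Graph
from rdflib.compare import graph_diff, to_isomorphic

# Hypothetical file names: the same entity serialized by
# Wikidata-Toolkit and by Wikibase's own RDF dumper.
wdtk = to_isomorphic(Graph().parse("Q42.wdtk.ttl", format="turtle"))
wikibase = to_isomorphic(Graph().parse("Q42.wikibase.ttl", format="turtle"))

# graph_diff returns triples shared by both graphs and the
# triples unique to each side.
in_both, only_wdtk, only_wikibase = graph_diff(wdtk, wikibase)
for triple in only_wdtk:
    print("only in Wikidata-Toolkit output:", triple)
for triple in only_wikibase:
    print("only in Wikibase output:", triple)
```

Running this over a sample of entities would surface any divergence between the two serializations before switching dump producers.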
Indeed, and it already gets the JSON dumps loaded into it.
However, this would mean keeping the JSON -> RDF mapping in multiple places (both in PHP in Wikibase and elsewhere to interface with Hadoop).
Though that sounds like something we could deal with in some way?
Indeed, the RDF data is available in the Hive table discovery.wikibase_rdf, but it is generated by reading the TTL dumps, so it might not help for this particular task.
Using Hadoop would indeed allow us to process the JSON efficiently, but it has drawbacks, as already pointed out:
- it requires maintaining the Wikibase -> RDF projection in multiple codebases (PHP Wikibase and Spark; see the sketch after this list)
- once created on the Hadoop cluster, the dumps would have to be pushed back to the labstore machines for public consumption, which might add extra delay
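To make the first drawback concrete, here is roughly what the duplicated projection could look like as a PySpark job (a sketch only; the HDFS paths are placeholders, and the projection again covers labels only):

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wikidata-json-to-rdf").getOrCreate()

def parse_line(line):
    # The JSON dump is one big array with one entity per line.
    line = line.strip().rstrip(",")
    if not line or line in ("[", "]"):
        return []
    return [json.loads(line)]

def entity_to_ntriples(entity):
    # This duplicates (a tiny part of) the Wikibase RDF mapping,
    # which is exactly the maintenance burden described above.
    subj = "<http://www.wikidata.org/entity/%s>" % entity["id"]
    for lang, label in entity.get("labels", {}).items():
        value = label["value"].replace("\\", "\\\\").replace('"', '\\"')
        yield '%s <http://www.w3.org/2000/01/rdf-schema#label> "%s"@%s .' % (
            subj, value, lang)

(spark.sparkContext
    .textFile("hdfs:///wmf/data/wikidata/json_dump")        # placeholder path
    .flatMap(parse_line)
    .flatMap(entity_to_ntriples)
    .saveAsTextFile("hdfs:///wmf/data/wikidata/rdf_out"))   # placeholder path
```

Every rule added to the PHP mapping would have to be mirrored in a job like this, and vice versa.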
When T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth is ready, we could probably change some of the architecture and processes around dumping for Wikidata.org.
We would likely keep the existing scripts as they are for the Wikibase use cases, and may still want to create a script that generates TTL/RDF from a JSON dump.
For Wikidata.org we could move towards:
edit --> Kafka --> streaming job --> WMF API (get content) --> store on HDFS
From HDFS we could then generate the JSON, RDF, and TTL dumps much faster and more consistently.
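A rough sketch of the streaming step in Python (the broker address is a placeholder, the topic and event field names are assumptions based on the mediawiki.revision-create stream, and a production job would use a proper streaming framework and write to HDFS rather than local files):

```python
import json
import requests
from kafka import KafkaConsumer

# Placeholder broker; topic name assumed from the
# mediawiki.revision-create event stream.
consumer = KafkaConsumer(
    "mediawiki.revision-create",
    bootstrap_servers="kafka.example.org:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event.get("database") != "wikidatawiki":
        continue
    entity_id = event["page_title"]  # e.g. "Q42" for items in the main namespace
    # Fetch the current entity JSON from the public API.
    resp = requests.get(
        "https://www.wikidata.org/wiki/Special:EntityData/%s.json" % entity_id)
    entity = resp.json()["entities"][entity_id]
    # A real job would append to a partitioned dataset on HDFS;
    # one local file per entity just illustrates the flow.
    with open("%s.json" % entity_id, "w") as f:
        json.dump(entity, f)
```

With the current entity content continuously landing on HDFS, the periodic dump job becomes a read-only transformation over data that is already there.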
This will likely tie into the ongoing Wikidata/Wikibase subsetting discussions too, as subsetting dumps from HDFS will be much easier than with the existing systems (see the sketch at the end).
See T46581: Partial dumps etc.
But most of this probably lives in separate tickets.
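To illustrate the subsetting point: once the entities sit in a table on the cluster, a subset dump is essentially a one-line filter. A sketch (the table name, column layout, and the crude string-match predicate are all hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wikidata-subset").getOrCreate()

# Hypothetical table with one row per entity and a raw JSON column.
entities = spark.table("wmf.wikidata_entity_json")

# Example subset: every entity that uses property P59 (constellation).
# A crude string match, just to show that subsetting becomes a simple
# filter once the data lives on the cluster.
subset = entities.filter(F.col("json").contains('"P59"'))

subset.select("json").write.text("hdfs:///wmf/data/wikidata/subset_p59")
```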