
Make dumpRdf.php export in WDQS-ready format
Open, Low, Public

Description

At the moment, the Wikibase dumpRdf.php script produces TTL that should be piped through the Munge script before getting loaded into WDQS.

The Munge script removes some triples for better performance and splits the TTL into smaller chunks before the data is loaded into Blazegraph.

This step could be skipped if dumpRdf.php had an option to write TTL directly in the optimized format, and optionally to export a series of files instead of just one.
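For reference, the current manual pipeline looks roughly like the sketch below. Paths and flag names are illustrative and may differ between Wikibase and WDQS versions; check the scripts shipped with your installation.

```
# 1. Dump all entities as Turtle from the Wikibase repo
#    (run from the MediaWiki installation directory).
php extensions/Wikibase/repo/maintenance/dumpRdf.php \
    --format ttl \
    --output /srv/dumps/wikibase-dump.ttl

# 2. Munge the dump: drop triples WDQS does not need and split the
#    output into smaller chunks (run from the WDQS service directory).
./munge.sh -f /srv/dumps/wikibase-dump.ttl -d /srv/dumps/munged

# 3. Load the munged chunks into Blazegraph.
./loadData.sh -n wdq -d /srv/dumps/munged
```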

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as Low priority. · Sep 15 2020, 7:59 AM

To specify: the process of getting triples exported from Wikibase into Blazegraph is not great. It consists of a bunch of shell script wrappers around a Java application, neither of which is documented very well. This leads to weird side effects, such as requiring the use of the /wiki/ path on the MediaWiki that holds the Wikibase (T274354). The bigger story is to make the process of syncing triples with Blazegraph more robust and controllable, ideally without requiring a separate container.

The Wikidata setup, which has to deal with an edit volume that self-hosted Wikibases usually don't see, probably has different requirements.

This is probably strongly related to T287231: Consider moving WDQS "munging" of RDF into Wikibase RDF output code
If the desired output for WDQS were produced within Wikibase or a PHP extension, then:

  1. dumpRdf.php could dump data in the format expected by WDQS, removing the need for munging (see the sketch after this list)
  2. Special:EntityData could serve the right format, avoiding the need for munging during updates
  3. The same RDF could more easily be pushed into other stores (not just Blazegraph)
  4. Other update mechanisms (such as the use of MediaWiki jobs) become more realistic for third-party Wikibase installations
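
As a sketch of the idea only: an integrated option might allow a single invocation like the one below. The --wdqs-ready and --output-dir flags do not exist in dumpRdf.php today; they are hypothetical names used purely to illustrate what skipping the Munge step could look like.

```
# Hypothetical invocation: dump directly in WDQS-ready form, already
# split into chunks, so munge.sh is no longer needed.
# --wdqs-ready and --output-dir are NOT existing dumpRdf.php options;
# they are placeholders for whatever the eventual design chooses.
php extensions/Wikibase/repo/maintenance/dumpRdf.php \
    --format ttl \
    --wdqs-ready \
    --output-dir /srv/dumps/wdqs-chunks
```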