
QuickStatements to RDF converter
Closed, Resolved (Public)

Description

As agreed with @Lydia_Pintscher, @Smalyshev, and @Tpt at Wikimania 2017, the back end should only consume RDF.
On the other hand, the data provider can choose between RDF and QuickStatements.

Therefore, a conversion facility from QuickStatements to Wikidata RDF is needed.
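To make the task concrete, here is a minimal sketch (in Python, not the actual implementation) of what such a conversion looks like for the simplest QuickStatements case, an item–property–item claim. The namespaces follow the Wikidata RDF dump conventions; the truthy (`wdt:`) form is used here purely for brevity:

```python
# Minimal sketch of a QuickStatements-to-RDF conversion for the simplest
# case: tab-separated item/property/item claims. Namespaces follow the
# Wikidata RDF dump format; the truthy (wdt:) form is used for brevity.
WD = "http://www.wikidata.org/entity/"
WDT = "http://www.wikidata.org/prop/direct/"

def qs_line_to_triple(line: str) -> str:
    """Turn one tab-separated QuickStatements line into an N-Triples string."""
    subject, prop, value = line.strip().split("\t")
    return f"<{WD}{subject}> <{WDT}{prop}> <{WD}{value}> ."

print(qs_line_to_triple("Q42\tP31\tQ5"))
# -> <http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> .
```

A real converter would additionally handle qualifiers, references, and non-item values (strings, dates, quantities), which is where the reification questions discussed below come in.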

Event Timeline

@Lydia_Pintscher, @Smalyshev, and @Tpt: is there any info about how RDF is expected to behave as an import format for Wikidata? As far as I can tell, the RDF that gets fed into the Query Service is not designed for import at all:

  • first, there is a lot of redundancy: values are represented by simple values and value nodes, truthy statements are redundant with statement nodes, and other things like that. (This is absolutely not a criticism of the RDF serialization strategy: it totally makes sense as an export format!) So is there any designated subset of the exported triples that data producers would need to emit? I assume that subset would need to be as expressive as possible (so, for instance, the truthy triples would be dropped in favor of the full statement nodes). That is going to be very verbose, right?
  • second, the identifiers on the nodes are generated by Wikibase: so, how does a data producer pick identifiers? Is it just going to impose its own hashes that Wikibase will have to respect?
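To illustrate the redundancy in the first point, a single claim shows up in the exported RDF both as one truthy triple and as a reified statement node. The sketch below builds both forms for a hypothetical claim; the statement ID is a made-up placeholder, and prefixes follow the Wikidata dump conventions:

```python
# Illustration of the redundancy discussed above: the same claim emitted
# both as a truthy triple and as a reified statement node.
# The statement ID passed in is a made-up placeholder, not a real GUID.
PREFIXES = {
    "wd":  "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
    "p":   "http://www.wikidata.org/prop/",
    "ps":  "http://www.wikidata.org/prop/statement/",
    "wds": "http://www.wikidata.org/entity/statement/",
}

def claim_as_triples(item, prop, value, statement_id):
    # One truthy triple: subject -> direct property -> value.
    truthy = [(f"wd:{item}", f"wdt:{prop}", f"wd:{value}")]
    # The same claim reified: subject -> statement node -> value.
    reified = [
        (f"wd:{item}", f"p:{prop}", f"wds:{statement_id}"),
        (f"wds:{statement_id}", f"ps:{prop}", f"wd:{value}"),
    ]
    return truthy + reified

for s, p, o in claim_as_triples("Q42", "P31", "Q5", "Q42-abc123"):
    print(s, p, o, ".")
```

So every plain claim already costs three triples before any qualifiers or references are attached, which is the verbosity concern raised above.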

It would be great to have something other than QuickStatements to represent a data import, but I still have doubts about why RDF is suitable for that in the first place. The good thing about RDF is that it is a standard, so many tools can deal with it. But given the issues mentioned above, I expect it is going to be quite painful to reuse these tools to produce data in the right schema, as everything is deeply reified. Anyway, if that is the path you have chosen, we need specs please!

Also, it seems that this project uses Java, so may I suggest that the reusable parts go to the Wikidata-Toolkit rather than the Primary Sources Tool? Wikidata-Toolkit has already got RDF export, so it would make sense to have RDF import (from RDF statements to the datamodel representation, say).

If I understand the use case right, you may not need the full data set, and you won't be actually importing the data into WDQS database. So you can change the RDF model according to your case.

For identifiers, right now they are non-portable and implementation-dependent (see T167759), except for statement ones, which are just GUIDs. But again, given your use case, I think you can choose any consistent scheme (preferably one that returns the same ID for the same data), as you won't be mixing actual WDQS data and imported data directly.
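A consistent scheme of the kind suggested above could be as simple as hashing the statement's content, so that re-running an import always yields the same identifier. This is only a sketch of the idea, not the scheme Wikibase itself uses for its GUIDs:

```python
import hashlib

def stable_statement_id(subject: str, prop: str, value: str) -> str:
    """Derive a deterministic statement identifier from the claim content,
    so the same data always maps to the same ID across import runs.
    Illustrative only: Wikibase's own statement GUIDs are not content-based."""
    digest = hashlib.sha1(f"{subject}|{prop}|{value}".encode()).hexdigest()
    return f"{subject}-{digest[:12]}"

# Same input, same ID: safe to regenerate without coordinating with Wikibase,
# as long as imported data is kept separate from actual WDQS data.
print(stable_statement_id("Q42", "P31", "Q5"))
```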

@Smalyshev thanks for your quick reply! Just for clarity, I am not personally working on the PST, I was just trying to find out if there was any established way to use RDF to represent a data import. If that is the case, then other tools could use that format too (for instance, OpenRefine could export datasets to this format). I'd be happy to work on that, but I can only do it if an RDF model is agreed on.

It looks like the Munger from wikidata-query-rdf can be used to fill a Wikibase instance, but that is a different use case.

I am not aware of any "standard" but this is the time to make one then I guess ;-)

Right now RDF is only a secondary database format, so all imports are supposed to go through Wikibase instance, using one of Wikibase import formats. Using RDF for import AFAIK is not a use case we currently implement.

Right, and I don't think there is a desire to change that. But for tools like the primary sources tool, which could ingest RDF and then feed data into Wikidata through the usual API, I think that'd be ok.

@Lydia_Pintscher that makes sense. Okay, thank you to you both, we are on the same page! Given all these tickets on the topic I was worried that I had missed something obvious about this issue…

@Pintoch I think that Wikidata RDF for data imports should be as concise as possible to make things easier for data providers. I'm trying to reduce the RDF verbosity as much as I can, avoiding reified nodes where possible.
Of course, QuickStatements is much more compact, but totally non-standard, which is likely to mean no parsers, serializers, validators, etc. This was the main "community" rationale for choosing RDF.
Technically speaking, RDF also enables fine-grained filtering of suggested statements via SPARQL queries, which are very useful to improve things like this: M218/691.
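As a hypothetical illustration of that filtering, a curator could restrict suggested statements to a single property with a query along these lines. The graph name and the helper below are made up purely to show the idea:

```python
# Hypothetical illustration of SPARQL-based filtering of suggested
# statements: select candidates for one property from a (made-up)
# suggestions graph. The graph URI is illustrative, not a real endpoint.
SUGGESTIONS_GRAPH = "http://www.example.org/suggestions"

def suggestions_for_property(pid: str) -> str:
    """Build a SPARQL query selecting suggested claims for one property."""
    return f"""SELECT ?item ?value WHERE {{
  GRAPH <{SUGGESTIONS_GRAPH}> {{
    ?item <http://www.wikidata.org/prop/direct/{pid}> ?value .
  }}
}}"""

print(suggestions_for_property("P31"))
```

With QuickStatements, the same filtering would require a custom parser; with RDF, any SPARQL-capable store gets it for free, which is the technical rationale given above.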

Talking about specs, I've come up with what I think should be the relevant Wikidata RDF subset for data providers. I'm working with the WikiFactMine (ping @Tarrow ) guys to do a first real-world test.
Then, I'll put the specs out for discussion.

Hjfocs moved this task from Doing to Done on the Wikidata-primary-sources board.

First working version as a standalone tool: https://github.com/marfox/qs2rdf