
Data format updates for RDF export
Closed, Resolved, Public

Description

The following format changes were suggested by Markus:

  1. Ontology names should be types-specific and start with lowercase, e.g. wikibase:timePrecision
  2. Use separate URL/prefix for every context (we right now reuse v: in statements and references)

Also, it was suggested that we may want to change the fact that we use entity:P1234 in link Entity->Statement and give it a distinct URL. However, then it is not clear what would be the link between entity:P1234 and the rest of the data.

See URL scheme proposal at: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format/Proposal

Event Timeline

Smalyshev claimed this task.
Smalyshev raised the priority of this task from to Normal.
Smalyshev updated the task description.
Restricted Application added a subscriber: Aklapper. Mar 20 2015, 11:34 PM

Also, it was suggested that we may want to change the fact that we use entity:P1234 in link Entity->Statement and give it a distinct URL. However, then it is not clear what would be the link between entity:P1234 and the rest of the data.

This is a good point. It affects all property variants (qualifiers, values, ...) that we generate. We should have explicit links from the entity P1234 to every RDF property that we use to model this Wikidata property in different contexts. WDTK already has an open issue on this: https://github.com/Wikidata/Wikidata-Toolkit/issues/84

Smalyshev set Security to None. Mar 21 2015, 9:43 PM

@mkroetzsch is there any existing ontology we may want to use to create such links between entity:P1234 and v:P1234 or q:P1234? Or should we just invent our own?

Also, if we never use entity:P1234 in statements, then to look it up (e.g. for type, etc. if we add type to property export, or for properties) one would have to do an additional hop with something like: ?entity wikibase:represents v:P1234 instead of just using it directly. Not sure if it's a big issue.

is there any existing ontology we may want to use to create such links between entity:P1234 and v:P1234 or q:P1234? Or should we just invent our own?

We would have to make new URIs here. This depends on which/how many variants of RDF property URIs we use: we should use a different link for each kind of RDF property variant. For example, we could have :P1234 wikibase:qualifierProperty q:P1234.
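A minimal sketch of how such per-variant links could be generated, one link predicate per kind of RDF property variant. Only wikibase:qualifierProperty and the q: and v: prefixes come from this discussion; the function name variant_links and the predicate wikibase:valueProperty are hypothetical placeholders, not an agreed vocabulary.

```python
def variant_links(prop_id, variants):
    """Build a Turtle snippet linking entity:<prop_id> to its RDF property variants.

    `variants` maps a link predicate to the prefix of the variant namespace,
    e.g. {"wikibase:qualifierProperty": "q"}. The predicate names are
    illustrative assumptions, not a fixed ontology.
    """
    if not variants:
        raise ValueError("at least one variant is required")
    # One predicate-object pair per variant, all sharing the same subject.
    pairs = [f"  {pred} {prefix}:{prop_id}" for pred, prefix in variants.items()]
    return f"entity:{prop_id}\n" + " ;\n".join(pairs) + " ."


if __name__ == "__main__":
    print(variant_links("P1234", {
        "wikibase:qualifierProperty": "q",
        "wikibase:valueProperty": "v",  # hypothetical predicate name
    }))
```

Each variant gets its own predicate, so a consumer can tell from the link alone which context (qualifier, value, ...) a given RDF property belongs to.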

Also, if we never use entity:P1234 in statements, then to look it up (e.g. for type, etc. if we add type to property export, or for properties) one would have to do an additional hop with something like: ?entity wikibase:represents v:P1234 instead of just using it directly. Not sure if it's a big issue.

I would say that it is not a big issue since most of the RDF properties we use will always have that problem. If we use the property entity as an RDF property, it would only replace one of the uses of property variants. In all other places, you would still need the additional hop to get the label.

It would be good if the linked data export for all RDF property variants could include the entity labels. I would not add them to the dumps though (1000 x 300 x 5 is a lot of additional triples).

For convenient SPARQL-based access, we should provide query interfaces that retrieve labels for IRIs that occur in query results so that users don't have to SPARQL for the label. Such post-query labelling is done in WDQ. It will be easy to extend this to property variants without even looking at the RDF graph. This will make the SPARQL queries much lighter in general.
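A rough sketch of what such post-query labelling could look like, assuming property variant IRIs end in the entity id, optionally followed by a one-letter variant suffix (as in the http://www.wikidata.org/entity/P31c example later in this thread). The LABELS table and the function names are hypothetical; a real service would fetch labels from Wikibase rather than from a dict.

```python
import re

# Hypothetical label table keyed by entity id (stand-in for a real lookup).
LABELS = {"P31": "instance of"}

# Entity id (P or Q plus digits) at the end of an IRI, optionally followed
# by a single lowercase variant suffix. The suffix scheme is an assumption.
ENTITY_ID = re.compile(r"([PQ][1-9][0-9]*)[a-z]?$")


def label_for(iri, labels=LABELS):
    """Return the label for the entity behind an IRI, or the IRI itself."""
    match = ENTITY_ID.search(iri)
    if match is None:
        return iri
    return labels.get(match.group(1), iri)


def label_results(rows, labels=LABELS):
    """Replace every IRI in a list of result rows with its label if known."""
    return [[label_for(value, labels) for value in row] for row in rows]


if __name__ == "__main__":
    rows = [["http://www.wikidata.org/entity/P31c", "http://example.org/other"]]
    print(label_results(rows))
```

The lookup works purely on the IRI string, so it needs neither label triples for the variants nor an extra hop in the SPARQL query.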

Looks like schema: uses capital case for objects - <something> a schema:Article but lowercase for predicates - <something> schema:inLanguage "en". We should probably follow that.

@Smalyshev Yes, using lower-case local names for properties is a widely used convention and we should definitely follow that for our ontology. However, I would rather not change the case of our P1234 property ids when they occur in property URIs, since Wikibase ids might be case sensitive in the future (Commons files will have their filename as id, and even if standard MW is first-letter case-insensitive in articles, it can be configured to be otherwise). It would also create some confusion if one had to write "p1234" in some interfaces and "P1234" in others (maybe even both would occur in RDF since we have a P1234 entity and several related properties).

I agree, I wouldn't change P1234 to lowercase, I was talking about other things like predicates wikibase:Rank or wikibase:Badge which probably should be wikibase:rank and wikibase:badge or wikibase:hasBadge.
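To make the convention concrete, here is a small sketch that lowercases the first letter of a predicate local name while leaving entity ids such as P1234 untouched. The function name and the id pattern are illustrative assumptions, and it is meant for predicates only: class names such as wikibase:Property would keep their capital letter.

```python
import re

# Wikibase-style entity ids (P or Q followed by digits) keep their case.
ENTITY_ID = re.compile(r"^[PQ][0-9]+$")


def predicate_local_name(name):
    """Return `name` with a lowercase first letter, unless it is an entity id."""
    if not name or ENTITY_ID.match(name):
        return name
    return name[0].lower() + name[1:]


if __name__ == "__main__":
    for local in ["Rank", "Badge", "timePrecision", "P1234"]:
        print(local, "->", predicate_local_name(local))
```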

I would not add them to the dumps though (1000 x 300 x 5 is a lot of additional triples).

Don't see why it would be this many. It'd be like 4 additional rows per property:

entity:P5 a wikibase:Property ;
  wikibase:property p:P5 ;
  wikibase:qualifier q:P5 ;
  wikibase:assert wdt:P5 ;
  wikibase:reference r:P5 .

We have ~2000 props, so just 8000 new triples, not a big deal IMO. We could of course have it in separate dump but I don't see it as a big issue.
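For comparison, the two estimates in this exchange can be worked out side by side. This is only back-of-the-envelope arithmetic using the rough figures quoted above. It assumes the factors in "1000 x 300 x 5" stand for properties, languages and variants, and it counts the 4 link triples per property, leaving out the type triple.

```python
def link_triples(num_properties, links_per_property=4):
    """Triples needed to link each property entity to its RDF variants."""
    return num_properties * links_per_property


def label_triples(num_properties, num_languages, num_variants):
    """Triples needed to repeat every label on every property variant."""
    return num_properties * num_languages * num_variants


if __name__ == "__main__":
    # Rough estimates from the thread, not measured values.
    print(link_triples(2000))           # 8000
    print(label_triples(1000, 300, 5))  # 1500000
```

The link triples are about two orders of magnitude fewer than the label triples, which fits the view that the links are cheap to include while per-variant labels are not.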

Change 200117 had a related patch set uploaded (by Smalyshev):
T93451: ontology fix - ontology predicates go in lowercase

https://gerrit.wikimedia.org/r/200117

Change 200119 had a related patch set uploaded (by Smalyshev):
T93451: ontology fixes - namespace the value predicates

https://gerrit.wikimedia.org/r/200119

Don't see why it would be this many. It'd be like 4 additional rows per property:

I was referring to the labels. For some use cases, it could be convenient if each of the property variants also had the rdfs:label of the property item. For example, RDF browsers will not be able to label a property variant such as :P1234q (or whatever we use) if we don't include any label for it. But including all labels (up to 300 languages) for all variants would lead to a lot of triples in the dump.

Ah, no, that'd not be good to add so many labels. But :P1234q or q:P1234 are predicates, and RDF browsers should be able to handle predicates without labels, no? Because something like schema:about probably doesn't have any labels either. I think in general it would be nice to separate querying and labeling - i.e. query works in terms of Qs and Ps and then there's a way to get labels. But not sure if that'd help with RDF browsers.

Change 200117 merged by jenkins-bot:
T93451: ontology fix - ontology predicates go in lowercase

https://gerrit.wikimedia.org/r/200117

All RDF tools should be able to handle resources without labels (no matter if used as subject, predicate, or object). But data browsers or other UIs will simply show the URL (or an automatically created abbreviated version of it) to the user. So instead of "instance of" it would read something like "http://www.wikidata.org/entity/P31c". Nevertheless, we can accept this for now. AFAIK there are no widely used generic RDF data browsers anyway, and it's much more likely that people will first create Wikidata-aware interfaces.

Change 200274 had a related patch set uploaded (by Smalyshev):
T93451: drop version from ontology

https://gerrit.wikimedia.org/r/200274

Change 200119 merged by jenkins-bot:
T93451: ontology fixes - namespace the value predicates

https://gerrit.wikimedia.org/r/200119

Change 200274 merged by jenkins-bot:
T93451: drop version from ontology

https://gerrit.wikimedia.org/r/200274

Smalyshev closed this task as Resolved. Apr 23 2015, 5:20 PM