RDF dumps for Structured Data on Commons are broken
Closed, Resolved, Public

Description

SDC should follow the standard URI pattern for RDF dumps, but it does not.

The latest config change in this area was https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/609987/9/wmf-config/InitialiseSettings.php

Currently, Wikidata's data and entity URI prefixes look like this:

@prefix wd: <http://www.wikidata.org/entity/> .
@prefix data: <https://www.wikidata.org/wiki/Special:EntityData/> .

SDC currently has these prefixes:

@prefix sdc: <https://commons.wikimedia.org/wiki/Special:EntityData/> .
@prefix sdcdata: <https://commons.wikimedia.org/wiki/Special:EntityData/> .

The other prefixes are built upon the data prefix (or, as always, on the entity prefix, but here the two are identical).
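
The collision described above can be spotted mechanically in a TTL header. The following is a minimal sketch, assuming a simple one-declaration-per-line header; the helper names `parse_prefixes` and `colliding_prefixes` are hypothetical and not part of any Wikibase tooling.

```python
import re

# Matches Turtle prefix declarations such as: @prefix sdc: <https://example/> .
PREFIX_RE = re.compile(
    r"^@prefix\s+([A-Za-z][\w-]*)?:\s*<([^>]*)>\s*\.\s*$", re.MULTILINE
)


def parse_prefixes(header: str) -> dict:
    """Return a {prefix name: IRI} mapping for every @prefix line in the header."""
    return {name: iri for name, iri in PREFIX_RE.findall(header)}


def colliding_prefixes(prefixes: dict) -> dict:
    """Group prefix names that expand to the same IRI."""
    by_iri = {}
    for name, iri in prefixes.items():
        by_iri.setdefault(iri, []).append(name)
    return {iri: names for iri, names in by_iri.items() if len(names) > 1}


# Header as currently emitted (copied from this task description).
BROKEN_HEADER = """\
@prefix sdc: <https://commons.wikimedia.org/wiki/Special:EntityData/> .
@prefix sdcdata: <https://commons.wikimedia.org/wiki/Special:EntityData/> .
"""

# Header as it should be (copied from this task description).
FIXED_HEADER = """\
@prefix sdc: <https://commons.wikimedia.org/entity/> .
@prefix sdcdata: <https://commons.wikimedia.org/wiki/Special:EntityData/> .
"""

if __name__ == "__main__":
    print(colliding_prefixes(parse_prefixes(BROKEN_HEADER)))
    print(colliding_prefixes(parse_prefixes(FIXED_HEADER)))
```

With the broken header, `sdc` and `sdcdata` are reported as sharing one IRI; with the fixed header nothing is reported.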

SDC should mirror Wikidata, with distinct entity and data prefixes:

@prefix sdc: <https://commons.wikimedia.org/entity/> .
@prefix sdcdata: <https://commons.wikimedia.org/wiki/Special:EntityData/> .

The correct prefixes should be:

@prefix sdc: <https://commons.wikimedia.org/entity/> .
@prefix sdcdata: <https://commons.wikimedia.org/wiki/Special:EntityData/> .
@prefix sdcs: <https://commons.wikimedia.org/entity/statement/> .
@prefix sdcref: <https://commons.wikimedia.org/reference/> .
@prefix sdcv: <https://commons.wikimedia.org/value/> .
@prefix sdct: <https://commons.wikimedia.org/prop/direct/> .
@prefix sdctn: <https://commons.wikimedia.org/prop/direct-normalized/> .
@prefix sdcp: <https://commons.wikimedia.org/prop/> .
@prefix sdcps: <https://commons.wikimedia.org/prop/statement/> .
@prefix sdcpsv: <https://commons.wikimedia.org/prop/statement/value/> .
@prefix sdcpsn: <https://commons.wikimedia.org/prop/statement/value-normalized/> .
@prefix sdcpq: <https://commons.wikimedia.org/prop/qualifier/> .
@prefix sdcpqv: <https://commons.wikimedia.org/prop/qualifier/value/> .
@prefix sdcpqn: <https://commons.wikimedia.org/prop/qualifier/value-normalized/> .
@prefix sdcpr: <https://commons.wikimedia.org/prop/reference/> .
@prefix sdcprv: <https://commons.wikimedia.org/prop/reference/value/> .
@prefix sdcprn: <https://commons.wikimedia.org/prop/reference/value-normalized/> .
@prefix sdcno: <https://commons.wikimedia.org/prop/novalue/> .
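
The list above follows a regular pattern: every prefix except `sdcdata` is a fixed path below the site root. As an illustration only, the sketch below rebuilds the list from that pattern. The `PATHS` table and the `build_prefixes` / `to_turtle` helpers are hypothetical and only mirror the list in this description; this is not how Wikibase itself generates the header.

```python
# Assumed bases, taken from the prefix list in this task description.
SITE = "https://commons.wikimedia.org/"
DATA = SITE + "wiki/Special:EntityData/"

# Suffix appended to the base prefix name -> path below the site root.
PATHS = {
    "": "entity/",
    "s": "entity/statement/",
    "ref": "reference/",
    "v": "value/",
    "t": "prop/direct/",
    "tn": "prop/direct-normalized/",
    "p": "prop/",
    "ps": "prop/statement/",
    "psv": "prop/statement/value/",
    "psn": "prop/statement/value-normalized/",
    "pq": "prop/qualifier/",
    "pqv": "prop/qualifier/value/",
    "pqn": "prop/qualifier/value-normalized/",
    "pr": "prop/reference/",
    "prv": "prop/reference/value/",
    "prn": "prop/reference/value-normalized/",
    "no": "prop/novalue/",
}


def build_prefixes(name="sdc", site=SITE, data=DATA):
    """Return {prefix name: IRI}; only the data prefix lives under Special:EntityData."""
    prefixes = {name + suffix: site + path for suffix, path in PATHS.items()}
    prefixes[name + "data"] = data
    return prefixes


def to_turtle(prefixes):
    """Render the mapping as Turtle @prefix declarations, one per line."""
    return "\n".join(f"@prefix {n}: <{iri}> ." for n, iri in prefixes.items())


if __name__ == "__main__":
    print(to_turtle(build_prefixes()))
```

Keeping the entity base and the data base as separate inputs makes it harder to repeat the original mistake, where both pointed at Special:EntityData.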

NT dumps are also affected.

Event Timeline

dcausse renamed this task from RDF TTL format for Structured Data on Commons is broken to RDF dumps for Structured Data on Commons are broken. Jul 21 2020, 8:42 AM
dcausse triaged this task as High priority.
dcausse updated the task description.

I wonder whether these IRIs should be http rather than https, because the Concept URI link on the file page refers to http://commons.wikimedia.org/entity/M123.

Having a decision there would be great :)
I don't know if the decision to go with https was a conscious one or not.

I think the Concept URI link should match what's in the dump. T226453#5436469 suggests that if these URIs for Commons are not yet too widespread, https could be used.

dcausse updated the task description.

Change 615171 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/mediawiki-config@master] [sdoc] fix entity source base URIs

https://gerrit.wikimedia.org/r/615171

Patch looks good and IMO can be deployed whenever!

Change 615171 merged by jenkins-bot:
[operations/mediawiki-config@master] [sdoc] fix entity source base URIs

https://gerrit.wikimedia.org/r/615171

Mentioned in SAL (#wikimedia-operations) [2020-07-23T11:13:23Z] <dcausse@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T258474: [sdoc] fix entity source base URIs (duration: 01m 07s)

The ticket description should be rewritten.

SDC doesn't have its own properties, so prefixes like sdcp, sdcps, etc. are not appropriate and should not appear (cf. the discussion at T258625).

@Jheald Correct, but in these tickets we have always listed all the SDC-related prefixes seen in the dumps. The dump emits all these "unneeded" prefixes because, at the time it writes the file header, it does not yet know which prefixes it will actually need. I don't think it hurts to have unneeded prefixes declared in the Turtle header.

@dcausse It *does* hurt a person who is trying to make sense of the dump: they will see all these unfamiliar prefixes declared and may then assume that there are corresponding kinds of predicates or objects they have to make sense of.

Better to remove all the ones that we do not use, so that the specific sdc ones we do use stand out.

@Jheald I created T259587 for the problem you raised; the tickets here are for fixing the problems that make the dumps unusable. Unused prefixes, while confusing, are perfectly valid: they won't cause a TTL parser to fail, nor will they alter the meaning of the graph.
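
For anyone who wants to see which declared prefixes a given TTL file never uses, here is a rough sketch. It is a regex-based approximation, not a real Turtle parser (multi-line string literals, for example, are not handled), and the `unused_prefixes` helper and the sample triple are illustrative assumptions rather than actual dump content.

```python
import re

# A prefix declaration such as: @prefix sdc: <https://example/> .
PREFIX_DECL = re.compile(r"^@prefix\s+([A-Za-z][\w-]*):\s*<[^>]*>\s*\.", re.MULTILINE)
# The prefix part of a prefixed name such as sdc:M123 (not preceded by a word char or colon).
PNAME_PREFIX = re.compile(r"(?<![\w:])([A-Za-z][\w-]*):")


def unused_prefixes(ttl: str) -> list:
    """Return declared prefixes that never appear in a prefixed name in the body."""
    declared = PREFIX_DECL.findall(ttl)
    body = PREFIX_DECL.sub("", ttl)
    # Drop full IRIs and single-line string literals so their contents
    # are not mistaken for prefixed names.
    body = re.sub(r"<[^>]*>", "", body)
    body = re.sub(r'"(?:[^"\\]|\\.)*"', "", body)
    used = set(PNAME_PREFIX.findall(body))
    return [prefix for prefix in declared if prefix not in used]


# Illustrative sample only; the triple below is made up for this sketch.
SAMPLE = """\
@prefix sdc: <https://commons.wikimedia.org/entity/> .
@prefix sdcp: <https://commons.wikimedia.org/prop/> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .

sdc:M123 wdt:P180 wd:Q42 .
"""

if __name__ == "__main__":
    print(unused_prefixes(SAMPLE))
```

In the sample, `sdcp` is declared but never used, which does not change the single triple the file encodes.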