Comparison of the existing Wikidata RDF dumps
Closed, Resolved | Public | 1 Estimated Story Points

Event Timeline

Lucie raised the priority of this task from to Needs Triage.
Lucie updated the task description. (Show Details)
Lucie added a subscriber: Lucie.
Lucie set Security to None.
Lydia_Pintscher moved this task from incoming to ready to go on the Wikidata board.
| | http://tools.wmflabs.org/wikidata-exports/rdf/exports/20150223/ | https://dumps.wikimedia.org/wikidatawiki/entities/20150420/ |
| --- | --- | --- |
| file ending/type | nt (subset of RDF/ttl) | ttl |
| triple | the whole link <http://wikidata.org/entity/Q1> | turtle (prefixes) |
| dumps | multiple dumps | one, https://phabricator.wikimedia.org/T93488 |
| labels (& aliases & descriptions) | one triple per label and language, using http://www.w3.org/2000/01/rdf-schema#label | three triples per language: rdfs:label, skos:prefLabel, schema:name |
| statement GUID | always uppercase, starting with S (Q1Sf5d5115d-489a-7654-9a0a-5eea5be80d07) | sometimes upper, sometimes lowercase, starting with - (q1-0479EB23-FC5B-4EEC-9529-CEE21D6C6FA9) |
| statement value | e:Q1Sguid e:P1036v 113 // truthy would be with suffix c | as triple, 'truthy': e:Q1 wdt:P1036 "113"; also as full statement |
| properties | with P123s for statement and P123v for value | prefix s (statement) for statements and wdt (assert) for values (in full statements), otherwise prefix v |
| sitelinks | no badges, enwikilink a wikidata.org/ontology#Article | badges, enwikilink a schema:Article |
| metadata (like license and date) | no | yes |
| defining WD links as types of RDF classes | yes | no (planned as separate OWL file; https://phabricator.wikimedia.org/T97522) |
| calendars | Gregorian | Julian and Gregorian |

Are there any differences we're missing? Are we ok with these differences?

I will do a complete review of the updated RDF mapping in the course of the next week. I will report back then if there is anything missing in the diff.

Also, what is the expected outcome of this bug? A table like the one posted by Lucie? Or something with more detail? Some rows in the current table are probably only understood by people who already know both dumps ;-) Is this meant to be only for our "internal" information?

Another relevant note here might be that the plan is to fully align WDTK mappings with the updated RDF dumps, so that many of the above will go away (the split into several files would remain though). We just did not do this while we were still discussing the updated RDF mapping.

@Lucie:

  • The second row difference is just a consequence of what was already stated in the first row (NTriples vs Turtle). Maybe this can be merged/deleted.
  • It seems that the entry in row "labels (&aliases&descriptions)" only refers to "labels". The properties "skos:prefLabel" and "schema:name" are not used for descriptions or aliases in either dump, AFAIK.
  • It would make sense to distinguish differences in distribution/surface syntax (which format, how many files, which compression algorithm, ...) from real differences in the RDF model (=differences that matter for SPARQL users).
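To illustrate the first and last points, here is a minimal Python sketch showing that the prefixed Turtle form and the full-IRI NTriples form carry the same triples, using the three label properties from the table as the example. The `e:` prefix mapping is an assumption read off the `<http://wikidata.org/entity/Q1>` example in the table, the helper names are hypothetical, the label text is only illustrative, and no literal escaping is done.

```python
# Prefix map: rdfs, skos and schema are the standard namespaces; the "e"
# mapping is an assumption taken from the table's <http://wikidata.org/entity/Q1>.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "schema": "http://schema.org/",
    "e": "http://wikidata.org/entity/",
}

LABEL_PROPERTIES = ("rdfs:label", "skos:prefLabel", "schema:name")


def expand(term):
    """Turn a prefixed name into a full IRI; leave literals and IRIs alone."""
    if term.startswith('"') or term.startswith("<"):
        return term
    prefix, _, local = term.partition(":")
    return f"<{PREFIXES[prefix]}{local}>"


def label_triples(entity, text, lang):
    """Three triples per label and language, as in the right-hand column."""
    literal = f'"{text}"@{lang}'  # sketch only: no escaping of quotes
    return [(entity, prop, literal) for prop in LABEL_PROPERTIES]


def to_ntriples(triples):
    """Serialize prefixed triples as NTriples lines with full IRIs."""
    return [" ".join(expand(t) for t in triple) + " ." for triple in triples]


if __name__ == "__main__":
    for line in to_ntriples(label_triples("e:Q1", "universe", "en")):
        print(line)
```

A SPARQL user would see the same three triples either way; only the serialization on disk differs.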

Correction: our dates in RDF are Gregorian only (xsd:dateTime), but the calendar is kept so you can display the date as Julian. The dates themselves should be Gregorian; if this doesn't work, that would be a pretty major bug. We have code that represents a date as a string if it is really bad, but we may drop that eventually and just not represent such dates at all.
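As a rough illustration of what "Gregorian only" means for a value entered in the Julian calendar, here is a minimal Python sketch that converts a Julian calendar date to a proleptic Gregorian xsd:dateTime string via the Julian Day Number. This is the standard textbook conversion, not the code used for the dumps; the function names are hypothetical, and it assumes positive years and day precision.

```python
def julian_to_jdn(year, month, day):
    """Julian Day Number of a date given in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_gregorian(jdn):
    """(year, month, day) in the proleptic Gregorian calendar for a JDN."""
    a = jdn + 32044
    b = (4 * a + 3) // 146097
    c = a - (146097 * b) // 4
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = 100 * b + d - 4800 + m // 10
    return year, month, day


def julian_to_xsd_datetime(year, month, day):
    """Gregorian xsd:dateTime string for a Julian calendar date (midnight UTC)."""
    gy, gm, gd = jdn_to_gregorian(julian_to_jdn(year, month, day))
    return f"{gy:04d}-{gm:02d}-{gd:02d}T00:00:00Z"


if __name__ == "__main__":
    # The day after Julian 1582-10-04 was Gregorian 1582-10-15.
    print(julian_to_xsd_datetime(1582, 10, 5))
```

The calendar kept alongside the value would then tell a client to convert back to Julian for display.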

About GUIDs - we always use the actual GUID that's in the data, not sure what WDTK does.
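For anyone comparing statements across the two dumps, a small normalizer could bring both GUID spellings from the table into one shape. This Python sketch is only an assumption about how the two forms relate (entity ID, then a separator `S` or `-`, then a UUID compared case-insensitively); `normalize_guid` is a hypothetical helper, not something either tool provides.

```python
import re

# Entity ID, one separator character (S or -), then an 8-4-4-4-12 hex UUID.
# The two accepted separators are taken from the examples in the table.
_GUID = re.compile(
    r"^(?P<entity>[QP]\d+)[S-]"
    r"(?P<uuid>[0-9A-F]{8}(?:-[0-9A-F]{4}){3}-[0-9A-F]{12})$",
    re.IGNORECASE,
)


def normalize_guid(guid):
    """Return (entity_id, uuid), both uppercased, for either GUID spelling."""
    match = _GUID.match(guid)
    if match is None:
        raise ValueError(f"unrecognized statement GUID: {guid!r}")
    return match.group("entity").upper(), match.group("uuid").upper()


if __name__ == "__main__":
    print(normalize_guid("Q1Sf5d5115d-489a-7654-9a0a-5eea5be80d07"))
    print(normalize_guid("q1-0479EB23-FC5B-4EEC-9529-CEE21D6C6FA9"))
```

The two GUIDs above are just the examples from the table and belong to different statements, so they do not compare equal after normalization.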

About the classes - I think it's important to emphasize that assigning fixed classes to our properties (T97522), i.e. wikibase:Statement rdf:type owl:Class, and having support for converting something like P31 to rdf:type are entirely different things. Some people confuse them, which gets even more confusing because WDTK seems to have support for both. We need to clearly distinguish these.
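To make the distinction concrete, here is a minimal Python sketch with triples as plain tuples of prefixed names. The ontology declaration is the one named above; the data triples, the `e:` and `wdt:` prefixes on them, and the helper name are illustrative assumptions, not what either dump actually emits.

```python
RDF_TYPE = "rdf:type"

# (1) Fixed ontology declarations: schema-level facts about the vocabulary
# itself, independent of any item data (the subject of T97522).
ONTOLOGY = [("wikibase:Statement", RDF_TYPE, "owl:Class")]


def p31_to_rdf_type(triples, instance_of="wdt:P31"):
    """(2) Data-level rewrite: turn 'instance of' statements into rdf:type."""
    return [(s, RDF_TYPE if p == instance_of else p, o) for s, p, o in triples]


if __name__ == "__main__":
    data = [
        ("e:Q42", "wdt:P31", "e:Q5"),      # illustrative item data
        ("e:Q1", "wdt:P1036", '"113"'),    # unrelated statement, left as-is
    ]
    for triple in ONTOLOGY + p31_to_rdf_type(data):
        print(triple)
```

The first list never changes with the data; the second depends entirely on which P31 statements editors have entered.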

For properties I'd link to https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Predicates since we have 8 forms of property predicate. That is not counting wdno:, which is kind of related to properties too but is a class for technical reasons. This is actually almost the same as what WDTK is doing, but I think the naming is different.

> I will do a complete review of the updated RDF mapping in the course of the next week. I will report back then if there is anything missing in the diff.

Thank you!

> Also, what is the expected outcome of this bug? A table like the one posted by Lucie? Or something with more detail? Some rows in the current table are probably only understood by people who already know both dumps ;-) Is this meant to be only for our "internal" information?

This is for us internally to make sure we're all on the same page and are good with what we have. When we're further along we can check what kind of public-facing documentation we need.

Addshore added a subscriber: Addshore.

(as it's been moved to done on the sprint)

It's done from our side for the sprint but we'll keep it open for Markus.

Lucie removed Lucie as the assignee of this task. Jul 27 2015, 10:50 AM