Compare the structure etc. of the dumps at https://www.wikidata.org/wiki/Wikidata:Database_download#RDF_dumps and https://people.wikimedia.org/~hoo/tmp/
The docs for the new RDF dump format are here: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format
First results are collected in a spreadsheet here: https://docs.google.com/a/wikimedia.de/spreadsheets/d/1cI7EYMiyUIqqsvMxPH5Zryt8dVIxJb0bYOOtBY-cSno
aspect | http://tools.wmflabs.org/wikidata-exports/rdf/exports/20150223/ | https://dumps.wikimedia.org/wikidatawiki/entities/20150420/ |
file ending/type | nt (N-Triples, a subset of Turtle) | ttl (Turtle) |
triple | the full IRI, e.g. <http://wikidata.org/entity/Q1> | Turtle prefixed names |
dumps | multiple dumps | one, https://phabricator.wikimedia.org/T93488 |
labels (&aliases&descriptions) | one label per language, one triple each: http://www.w3.org/2000/01/rdf-schema#label | three triples per language: rdfs:label, skos:prefLabel, schema:name |
statement GUID | always uppercase, starting with S (Q1Sf5d5115d-489a-7654-9a0a-5eea5be80d07) | sometimes uppercase, sometimes lowercase, starting with - (q1-0479EB23-FC5B-4EEC-9529-CEE21D6C6FA9) |
statement value | e:Q1Sguid e:P1036v 113 // truthy would be with suffix c | as triple, 'truthy': e:Q1 wdt:P1036 "113"; also as full statement |
properties | P123s for the statement and P123v for the value | prefix s (statement) for statements and wdt (assert) for values (in full statements), otherwise prefix v |
sitelinks | no badges, enwikilink a wikidata.org/ontology#Article | badges, enwikilink a schema:Article |
metadata (like license and date) | no | yes |
defining WD links as types of RDF classes | yes | no (planned as a separate OWL file; https://phabricator.wikimedia.org/T97522) |
calendars | Gregorian | Julian and Gregorian |
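To make the "labels" row of the table concrete, here is a minimal sketch that writes label triples in N-Triples syntax, once with a single predicate and once with three. The entity namespace is taken from the table row above and may differ from what either dump actually uses; `label_triples` and `escape_literal` are hypothetical helper names, not part of WDTK or Wikibase.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SCHEMA_NAME = "http://schema.org/name"

# Assumption: entity namespace as written in the table above.
ENTITY_NS = "http://wikidata.org/entity/"


def escape_literal(text):
    """Escape the characters that must be escaped in an N-Triples string."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def label_triples(entity_id, labels, predicates):
    """Return one N-Triples line per (language, predicate) pair."""
    lines = []
    for lang, text in sorted(labels.items()):
        for predicate in predicates:
            lines.append(
                '<%s%s> <%s> "%s"@%s .'
                % (ENTITY_NS, entity_id, predicate, escape_literal(text), lang)
            )
    return lines


labels = {"en": "universe", "de": "Universum"}
one_per_language = label_triples("Q1", labels, [RDFS_LABEL])
three_per_language = label_triples(
    "Q1", labels, [RDFS_LABEL, SKOS_PREF_LABEL, SCHEMA_NAME]
)
```

With two languages this yields two lines in the first case and six in the second, which is the size difference the row describes.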
I will do a complete review of the updated RDF mapping in the course of the next week. I will report back then if anything is missing from the diff.
Also, what is the expected outcome of this bug? A table like the one posted by Lucie? Or something with more detail? Some rows in the current table are probably only understood by people who already know both dumps ;-) Is this meant to be only for our "internal" information?
Another relevant note here might be that the plan is to fully align the WDTK mappings with the updated RDF dumps, so that many of the above differences will go away (the split into several files would remain, though). We just did not do this while we were still discussing the updated RDF mapping.
- The difference in the second row is just a consequence of what was already stated in the first row (N-Triples vs. Turtle). Maybe the two rows can be merged, or the second one deleted.
- It seems that the entry in row "labels (&aliases&descriptions)" only refers to "labels". The properties "skos:prefLabel" and "schema:name" are not used for descriptions or aliases in either dump, AFAIK.
- It would make sense to distinguish differences in distribution/surface syntax (which format, how many files, which compression algorithm, ...) from real differences in the RDF model (=differences that matter for SPARQL users).
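As a rough illustration of the surface-syntax point in the list above: a Turtle prefixed name and a full N-Triples IRI denote the same term once the prefix is expanded, so the difference does not matter to SPARQL users. The sketch below assumes a small prefix table; the `e` prefix and its namespace are copied from the table in this task and are an assumption, and `expand` and `same_triple` are hypothetical helpers that only handle simple prefixed names, full IRIs, and literals passed through unchanged.

```python
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "schema": "http://schema.org/",
    # Assumption: namespace as written in the comparison table.
    "e": "http://wikidata.org/entity/",
}


def expand(term, prefixes=PREFIXES):
    """Turn a prefixed name into a full IRI; leave IRIs and literals alone."""
    if term.startswith("<") and term.endswith(">"):
        return term  # already a full IRI
    if term.startswith('"'):
        return term  # a literal, not an IRI
    prefix, sep, local = term.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError("unknown prefix in term: %r" % term)
    return "<" + prefixes[prefix] + local + ">"


def same_triple(a, b):
    """Compare two (subject, predicate, object) tuples after expansion."""
    return tuple(expand(t) for t in a) == tuple(expand(t) for t in b)


turtle_style = ("e:Q1", "rdfs:label", '"universe"@en')
ntriples_style = (
    "<http://wikidata.org/entity/Q1>",
    "<http://www.w3.org/2000/01/rdf-schema#label>",
    '"universe"@en',
)
```

Here `same_triple(turtle_style, ntriples_style)` is true, whereas a difference such as an extra skos:prefLabel triple would survive expansion and is therefore a real model difference.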
Correction: our dates in RDF are Gregorian only (xsd:dateTime), but the calendar is kept so you can display the date as Julian. The dates themselves should be Gregorian; if this doesn't work, that'd be a pretty major bug. We have code that represents a date as a string if it is really bad, but we may drop that eventually and just not represent such dates at all.
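For readers wondering what "display it as Julian" would involve on the consumer side, here is a minimal sketch that converts a Gregorian calendar date to the corresponding Julian calendar date via the Julian Day Number. It uses the standard integer arithmetic for these conversions and is not the Wikibase code; the function names are hypothetical, and it has only been checked against the two dates shown in the usage lines.

```python
def gregorian_to_jdn(year, month, day):
    """Julian Day Number of a (proleptic) Gregorian calendar date."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return (
        day
        + (153 * m + 2) // 5
        + 365 * y
        + y // 4
        - y // 100
        + y // 400
        - 32045
    )


def jdn_to_julian(jdn):
    """Julian calendar (year, month, day) for a Julian Day Number."""
    c = jdn + 32082
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = d - 4800 + m // 10
    return year, month, day


def gregorian_to_julian(year, month, day):
    """Convert a Gregorian date to the Julian calendar for display."""
    return jdn_to_julian(gregorian_to_jdn(year, month, day))


first_gregorian_day = gregorian_to_julian(1582, 10, 15)
october_revolution = gregorian_to_julian(1917, 11, 7)
```

The stored value stays Gregorian in this approach; only the rendering changes when the calendar model attached to the value says Julian.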
About GUIDs - we always use the actual GUID that's in the data, not sure what WDTK does.
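If statements from the two dumps ever need to be matched, one option is to normalize the IDs before comparing. The sketch below is based only on the two example IDs in the table, plus the assumption that the underlying statement ID separates the entity ID from the GUID with `$`; `normalize_statement_id` is a hypothetical helper, not something either tool provides.

```python
import re

# Entity ID, then one separator character ("S", "$" or "-"), then a GUID.
# Assumption: these three separators cover the forms seen in the table.
_STATEMENT_ID = re.compile(
    r"^([QqPp]\d+)[S$-]"
    r"([0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12})$"
)


def normalize_statement_id(raw):
    """Return (entity_id, guid), both uppercased, for comparison."""
    match = _STATEMENT_ID.match(raw)
    if match is None:
        raise ValueError("unrecognized statement ID: %r" % raw)
    entity_id, guid = match.groups()
    return entity_id.upper(), guid.upper()


wdtk_style = normalize_statement_id("Q1Sf5d5115d-489a-7654-9a0a-5eea5be80d07")
dump_style = normalize_statement_id("q1-0479EB23-FC5B-4EEC-9529-CEE21D6C6FA9")
```

Uppercasing is only for comparison here; it does not say which casing is the "actual" one in the data.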
About the classes - I think it's important to emphasize that assigning fixed classes to our properties (T97522), i.e. wikibase:Statement rdf:type owl:Class, and having support for converting something like P31 to rdf:type are entirely different things. Some people confuse them, which gets even more confusing because WDTK seems to have support for both. We need to clearly distinguish these.
For properties I'd link to https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Predicates since we have 8 forms of property predicate, not counting wdno:, which is kind of related to properties too but is a class for technical reasons. That is actually almost the same as what WDTK is doing, but I think the naming is different.
Thank you!
Also, what is the expected outcome of this bug? A table like the one posted by Lucie? Or something with more detail? Some rows in the current table are probably only understood by people who already know both dumps ;-) Is this meant to be only for our "internal" information?
This is for us internally to make sure we're all on the same page and are good with what we have. When we're further along we can check what kind of public-facing documentation we need.