For various use cases, such as setting up a quickly testable endpoint, it would be nice to have a truthy nt triple dump.
Note: What's included in this dump needs to be described on Wikidata:Database download or somewhere similar.
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T88728 Improve Wikimedia dumping infrastructure |
| Open | | None | T88991 improve Wikidata dumps [tracking] |
| Resolved | | hoo | T155103 Create a truthy nt dump |
| Open | | None | T162346 Include truthy nt dumps in the Wikidata Dump Downloads Grafana dashboard |
| Resolved | | Lokal_Profil | T163328 Add the truthy nt dump to dcat-AP |
| Resolved | | Lucie | T166461 Add documentation of truthy nt dumps |
There seems to be indeed a need for this.
I will look into this as soon as possible. The next steps will be to create a one-off dump and see how long it takes and how large it is.
An additional question is which compression we want to use here: gzip, bzip2 (the two we use for the current dumps), 7z, xz, zstd, …?
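For a rough feel of the trade-off, Python's standard library covers three of the candidates (gzip, bzip2, and xz via `lzma`; 7z and zstd would need third-party modules). A minimal sketch comparing compressed sizes on a synthetic N-Triples-style sample — the triple is illustrative, not from a real dump:

```python
import gzip, bz2, lzma

# Synthetic, highly repetitive N-Triples-style sample (illustrative only).
sample = ("<http://www.wikidata.org/entity/Q42> "
          "<http://www.wikidata.org/prop/direct/P31> "
          "<http://www.wikidata.org/entity/Q5> .\n" * 1000).encode("utf-8")

for name, compress in (("gzip", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("xz", lzma.compress)):
    print(f"{name}: {len(compress(sample))} bytes")
```

Real dump data is far less repetitive than this sample, so actual ratios will differ; the sketch only shows how to run the comparison.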
Shall this just include RdfProducer::PRODUCE_TRUTHY_STATEMENTS?
We potentially at least also want RdfProducer::PRODUCE_PROPERTIES ("Add entity definitions for properties used in the dump"), RdfProducer::PRODUCE_VERSION_INFO ("Produce metadata header containing software version info and copyright.") and RdfProducer::PRODUCE_NORMALIZED_VALUES ("Produce normalized values for values with units."). Possibly we also want RdfProducer::PRODUCE_RESOLVED_ENTITIES ("Produce definitions for all entities used in the dump"), although I'm not sure what the exact implications of that are.
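These flags combine as a bitmask. A hypothetical Python model of the combination being discussed — the names mirror the RdfProducer constants above, but the numeric values here are made up for illustration and are not Wikibase's real ones:

```python
from enum import IntFlag

# Hypothetical stand-in for Wikibase's RdfProducer bitmask constants;
# the names come from the discussion above, the values are illustrative.
class RdfFlavor(IntFlag):
    PRODUCE_TRUTHY_STATEMENTS = 1 << 0
    PRODUCE_PROPERTIES = 1 << 1
    PRODUCE_VERSION_INFO = 1 << 2
    PRODUCE_NORMALIZED_VALUES = 1 << 3
    PRODUCE_RESOLVED_ENTITIES = 1 << 4

# The combination sketched above for a truthy dump:
truthy_flavor = (RdfFlavor.PRODUCE_TRUTHY_STATEMENTS
                 | RdfFlavor.PRODUCE_PROPERTIES
                 | RdfFlavor.PRODUCE_VERSION_INFO
                 | RdfFlavor.PRODUCE_NORMALIZED_VALUES)

print(RdfFlavor.PRODUCE_RESOLVED_ENTITIES in truthy_flavor)  # False
```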
Maybe interesting: EntityDataSerializationService::getFlavor (although that doesn't have a "truthy" flavor).
It should include all the statements the ttl dump includes, i.e. flavor=dump. So, RdfProducer::PRODUCE_TRUTHY_STATEMENTS should be in. Property/entity resolution is not necessary for the dump, since all entities/properties are included anyway, by virtue of it being a full dump.
We're not talking about a full nt dump here (that's T144103), but just a truthy dump.
To answer all of those at once, to the best of my knowledge:
Thanks, looking forward to the truthy dump :D
Change 346636 had a related patch set uploaded (by Hoo man):
[mediawiki/extensions/Wikibase@master] dumpRdf: Allow creating truthy dumps
I would suggest to put the dumps like https://dumps.wikimedia.org/wikidatawiki/entities/20170403/wikidata-20170403-truthy-BETA.nt.gz (compared to https://dumps.wikimedia.org/wikidatawiki/entities/20170403/wikidata-20170403-all-BETA.ttl.gz for the current full ttl dump).
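The naming scheme is regular enough to construct programmatically. A sketch, assuming the `wikidata-<date>-<flavor>.<format>.<compression>` pattern visible in the URLs above (the helper name is ours):

```python
BASE = "https://dumps.wikimedia.org/wikidatawiki/entities"

def dump_url(date, flavor, fmt, compression="gz"):
    """Build a dump URL following the wikidata-<date>-<flavor>.<fmt>.<comp>
    pattern seen in the existing dump directory listing."""
    return f"{BASE}/{date}/wikidata-{date}-{flavor}.{fmt}.{compression}"

print(dump_url("20170403", "truthy-BETA", "nt"))
```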
Change 346636 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] dumpRdf: Allow creating truthy dumps
Change 347234 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Change dumpwikidatattl to allow producing other flavors
Change 347838 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Allow running two dumpwikidatattl dumps side by side
Change 347840 had a related patch set uploaded (by Hoo man):
[mediawiki/extensions/Wikibase@wmf/1.29.0-wmf.19] dumpRdf: Allow creating truthy dumps
Change 347840 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.29.0-wmf.19] dumpRdf: Allow creating truthy dumps
I did a (test) dump of test.wikidata.org just now: https://people.wikimedia.org/~hoo/tmp/testwikidata-20170412-truthy-BETA.nt.gz
Please have a look at it!
Thanks a lot for the help, it looks OK. Only a small question:
Is it normal to have UTF labels being skipped inside the ASCII like that? Can't we just output everything in UTF-8 or UTF-16?
<http://test.wikidata.org/entity/Q145> <http://www.w3.org/2004/02/skos/core#prefLabel> "\u30DD\u30BA\u30CA\u30F3"@ja .
Should the truthy dump also include full property definitions? If you use only /prop/direct/, there's not much use in including other predicates, though technically it doesn't hurt anything.
@Hadyelsahar Not sure what you mean by "skipped". It's normal for labels to be encoded with \u sequences, since not all tools can handle all Unicode properly, unfortunately. Any tool that reads TTL should be able to handle encoded sequences though.
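For anyone unsure how to read those sequences: the \uXXXX escapes are plain N-Triples string escapes, and any conformant parser turns them back into the original characters. A quick Python illustration using the literal from the test dump line above:

```python
# The escaped literal from the test dump line above.
escaped = r"\u30DD\u30BA\u30CA\u30F3"

# N-Triples \uXXXX escapes decode back to the original Unicode characters;
# Python's "unicode_escape" codec handles exactly this escape form.
decoded = escaped.encode("ascii").decode("unicode_escape")
print(decoded)  # → ポズナン
```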
I guess we don't strictly need it for now. Adding things later on is trivial, so we could just go with this and expand it if there's an actual need?
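As a concrete illustration of the distinction: truthy statement triples use Wikidata's /prop/direct/ predicate namespace, so a consumer can recognize them with a simple prefix check. A sketch (the namespace is Wikidata's; the helper name is ours):

```python
# Truthy ("direct") statements use Wikidata's /prop/direct/ namespace,
# in contrast to the reified statement model used by the full dump.
DIRECT_PREFIX = "<http://www.wikidata.org/prop/direct/"

def is_truthy_triple(line):
    """Return True if an N-Triples line has a direct-claim predicate."""
    parts = line.split(" ", 2)
    return len(parts) == 3 and parts[1].startswith(DIRECT_PREFIX)

line = ("<http://www.wikidata.org/entity/Q42> "
        "<http://www.wikidata.org/prop/direct/P31> "
        "<http://www.wikidata.org/entity/Q5> .")
print(is_truthy_triple(line))  # True
```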
Change 348095 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Wikidata entity dumps: Allow nt RDF dumps
Change 348096 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Create truthy nt Wikidata entity dump each Monday
Change 347234 merged by ArielGlenn:
[operations/puppet@production] Change dumpwikidatattl to allow producing other flavors
Change 347838 merged by ArielGlenn:
[operations/puppet@production] Allow running two dumpwikidatattl dumps side by side
Change 348095 merged by ArielGlenn:
[operations/puppet@production] Wikidata entity dumps: Allow nt RDF dumps
Change 348096 merged by ArielGlenn:
[operations/puppet@production] Create truthy nt Wikidata entity dump each Monday
The first truthy nt dump should appear next Tuesday (probably late UTC).
I'll keep this open until we actually have it.
The first truthy nt dump can be found at https://dumps.wikimedia.org/wikidatawiki/entities/20170418/. New truthy nt dumps will appear weekly on https://dumps.wikimedia.org/wikidatawiki/entities/ at about the same time (mid to late Wednesday UTC).
It is considerably smaller than the full dump in file size (especially considering that this is an nt dump, not a ttl one). I don't have any numbers regarding the number of triples in each dump type, but I expect the truthy dump to have considerably fewer triples than the full dump.
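If anyone wants hard numbers, N-Triples puts one statement per line, so counting triples is trivial. A sketch that counts triples in a gzipped dump — the demo writes a tiny sample file; point `count_triples` at a real downloaded dump instead:

```python
import gzip
import os
import tempfile

def count_triples(path):
    """Count non-blank, non-comment lines: one triple per line in N-Triples."""
    n = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                n += 1
    return n

# Tiny self-contained demo file standing in for a real dump.
demo = os.path.join(tempfile.gettempdir(), "demo-truthy.nt.gz")
with gzip.open(demo, "wt", encoding="utf-8") as fh:
    fh.write("# a comment line\n")
    fh.write("<a> <b> <c> .\n<d> <e> <f> .\n")

print(count_triples(demo))  # 2
```

Note this is a line count, not an RDF-aware parse; for the well-formed dump files the two coincide.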