Create a truthy nt dump
Closed, Resolved (Public)

Description

For various use cases, such as setting up a quickly testable endpoint, it would be nice to have a truthy nt triple dump.

Note: What's included in this dump needs to be described on Wikidata:Database download or somewhere similar.

Event Timeline

There seems to be indeed a need for this.

I will look into this as soon as possible. The next steps will be to create a one-off dump and see how long that takes and how large it is.
An additional question is which compression we want to use here: gzip, bzip2 (these are the two we have for the current dumps), 7z, xz, zstd, …?

Shall this just include RdfProducer::PRODUCE_TRUTHY_STATEMENTS?

At a minimum, we potentially also want RdfProducer::PRODUCE_PROPERTIES ("Add entity definitions for properties used in the dump"), RdfProducer::PRODUCE_VERSION_INFO ("Produce metadata header containing software version info and copyright.") and RdfProducer::PRODUCE_NORMALIZED_VALUES ("Produce normalized values for values with units."). Possibly we also want RdfProducer::PRODUCE_RESOLVED_ENTITIES ("Produce definitions for all entities used in the dump"), although I'm not sure what exactly the implications of that are.

Maybe interesting: EntityDataSerializationService::getFlavor (although that doesn't have a "truthy" flavor).

It should include all the statements the ttl dump includes, i.e. flavor=dump. So, RdfProducer::PRODUCE_TRUTHY_STATEMENTS should be in. Property/entity resolution is not necessary for the dump, since all entities/properties are included anyway, by virtue of it being a full dump.

We're not talking about a full nt dump here (that's T144103), but just a truthy dump.

To answer all of those to my best knowledge at once:

  • bzip2 should be fine
  • Including property definitions will most likely be very handy, though I am not sure what that means exactly. Does it just include all statements for properties? Then yes!
  • Version info absolutely
  • Normalized values make a lot of sense
  • Not sure what resolved entities should do, so I can't say whether it makes sense

Thanks, looking forward to the truthy dump :D

Change 346636 had a related patch set uploaded (by Hoo man):
[mediawiki/extensions/Wikibase@master] dumpRdf: Allow creating truthy dumps

https://gerrit.wikimedia.org/r/346636

I would suggest naming the dumps like https://dumps.wikimedia.org/wikidatawiki/entities/20170403/wikidata-20170403-truthy-BETA.nt.gz (compared to https://dumps.wikimedia.org/wikidatawiki/entities/20170403/wikidata-20170403-all-BETA.ttl.gz for the current full ttl dump).

Change 346636 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] dumpRdf: Allow creating truthy dumps

https://gerrit.wikimedia.org/r/346636

Change 347234 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Change dumpwikidatattl to allow producing other flavors

https://gerrit.wikimedia.org/r/347234

Change 347838 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Allow running two dumpwikidatattl dumps side by side

https://gerrit.wikimedia.org/r/347838

Change 347840 had a related patch set uploaded (by Hoo man):
[mediawiki/extensions/Wikibase@wmf/1.29.0-wmf.19] dumpRdf: Allow creating truthy dumps

https://gerrit.wikimedia.org/r/347840

Change 347840 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.29.0-wmf.19] dumpRdf: Allow creating truthy dumps

https://gerrit.wikimedia.org/r/347840

Thanks a lot for the help.

It looks OK, only a small question.

Is it normal to have UTF labels being skipped inside the ASCII like that? Can't we just output everything in UTF-8 or UTF-16?

<http://test.wikidata.org/entity/Q145> <http://www.w3.org/2004/02/skos/core#prefLabel> "\u30DD\u30BA\u30CA\u30F3"@ja .

Should the truthy dump also include full property definitions? Because if you use only /prop/direct/ there's not much use in including other predicates, though technically it doesn't hurt anything.

@Hadyelsahar Not sure what you mean by "skipped". It's normal for labels to be encoded with \u sequences, since not all tools can handle all Unicode properly, unfortunately. Any tool that reads TTL should be able to handle encoded sequences though.
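For readers who want the labels back as plain Unicode, the \u sequences can be decoded after reading a line. The following is only a minimal Python sketch, not part of Wikibase or the dump tooling: the function name decode_uchar is made up for illustration, and it handles just the numeric \uXXXX and \UXXXXXXXX escapes, leaving every other escape (such as \" or \\) untouched.

```python
import re

# One pattern for every backslash escape, so that an escaped backslash
# followed by "u" (i.e. \\u0041) is consumed as a pair and not misread
# as a numeric escape.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))")


def decode_uchar(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they encode.

    Hypothetical helper for illustration; other escapes are left as they are.
    """
    def _sub(match: re.Match) -> str:
        if match.group(3) is not None:
            # Not a numeric escape (e.g. \" or \\): keep it unchanged.
            return match.group(0)
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(_sub, text)


# The triple quoted earlier in this thread.
line = (
    "<http://test.wikidata.org/entity/Q145> "
    "<http://www.w3.org/2004/02/skos/core#prefLabel> "
    '"\\u30DD\\u30BA\\u30CA\\u30F3"@ja .'
)
print(decode_uchar(line))
```

Decoding only the numeric escapes keeps the line a valid triple, since quotes and backslashes inside the literal stay escaped.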

Should the truthy dump also include full property definitions? Because if you use only /prop/direct/ there's not much use in including other predicates, though technically it doesn't hurt anything.

I guess we don't strictly need it for now. Adding things later on is trivial, so we could just go with this and expand it if there's an actual need for these?

Change 348095 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Wikidata entity dumps: Allow nt RDF dumps

https://gerrit.wikimedia.org/r/348095

Change 348096 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Create truthy nt Wikidata entity dump each Monday

https://gerrit.wikimedia.org/r/348096

Change 347234 merged by ArielGlenn:
[operations/puppet@production] Change dumpwikidatattl to allow producing other flavors

https://gerrit.wikimedia.org/r/347234

Change 347838 merged by ArielGlenn:
[operations/puppet@production] Allow running two dumpwikidatattl dumps side by side

https://gerrit.wikimedia.org/r/347838

Change 348095 merged by ArielGlenn:
[operations/puppet@production] Wikidata entity dumps: Allow nt RDF dumps

https://gerrit.wikimedia.org/r/348095

Change 348096 merged by ArielGlenn:
[operations/puppet@production] Create truthy nt Wikidata entity dump each Monday

https://gerrit.wikimedia.org/r/348096

The first truthy nt dump should appear next Tuesday (probably late UTC).

I'll keep this open until we actually have it.

@hoo Your help is much appreciated :) Thanks a lot.

hoo removed a project: Patch-For-Review.

The first truthy nt dump can be found at https://dumps.wikimedia.org/wikidatawiki/entities/20170418/. New truthy nt dumps will appear weekly on https://dumps.wikimedia.org/wikidatawiki/entities/ at about the same time (mid to late Wednesday UTC).

It is considerably smaller than the full dump in file size (especially considering that this is an nt dump, not a ttl one). I don't have any numbers regarding the number of triples in each dump type, but I expect the truthy dump to have considerably fewer triples than the full dump.
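Since N-Triples puts one statement on each line, the triple count of a downloaded dump can be obtained by streaming it and counting lines. The following is a rough Python sketch under that assumption. The helper names open_dump and count_triples are invented for illustration, and the file name in the usage note is only an example of what a local copy might be called.

```python
import bz2
import gzip
from typing import IO, Iterable


def open_dump(path: str) -> IO[str]:
    """Open a .gz, .bz2 or uncompressed dump file as UTF-8 text.

    Hypothetical helper; the compression is guessed from the file extension.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def count_triples(lines: Iterable[str]) -> int:
    """Count N-Triples statements: one per line that is neither blank nor a comment."""
    count = 0
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count
```

With a local copy of a dump, something like `with open_dump("wikidata-truthy.nt.gz") as dump: print(count_triples(dump))` would stream the file without loading it into memory. Running the same count over an nt version of the full dump would give the comparison mentioned above.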