Create a truthy nt dump
Closed, Resolved (Public)

Description

For various use cases, such as setting up a quickly testable endpoint, it would be nice to have a truthy nt triple dump.

Note: What's included in this dump needs to be described on Wikidata:Database download or somewhere similar.

Details

Related Gerrit Patches:
operations/puppet : production : Create truthy nt Wikidata entity dump each Monday
operations/puppet : production : Wikidata entity dumps: Allow nt RDF dumps
operations/puppet : production : Allow running two dumpwikidatattl dumps side by side
operations/puppet : production : Change dumpwikidatattl to allow producing other flavors
mediawiki/extensions/Wikibase : wmf/1.29.0-wmf.19 : dumpRdf: Allow creating truthy dumps
mediawiki/extensions/Wikibase : master : dumpRdf: Allow creating truthy dumps

Event Timeline

Lucie created this task.Jan 11 2017, 6:20 PM
hoo claimed this task.Mar 17 2017, 6:40 AM

There does indeed seem to be a need for this.

I will look into this as soon as possible. The next steps will be to create a one-off dump and see how long that takes and how large it is.
An additional question is which compression we want to use here: gzip, bzip2 (these are the two we have for the current dumps), 7z, xz, zstd, …?

hoo added a comment.Mar 27 2017, 2:40 PM

Shall this just include RdfProducer::PRODUCE_TRUTHY_STATEMENTS?

At the very least we potentially also want RdfProducer::PRODUCE_PROPERTIES ("Add entity definitions for properties used in the dump"), RdfProducer::PRODUCE_VERSION_INFO ("Produce metadata header containing software version info and copyright.") and RdfProducer::PRODUCE_NORMALIZED_VALUES ("Produce normalized values for values with units."). Possibly we also want RdfProducer::PRODUCE_RESOLVED_ENTITIES ("Produce definitions for all entities used in the dump"), although I'm not sure what exactly the implications of that are.

Maybe interesting: EntityDataSerializationService::getFlavor (although that doesn't have a "truthy" flavor).
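
To make the question above concrete, here is a minimal sketch of how such a set of producer options could be combined into one "flavor" bitmask. It is written in Python purely for illustration; the `Produce` enum, its bit values and the `truthy_flavor` helper are hypothetical and only mirror the RdfProducer constant names quoted above, not the actual Wikibase code.

```python
from enum import IntFlag, auto


class Produce(IntFlag):
    # Hypothetical bit values. The real constants are defined in Wikibase's
    # RdfProducer and may use different numbers.
    TRUTHY_STATEMENTS = auto()
    PROPERTIES = auto()
    VERSION_INFO = auto()
    NORMALIZED_VALUES = auto()
    RESOLVED_ENTITIES = auto()


def truthy_flavor(include_resolved: bool = False) -> Produce:
    """Combine the options discussed above into a single bitmask."""
    flags = (
        Produce.TRUTHY_STATEMENTS
        | Produce.PROPERTIES
        | Produce.VERSION_INFO
        | Produce.NORMALIZED_VALUES
    )
    if include_resolved:
        # Still an open question in this task, hence opt-in.
        flags |= Produce.RESOLVED_ENTITIES
    return flags


def is_enabled(flavor: Produce, flag: Produce) -> bool:
    """Return True if the given option is part of the flavor."""
    return bool(flavor & flag)
```

With a layout like this, adding or dropping an option later only changes one line of the flavor definition.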

It should include all the statements the ttl dump includes, i.e. flavor=dump. So, RdfProducer::PRODUCE_TRUTHY_STATEMENTS should be in. Property/entity resolution is not necessary for the dump, since all entities/properties are included anyway, by virtue of it being a full dump.

hoo added a comment.Mar 28 2017, 12:18 PM

It should include all the statements the ttl dump includes, i.e. flavor=dump. So, RdfProducer::PRODUCE_TRUTHY_STATEMENTS should be in. Property/entity resolution is not necessary for the dump, since all entities/properties are included anyway, by virtue of it being a full dump.

We're not talking about a full nt dump here (that's T144103), but just a truthy dump.

Lucie added a comment.Mar 28 2017, 1:22 PM

To answer all of those to my best knowledge at once:

  • bzip2 should be fine
  • Including property definitions will most likely be very handy, though I am not sure what that means exactly. Does it just include all statements for properties? Then yes!
  • Version info absolutely
  • Normalized values make a lot of sense
  • Not sure what resolved entities should do, so I can't say whether it makes sense

Thanks, looking forward to the truthy dump :D

Change 346636 had a related patch set uploaded (by Hoo man):
[mediawiki/extensions/Wikibase@master] dumpRdf: Allow creating truthy dumps

https://gerrit.wikimedia.org/r/346636

hoo updated the task description. (Show Details)Apr 6 2017, 9:19 AM
hoo added a comment.Apr 6 2017, 10:55 AM

I would suggest putting the dumps at a location like https://dumps.wikimedia.org/wikidatawiki/entities/20170403/wikidata-20170403-truthy-BETA.nt.gz (compared to https://dumps.wikimedia.org/wikidatawiki/entities/20170403/wikidata-20170403-all-BETA.ttl.gz for the current full ttl dump).

Change 346636 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] dumpRdf: Allow creating truthy dumps

https://gerrit.wikimedia.org/r/346636

Change 347234 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Change dumpwikidatattl to allow producing other flavors

https://gerrit.wikimedia.org/r/347234

Change 347838 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Allow running two dumpwikidatattl dumps side by side

https://gerrit.wikimedia.org/r/347838

Change 347840 had a related patch set uploaded (by Hoo man):
[mediawiki/extensions/Wikibase@wmf/1.29.0-wmf.19] dumpRdf: Allow creating truthy dumps

https://gerrit.wikimedia.org/r/347840

Change 347840 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.29.0-wmf.19] dumpRdf: Allow creating truthy dumps

https://gerrit.wikimedia.org/r/347840

hoo added a comment.Apr 12 2017, 1:58 PM

I did a (test) dump of test.wikidata.org just now: https://people.wikimedia.org/~hoo/tmp/testwikidata-20170412-truthy-BETA.nt.gz

Please have a look at it!

Thanks a lot for the help.

It looks OK, only a small question.

Is it normal to have UTF labels being skipped inside the ASCII like that? Can't we just output everything in UTF-8 or UTF-16?

<http://test.wikidata.org/entity/Q145> <http://www.w3.org/2004/02/skos/core#prefLabel> "\u30DD\u30BA\u30CA\u30F3"@ja .

Should truthy dump also include full property definitions? Because if you use only /prop/direct/ there's not much use to include other predicates, though technically it doesn't hurt anything.

@Hadyelsahar Not sure what you mean by "skipped". It's normal for labels to be encoded with \u sequences, since not all tools can handle all Unicode properly, unfortunately. Any tool that reads TTL should be able to handle encoded sequences though.
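
For anyone who wants the readable labels back, here is a minimal sketch of decoding the `\uXXXX` and `\UXXXXXXXX` sequences in a line of the dump. The `decode_uchars` helper is a hypothetical name, not part of any Wikibase tooling, and it deliberately leaves other backslash escapes (such as `\"` or `\\`) untouched.

```python
import re

# Matches a backslash followed by either a \uXXXX / \UXXXXXXXX numeric escape
# or any other single character (so that "\\" is consumed as one unit and a
# following "u30DD" is not mistaken for an escape).
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))")


def decode_uchars(text: str) -> str:
    """Replace numeric Unicode escapes with the characters they encode."""

    def repl(match: "re.Match[str]") -> str:
        hex_digits = match.group(1) or match.group(2)
        if hex_digits is None:
            # Some other escape such as \" or \\ : keep it unchanged.
            return match.group(0)
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(repl, text)


if __name__ == "__main__":
    line = (
        r"<http://test.wikidata.org/entity/Q145> "
        r"<http://www.w3.org/2004/02/skos/core#prefLabel> "
        r'"\u30DD\u30BA\u30CA\u30F3"@ja .'
    )
    print(decode_uchars(line))
```

Applied to the triple quoted above, the label literal becomes "ポズナン"@ja.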

hoo added a comment.Apr 13 2017, 2:56 PM

Should truthy dump also include full property definitions? Because if you use only /prop/direct/ there's not much use to include other predicates, though technically it doesn't hurt anything.

I guess we don't strictly need it for now. Adding things later on is trivial, so we could just go with this and expand it if there's an actual need for these?

Change 348095 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Wikidata entity dumps: Allow nt RDF dumps

https://gerrit.wikimedia.org/r/348095

Change 348096 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Create truthy nt Wikidata entity dump each Monday

https://gerrit.wikimedia.org/r/348096

Change 347234 merged by ArielGlenn:
[operations/puppet@production] Change dumpwikidatattl to allow producing other flavors

https://gerrit.wikimedia.org/r/347234

Change 347838 merged by ArielGlenn:
[operations/puppet@production] Allow running two dumpwikidatattl dumps side by side

https://gerrit.wikimedia.org/r/347838

Change 348095 merged by ArielGlenn:
[operations/puppet@production] Wikidata entity dumps: Allow nt RDF dumps

https://gerrit.wikimedia.org/r/348095

Change 348096 merged by ArielGlenn:
[operations/puppet@production] Create truthy nt Wikidata entity dump each Monday

https://gerrit.wikimedia.org/r/348096

hoo added a comment.Apr 13 2017, 5:33 PM

The first truthy nt dump should appear next Tuesday (probably late UTC).

I'll keep this open until we actually have it.

@hoo Your help is much appreciated :) thanks a lot

hoo closed this task as Resolved.Apr 19 2017, 4:51 PM
hoo removed a project: Patch-For-Review.

The first truthy nt dump can be found at https://dumps.wikimedia.org/wikidatawiki/entities/20170418/. New truthy nt dumps will appear weekly on https://dumps.wikimedia.org/wikidatawiki/entities/ at about the same time (mid to late Wednesday UTC).

It is considerably smaller than the full dump in file size (especially considering that this is an nt dump, not a ttl one). I don't have any numbers regarding the number of triples in each dump type, but I expect the truthy dump to have considerably fewer triples than the full dump.
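
Since there are no triple counts yet, here is a minimal sketch of how one could count them by streaming a downloaded, gzip-compressed nt dump. The `count_triples` helper and the local file name in the usage line are hypothetical; the sketch assumes one triple per line, as in the nt format, and additionally counts predicates under http://www.wikidata.org/prop/direct/.

```python
import gzip
from collections import Counter

# Predicate namespace used for truthy ("direct") statements.
DIRECT_PREFIX = "<http://www.wikidata.org/prop/direct/"


def count_triples(path: str) -> Counter:
    """Stream a gzipped nt file and count all triples and direct-claim triples."""
    counts = Counter(total=0, direct=0)
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            counts["total"] += 1
            # Subject and predicate never contain whitespace, so the
            # predicate is the second whitespace-separated token.
            parts = line.split(None, 2)
            if len(parts) == 3 and parts[1].startswith(DIRECT_PREFIX):
                counts["direct"] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical local file name): python count.py truthy.nt.gz
    if len(sys.argv) > 1:
        print(count_triples(sys.argv[1]))
```

Reading line by line keeps memory use flat regardless of dump size, at the cost of a full pass over the file.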

Lucie awarded a token.May 28 2017, 9:45 AM
ArielGlenn moved this task from Backlog to Done on the Dumps-Generation board.Jun 19 2017, 9:38 AM