Create a truthy nt dump
Closed, Resolved (Public)
Assigned To

Authored By

	Lucie
	Jan 11 2017, 6:20 PM

Description

For various use cases, such as setting up a quickly testable endpoint, it would be nice to have a truthy nt triple dump.

Note: What's included in this dump needs to be described on Wikidata:Database download or somewhere similar.

Details

Subject	Repo	Branch	Lines +/-
Create truthy nt Wikidata entity dump each Monday	operations/puppet	production	+3 -2
Wikidata entity dumps: Allow nt RDF dumps	operations/puppet	production	+25 -23
Allow running two dumpwikidatattl dumps side by side	operations/puppet	production	+8 -8
Change dumpwikidatattl to allow producing other flavors	operations/puppet	production	+24 -3
dumpRdf: Allow creating truthy dumps	mediawiki/extensions/Wikibase	wmf/1.29.0-wmf.19	+176 -28
dumpRdf: Allow creating truthy dumps	mediawiki/extensions/Wikibase	master	+176 -28

Related Objects

Status	Assigned	Task
Open	None	T88728 Improve Wikimedia dumping infrastructure
Open	None	T88991 improve Wikidata dumps [tracking]
Resolved	hoo	T155103 Create a truthy nt dump
Open	None	T162346 Include truthy nt dumps in the Wikidata Dump Downloads Grafana dashboard
Resolved	Lokal_Profil	T163328 Add the truthy nt dump to dcat-AP
Resolved	Lucie	T166461 Add documentation of truthy nt dumps

Event Timeline

Lucie created this task.Jan 11 2017, 6:20 PM

There does indeed seem to be a need for this.

I will look into this as soon as possible. The next steps will be to create a one-off dump and see how long that takes and how large it is.
An additional question is which compression we want to use here: gzip, bzip2 (these are the two we have for the current dumps), 7z, xz, zstd, …?

Hydriz added a project: User-Hydriz.Mar 21 2017, 4:19 AM

Shall this just include RdfProducer::PRODUCE_TRUTHY_STATEMENTS?

We probably also want at least RdfProducer::PRODUCE_PROPERTIES ("Add entity definitions for properties used in the dump"), RdfProducer::PRODUCE_VERSION_INFO ("Produce metadata header containing software version info and copyright.") and RdfProducer::PRODUCE_NORMALIZED_VALUES ("Produce normalized values for values with units."). Possibly we also want RdfProducer::PRODUCE_RESOLVED_ENTITIES ("Produce definitions for all entities used in the dump"), although I'm not sure what exactly the implications of that are.

Maybe interesting: EntityDataSerializationService::getFlavor (although that doesn't have a "truthy" flavor).
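To make the flag discussion concrete, here is a minimal Python sketch of how such a flavor could be composed, assuming the RdfProducer constants combine as bit flags. The class, the helper and the numeric values below are hypothetical and only mirror the constant names quoted above; they are not the actual Wikibase definitions.

```python
from enum import IntFlag


class Produce(IntFlag):
    """Hypothetical stand-ins for the RdfProducer::PRODUCE_* constants.

    The numeric values are illustrative only, not the real Wikibase values.
    """
    TRUTHY_STATEMENTS = 1
    PROPERTIES = 2
    VERSION_INFO = 4
    NORMALIZED_VALUES = 8
    RESOLVED_ENTITIES = 16


def truthy_flavor(include_resolved: bool = False) -> Produce:
    """Combine the flags discussed in this comment into one flavor value."""
    flags = (
        Produce.TRUTHY_STATEMENTS
        | Produce.PROPERTIES
        | Produce.VERSION_INFO
        | Produce.NORMALIZED_VALUES
    )
    if include_resolved:
        # Still an open question in this thread, so off by default.
        flags |= Produce.RESOLVED_ENTITIES
    return flags


flavor = truthy_flavor()
print(Produce.TRUTHY_STATEMENTS in flavor)   # True
print(Produce.RESOLVED_ENTITIES in flavor)   # False
```

Keeping the resolved-entities flag behind a parameter reflects that its inclusion is undecided at this point in the discussion.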

It should include all the statements the ttl dump includes, i.e. flavor=dump. So, RdfProducer::PRODUCE_TRUTHY_STATEMENTS should be in. Property/entity resolution is not necessary for the dump, since all entities/properties are included anyway, by virtue of it being a full dump.

Hadyelsahar subscribed.Mar 28 2017, 11:20 AM

In T155103#3134531, @Smalyshev wrote:

It should include all the statements the ttl dump includes, i.e. flavor=dump. So, RdfProducer::PRODUCE_TRUTHY_STATEMENTS should be in. Property/entity resolution is not necessary for the dump, since all entities/properties are included anyway, by virtue of it being a full dump.

We're not talking about a full nt dump here (that's T144103), but just a truthy dump.

To answer all of those to my best knowledge at once:

bzip2 should be fine.
Including property definitions will most likely be very handy, though I am not sure what that means exactly. Does it just include all statements for properties? Then yes!
Version info: absolutely.
Normalized values make a lot of sense.
I'm not sure what resolved entities are, so I can't say whether including them makes sense.

Thanks, looking forward to the truthy dump :D

Change 346636 had a related patch set uploaded (by Hoo man):
[mediawiki/extensions/Wikibase@master] dumpRdf: Allow creating truthy dumps

https://gerrit.wikimedia.org/r/346636

gerritbot added a project: Patch-For-Review.Apr 5 2017, 9:07 PM

hoo created subtask T162346: Include truthy nt dumps in the Wikidata Dump Downloads Grafana dashboard.Apr 6 2017, 9:11 AM

hoo updated the task description.Apr 6 2017, 9:19 AM

I would suggest to put the dumps like https://dumps.wikimedia.org/wikidatawiki/entities/20170403/wikidata-20170403-truthy-BETA.nt.gz (compared to https://dumps.wikimedia.org/wikidatawiki/entities/20170403/wikidata-20170403-all-BETA.ttl.gz for the current full ttl dump).
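The naming scheme suggested here can be written down as a small helper. This is only a sketch in Python: the function name and its parameters are made up for illustration, and the pattern is inferred from the two example URLs in the comment above, not taken from the dump scripts.

```python
BASE = "https://dumps.wikimedia.org/wikidatawiki/entities"


def dump_url(date: str, flavor: str, fmt: str, compression: str = "gz") -> str:
    """Build a dump URL following the pattern of the two example URLs.

    date is YYYYMMDD, flavor is e.g. "truthy" or "all", fmt is "nt" or "ttl".
    """
    return f"{BASE}/{date}/wikidata-{date}-{flavor}-BETA.{fmt}.{compression}"


print(dump_url("20170403", "truthy", "nt"))
print(dump_url("20170403", "all", "ttl"))
```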

Change 346636 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] dumpRdf: Allow creating truthy dumps

https://gerrit.wikimedia.org/r/346636

Change 347234 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Change dumpwikidatattl to allow producing other flavors

https://gerrit.wikimedia.org/r/347234

Change 347838 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Allow running two dumpwikidatattl dumps side by side

https://gerrit.wikimedia.org/r/347838

Change 347840 had a related patch set uploaded (by Hoo man):
[mediawiki/extensions/Wikibase@wmf/1.29.0-wmf.19] dumpRdf: Allow creating truthy dumps

https://gerrit.wikimedia.org/r/347840

Change 347840 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.29.0-wmf.19] dumpRdf: Allow creating truthy dumps

https://gerrit.wikimedia.org/r/347840

I did a (test) dump of test.wikidata.org just now: https://people.wikimedia.org/~hoo/tmp/testwikidata-20170412-truthy-BETA.nt.gz

Please have a look at it!

Thanks a lot for the help.

It looks OK, only a small question.

Is it normal to have UTF labels being skipped inside the ASCII like that? Can't we just output everything in UTF-8 or UTF-16?

<http://test.wikidata.org/entity/Q145> <http://www.w3.org/2004/02/skos/core#prefLabel> "\u30DD\u30BA\u30CA\u30F3"@ja .

Should the truthy dump also include full property definitions? Because if you use only /prop/direct/ there's not much use to include other predicates, though technically it doesn't hurt anything.

@Hadyelsahar Not sure what you mean by "skipped". It's normal for labels to be encoded with \u sequences, since not all tools can handle all Unicode properly, unfortunately. Any tool that reads TTL should be able to handle encoded sequences though.
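For consumers that want the readable label back, the escapes can be decoded on the reading side. The following is a minimal Python sketch, not part of the dump tooling: the function name is made up, and it handles the \uXXXX and \UXXXXXXXX escapes plus the single-character escapes that N-Triples string literals allow.

```python
import re

# One pass over all escapes, so an escaped backslash followed by "u0041"
# is not mistaken for a \u escape.
_ESCAPE = re.compile(r'\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|([tbnrf"\'\\]))')
_ECHAR = {"t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
          '"': '"', "'": "'", "\\": "\\"}


def unescape_ntriples(text: str) -> str:
    """Decode N-Triples string escapes into the characters they stand for."""
    def repl(match: re.Match) -> str:
        if match.group(3) is not None:
            return _ECHAR[match.group(3)]
        # Exactly one of the two hex groups matched.
        return chr(int(match.group(1) or match.group(2), 16))
    return _ESCAPE.sub(repl, text)


# The label from the test dump line quoted above.
print(unescape_ntriples(r"\u30DD\u30BA\u30CA\u30F3"))  # ポズナン
```

Any RDF library that parses N-Triples should do this decoding itself; the sketch is only meant to show that no information is lost by the escaping.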

In T155103#3177252, @Smalyshev wrote:

Should the truthy dump also include full property definitions? Because if you use only /prop/direct/ there's not much use to include other predicates, though technically it doesn't hurt anything.

I guess we don't strictly need it for now. Adding things later on is trivial, so we could just go with this and expand it if there's an actual need for these?

Change 348095 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Wikidata entity dumps: Allow nt RDF dumps

https://gerrit.wikimedia.org/r/348095

Change 348096 had a related patch set uploaded (by Hoo man):
[operations/puppet@production] Create truthy nt Wikidata entity dump each Monday

https://gerrit.wikimedia.org/r/348096

Change 347234 merged by ArielGlenn:
[operations/puppet@production] Change dumpwikidatattl to allow producing other flavors

https://gerrit.wikimedia.org/r/347234

Change 347838 merged by ArielGlenn:
[operations/puppet@production] Allow running two dumpwikidatattl dumps side by side

https://gerrit.wikimedia.org/r/347838

Change 348095 merged by ArielGlenn:
[operations/puppet@production] Wikidata entity dumps: Allow nt RDF dumps

https://gerrit.wikimedia.org/r/348095

Change 348096 merged by ArielGlenn:
[operations/puppet@production] Create truthy nt Wikidata entity dump each Monday

https://gerrit.wikimedia.org/r/348096

The first truthy nt dump should appear next Tuesday (probably late UTC).

I'll keep this open until we actually have it.

hoo created subtask T163328: Add the truthy nt dump to dcat-AP.Apr 19 2017, 2:23 PM

@hoo Your help is much appreciated :) thanks a lot

The first truthy nt dump can be found at https://dumps.wikimedia.org/wikidatawiki/entities/20170418/. New truthy nt dumps will appear weekly on https://dumps.wikimedia.org/wikidatawiki/entities/ at about the same time (mid to late Wednesday UTC).

It is considerably smaller than the full dump in file size (especially considering that this is an nt dump, not a ttl one). I don't have any numbers regarding the number of triples in each dump type, but I expect the truthy dump to have considerably fewer triples than the full dump.
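Since no triple counts are available yet, anyone curious could get them by streaming a downloaded dump. Below is a rough Python sketch under a few assumptions: the file is a gzip-compressed nt dump with one triple per line, the function name and the local path are hypothetical, and the /prop/direct/ prefix is the production Wikidata one (the test.wikidata.org dump uses a different host).

```python
import gzip
from collections import Counter

# Assumed predicate namespace for truthy statements on production Wikidata.
DIRECT_PREFIX = "<http://www.wikidata.org/prop/direct/"


def count_triples(path: str) -> Counter:
    """Stream a gzip-compressed N-Triples file and count its triples.

    Returns a Counter with "total" (all triples) and "direct"
    (triples whose predicate is in the /prop/direct/ namespace).
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            counts["total"] += 1
            # Subject and predicate never contain whitespace in N-Triples.
            parts = line.split(None, 2)
            if len(parts) == 3 and parts[1].startswith(DIRECT_PREFIX):
                counts["direct"] += 1
    return counts
```

Usage would be something like `count_triples("wikidata-20170418-truthy-BETA.nt.gz")` on a locally downloaded file. The dump is read line by line, so memory use stays flat regardless of file size.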

Lea_Lacroix_WMDE subscribed.Apr 20 2017, 8:29 AM

Lucie awarded a token.May 28 2017, 9:45 AM

Lucie created subtask T166461: Add documentation of truthy nt dumps.May 28 2017, 9:49 AM

Lydia_Pintscher closed subtask T166461: Add documentation of truthy nt dumps as Resolved.Jun 11 2017, 5:00 PM

ArielGlenn moved this task from Backlog to Done on the Dumps-Generation board.Jun 19 2017, 9:38 AM

hoo added a parent task: T88991: improve Wikidata dumps [tracking].Apr 10 2018, 2:15 PM

Lokal_Profil closed subtask T163328: Add the truthy nt dump to dcat-AP as Resolved.Jul 3 2018, 6:27 AM