
Create .nt (NTriples) dumps for wikidata data
Closed, ResolvedPublic

Description

Currently we have dumps in JSON and Turtle (.ttl) formats. It may be useful to have dumps in .nt (NTriples) format since this format is line-based and much easier to process with non-RDF tools.
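As a sketch of what "line-based" buys you (the file name and triples below are made up for illustration): each N-Triples statement is exactly one line, so standard text tools like grep and awk work without any RDF parser:

```shell
# Build a tiny sample .nt file (hypothetical triples, for illustration only).
cat > sample.nt <<'EOF'
<http://www.wikidata.org/entity/Q42> <http://schema.org/name> "Douglas Adams"@en .
<http://www.wikidata.org/entity/Q42> <http://schema.org/description> "author"@en .
<http://www.wikidata.org/entity/Q64> <http://schema.org/name> "Berlin"@en .
EOF

# Count statements about one entity -- plain text matching, no RDF parser.
grep -c '^<http://www.wikidata.org/entity/Q42> ' sample.nt   # → 2

# List the distinct subjects.
awk '{print $1}' sample.nt | sort -u
```

On a real dump one would stream through the compressed file (e.g. `zcat dump.nt.gz | grep …`) rather than decompressing to disk first.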

Thus, I think it makes sense to create such a dump in https://dumps.wikimedia.org/wikidatawiki/entities/

The feedback about the idea on wikidata list was largely positive.

In fact, since N-Triples is a subset of Turtle, we may want to phase out the .ttl dumps eventually and have only .nt dumps, but this does not have to happen in one step. It may be better to have both for a while to give people a chance to adapt their tools and test them on the .nt dump.
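To make the subset relationship concrete, here is the same data (illustrative, not taken from an actual dump) written both ways; the N-Triples form spells every term out in full, one statement per line, and is itself also valid Turtle:

```
# Turtle: prefixes and predicate lists allowed.
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix schema: <http://schema.org/> .
wd:Q64 schema:name "Berlin"@en ;
       schema:description "capital of Germany"@en .

# N-Triples: every term written in full, one triple per line.
<http://www.wikidata.org/entity/Q64> <http://schema.org/name> "Berlin"@en .
<http://www.wikidata.org/entity/Q64> <http://schema.org/description> "capital of Germany"@en .
```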

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Aug 27 2016, 7:17 PM
Restricted Application added a project: Discovery. · View Herald Transcript · Aug 27 2016, 7:18 PM
-jem- added a subscriber: -jem-. · Aug 28 2016, 12:09 AM
thiemowmde triaged this task as Low priority. · Aug 31 2016, 8:22 AM
thiemowmde added subscribers: daniel, aude, hoo, Jonas.
Smalyshev moved this task from Needs triage to WDQS on the Discovery board. · Aug 31 2016, 11:31 PM
hoo added a subscriber: ArielGlenn. · Sep 1 2016, 8:41 PM

The current infrastructure supports this easily, if we are OK with the dump not being consistent with the .ttl one (if @ArielGlenn is fine with this, I can easily set that up in less than an hour).

If we want the dumps to be consistent, we will either need a maintenance script that can create both dumps at once, or we need to find a way to derive one dump from the other (see T94019: Generate RDF from JSON for this).

Since there's already talk about phasing out .ttl, and the JSON and RDF dumps are not generated from one another either, I think the same is acceptable for .nt. But if somebody objects, let's hear it.

Smalyshev raised the priority of this task from Low to Normal. · Sep 14 2016, 11:48 PM

When you say 'not consistent', what do you mean precisely?

hoo added a comment. · Oct 28 2016, 3:30 PM

When you say 'not consistent', what do you mean precisely?

Given that the N-Triples version will be its own cron job, it will capture the state of the wiki at the time that job runs (which will differ from the state captured by the existing .ttl dump).

Doesn't bother me at all. You have the go-ahead from me.

I'd prefer all RDF dumps to be created from an initial JSON dump. But that's just a mid-term dream, not a blocker for this.

For the record, I definitely want to keep ttl as an output format. But I'm fine with dropping it in favor of nt for our periodic dumps.

Smalyshev added a comment. (Edited) · Oct 28 2016, 9:24 PM

Just for the sake of clarity, we're not talking about removing any formats, or about any changes to the dumper code at all. We're just talking about running the same script with different parameters so that it produces .nt output.

Hydriz added a subscriber: Hydriz. · Nov 4 2016, 10:52 AM

Hi,

I don't really think nt adds much value. If you produce valid Turtle, there are tools such as the Raptor RDF Syntax Library that easily convert between different RDF syntaxes. Everyone who really needs nt can do this fairly easily themselves, e.g.:

rapper --input turtle --output ntriples *.ttl

Andreas

I don't think anything is "easy" with a 12G dump; processing it will take time. So if the NT format is useful, why not save people that time, so they don't have to do the conversion individually and repeatedly.

Now, if there are tools that allow dumps to be converted quickly instead of generated independently, that may be worth considering.

Lydia_Pintscher moved this task from incoming to ready to go on the Wikidata board. · Nov 7 2016, 1:41 PM
Lydia_Pintscher added a subscriber: Lydia_Pintscher.

Based on the discussion on the mailing list I think it is OK to go ahead with this.

hoo added a comment. (Edited) · Jan 10 2017, 6:25 PM

I played with this a bit: converting the dump using rapper doesn't work (it tries to load the whole dump into memory before converting). I was able to convert it using serdi, though (I didn't verify the result, but it looks good at a glance).

I generated the following ntriples dumps:

-rwxrwxrwx. 1 hoch_m hoch_m 20000518472 30. Dez 03:47 wikidata-20161226-all-BETA.nt.gz
-rwxrwxrwx. 1 hoch_m hoch_m 14465063185  5. Jan 14:58 wikidata-20161226-all-BETA.nt.zst

As you can see, the one compressed with zstd is considerably smaller (and can be unpacked very fast). I compressed it using pzstd (but I didn't record the compression level; maybe 18).

Also note that, due to T154531, our directly generated .nt dumps would currently be broken until that is fixed. Hopefully the converted dumps are still OK.

Should we try pzstd on .ttl dumps too? Looks like it achieves significant reduction (though .nt is much easier to reduce than .ttl).

hoo added a comment. · Jan 10 2017, 7:12 PM

Also note due to T154531 our .nt generated dumps would currently be broken until it's fixed. Hopefully converted dumps are still OK.

Yes, it looks fine when converted with serdi: <http://www.wikidata.org/entity/Q33742> <http://schema.org/description> "language naturally spoken by humans, as opposed to \"formal\" or \"built\" languages"@en .

Should we try pzstd on .ttl dumps too? Looks like it achieves significant reduction (though .nt is much easier to reduce than .ttl).

Yeah, we can give it a shot… although I can't do it myself right now (I don't have a server with zstd at the moment). Adding a new compression format should be carefully considered, because removing it afterwards is hard.

I tried compressing a dump with zstd; the result is:

-rw-r--r-- 1 smalyshev wikidev 9.0G Jan 12 02:58 wikidata.ttl.zstd

Original sizes:

-rw-rw-r-- 1 abcdefg icinga 9.0G Jan 11 00:47 /public/dumps/public/wikidatawiki/entities/20170109/wikidata-20170109-all-BETA.ttl.bz2
-rw-rw-r-- 1 abcdefg icinga  12G Jan 10 21:55 /public/dumps/public/wikidatawiki/entities/20170109/wikidata-20170109-all-BETA.ttl.gz

Looks like there's not much advantage over bz2. The command line used was:

gunzip -c /public/dumps/public/wikidatawiki/entities/20170109/wikidata-20170109-all-BETA.ttl.gz | ./zstd -15 -o /data/scratch/wdqs/wikidata.ttl.zstd

Tried converting the current dump with serdi; it took 8 hours on labs. The .nt.gz is 64G (the original .ttl.gz is 38G); the .bz2 should be around 52G. I think converting .ttl to .nt with serdi is workable.

Smalyshev moved this task from Backlog to Next on the User-Smalyshev board. · Jul 9 2018, 5:00 PM
Smalyshev moved this task from Next to In review on the User-Smalyshev board. · Jul 25 2018, 9:53 PM

Change 447922 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Create wikidata ntriples dump from ttl dump

https://gerrit.wikimedia.org/r/447922

Change 447922 merged by ArielGlenn:
[operations/puppet@production] Create wikidata ntriples dump from ttl dump

https://gerrit.wikimedia.org/r/447922

The above change is now live on snapshot1008 (where this job runs) and will take effect for the next run on Monday morning.

Smalyshev moved this task from In review to Done on the User-Smalyshev board. · Sep 21 2018, 6:02 AM
Smalyshev closed this task as Resolved. · Sep 26 2018, 7:14 AM
Smalyshev claimed this task.