
Create .nt (NTriples) dumps for wikidata data
Closed, ResolvedPublic

Description

Currently we have dumps in JSON and Turtle (.ttl) formats. It may be useful to have dumps in .nt (NTriples) format since this format is line-based and much easier to process with non-RDF tools.

Thus, I think it makes sense to create such a dump in https://dumps.wikimedia.org/wikidatawiki/entities/

The feedback about the idea on wikidata list was largely positive.

In fact, since NTriples is a subset of Turtle, we may want to phase out the .ttl dumps eventually and have only .nt dumps, but this does not need to happen in one step. It may be better to have both for a while to give people a chance to change their tools and test them on the .nt dump.
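To illustrate the difference, here is the same statement (an illustrative example, not copied from an actual dump) in both syntaxes; the one-complete-triple-per-line form of NTriples is what makes it easy to handle with line-oriented tools like grep, awk, or split:

# Turtle: prefixes declared once, triples grouped per subject
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
wd:Q42 rdfs:label "Douglas Adams"@en .

# NTriples: one self-contained triple per line, full URIs, no prefixes
<http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en .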

Event Timeline

thiemowmde added subscribers: daniel, aude, hoo, Jonas.

The current infrastructure supports this easily, if we are ok with the dump not being consistent with the ttl one (if @ArielGlenn is ok with this, I can easily set that up in less than an hour).

If we want the dumps to be consistent, we will either need a maintenance script that can create both dumps at once, or we find a way to derive one dump from another (see T94019: Generate RDF from JSON for this).

I think that since there's talk about phasing out .ttl anyway, and json/nt are not generated from one another, it will be OK for .nt too. But if somebody else objects, let's hear it.

Smalyshev raised the priority of this task from Low to Medium.Sep 14 2016, 11:48 PM

When you say 'not consistent', what do you mean precisely?

> When you say 'not consistent', what do you mean precisely?

Given that the n-triples version will run as its own cron job, it will capture the state of the wiki at the time that cron job runs (and that will differ from the state of the currently existing ttl dump).

Doesn't bother me at all. You have the go-ahead from me.

I'd prefer all RDF dumps to be created from an initial JSON dump. But that's just a mid-term dream, not a blocker for this.

For the record, I definitely want to keep ttl as an output format. But I'm fine with dropping it in favor of nt for our periodic dumps.

Just for the sake of clarity, we're not talking about removing any formats, or about any changes to the dumper code at all. We are just talking about running the same script with different params so that it produces .nt output.

Hi,

I don't really think nt adds much value. If you produce valid Turtle, there are tools such as the Raptor RDF Syntax Library that easily convert between different RDF syntaxes. Anyone who really needs nt can do this fairly easily themselves, e.g.

rapper --input turtle --output ntriples *.ttl

Andreas

I don't think anything is "easy" with a 12G dump; processing it will take time. So if the NT format is useful, why not save people the time so they don't have to do it individually and repeatedly.

Now, if there are tools that allow us to quickly convert dumps instead of generating them independently, that may be worth considering.

Lydia_Pintscher subscribed.

Based on the discussion on the mailing list, I think it is OK to go ahead with this.

I played with this a bit: Converting the dump using rapper doesn't work (it tries to load the whole dump into memory before converting). I was able to convert it using [[https://drobilla.net/software/serd|serdi]], though (I didn't verify the result, but it looks good at a glance).
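For reference, the serdi invocation would be along these lines (a sketch, not necessarily the exact command used; the file names are assumed):

# convert Turtle to NTriples, then recompress the result
serdi -i turtle -o ntriples wikidata-20161226-all-BETA.ttl > wikidata-20161226-all-BETA.nt
gzip wikidata-20161226-all-BETA.nt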

I generated the following ntriples dumps:

-rwxrwxrwx. 1 hoch_m hoch_m 20000518472 30. Dez 03:47 wikidata-20161226-all-BETA.nt.gz
-rwxrwxrwx. 1 hoch_m hoch_m 14465063185  5. Jan 14:58 wikidata-20161226-all-BETA.nt.zst

As you can see, the one compressed with zstd is considerably smaller (and can be unpacked very fast). I compressed it using pzstd (but I didn't record the compression level; maybe 18).
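For reference, the .zst file above would have been produced with something roughly like this (a sketch; the compression level and thread count here are assumptions):

# parallel zstd compression; writes wikidata-20161226-all-BETA.nt.zst and keeps the input
pzstd -18 -p 4 wikidata-20161226-all-BETA.nt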

Also note that, due to T154531, natively generated .nt dumps would currently be broken until that is fixed. Hopefully converted dumps are still OK.

Should we try pzstd on .ttl dumps too? Looks like it achieves significant reduction (though .nt is much easier to reduce than .ttl).

> Also note that, due to T154531, natively generated .nt dumps would currently be broken until that is fixed. Hopefully converted dumps are still OK.

Yes, it looks fine when converted with serdi: <http://www.wikidata.org/entity/Q33742> <http://schema.org/description> "language naturally spoken by humans, as opposed to \"formal\" or \"built\" languages"@en ..

> Should we try pzstd on .ttl dumps too? Looks like it achieves significant reduction (though .nt is much easier to reduce than .ttl).

Yeah, we can give it a shot… although I can't do it myself right now (I don't currently have a server with zstd). Adding a new compression format should be carefully considered, because removing it afterwards is hard.

I tried compressing a dump with zstd; the result is:

-rw-r--r-- 1 smalyshev wikidev 9.0G Jan 12 02:58 wikidata.ttl.zstd

Original sizes:

-rw-rw-r-- 1 abcdefg icinga 9.0G Jan 11 00:47 /public/dumps/public/wikidatawiki/entities/20170109/wikidata-20170109-all-BETA.ttl.bz2
-rw-rw-r-- 1 abcdefg icinga  12G Jan 10 21:55 /public/dumps/public/wikidatawiki/entities/20170109/wikidata-20170109-all-BETA.ttl.gz

Looks like there is not much advantage over bz2. The command line used was:

gunzip -c /public/dumps/public/wikidatawiki/entities/20170109/wikidata-20170109-all-BETA.ttl.gz | ./zstd -15 -o /data/scratch/wdqs/wikidata.ttl.zstd

Tried converting the current dump with serdi; it took 8 hrs on labs. The .nt.gz is 64G (the original .ttl.gz is 38G); the .bz2 should be around 52G. I think converting .ttl to .nt with serdi is workable.
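For the record, a streaming pipeline roughly like the one below avoids writing an uncompressed intermediate file (a sketch, assuming serdi accepts '-' for reading from stdin; DATE is a placeholder):

# decompress, convert Turtle to NTriples on the fly, and recompress
gunzip -c wikidata-DATE-all-BETA.ttl.gz | serdi -i turtle -o ntriples - | gzip > wikidata-DATE-all-BETA.nt.gz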

Change 447922 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Create wikidata ntriples dump from ttl dump

https://gerrit.wikimedia.org/r/447922

Change 447922 merged by ArielGlenn:
[operations/puppet@production] Create wikidata ntriples dump from ttl dump

https://gerrit.wikimedia.org/r/447922

The above change is now live on snapshot1008 (where this job runs) and will take effect for the next run on Monday morning.

Smalyshev claimed this task.