
Consider sorting nt dumps and omit duplicate rows
Closed, Declined · Public

Description

This might make the resulting file quite a bit smaller (sorted N-Triples lines share long common prefixes, which gzip compresses much better) and easier to handle, but I can't tell by how much.

People seem to have already looked into that for Freebase in the past, see http://kmkeen.com/gz-sort/. A solution like https://stackoverflow.com/a/24581206 might also work well here.
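
For comparison, the generic approach behind the Stack Overflow answer amounts to an external-merge sort with GNU coreutils. A minimal sketch, assuming GNU sort; the file names, buffer size, and scratch directory are placeholders, not measured values:

$ zcat wikidata-truthy.nt.gz \
    | LC_ALL=C sort -u -S 4G -T /scratch --compress-program=gzip \
    | gzip -9 > wikidata-truthy.sorted.nt.gz

LC_ALL=C keeps the comparison byte-wise (faster and locale-independent), -u drops duplicate lines during the merge, and --compress-program keeps sort's temporary run files compressed on disk.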

Event Timeline

hoo created this task. Oct 5 2017, 8:08 PM
Restricted Application added a subscriber: Aklapper. Oct 5 2017, 8:08 PM
hoo renamed this task from "Consider sorting nt dumps" to "Consider sorting nt dumps and omit duplicate rows". Oct 5 2017, 8:11 PM
hoo added a comment. Oct 8 2017, 9:23 PM

I ran the above-mentioned tool (gz-sort; -u drops duplicate lines, -S 100M sets the presort buffer size) on a slow-ish VM over the latest truthy dump:

$ time ~/gz-sort/gz-sort -u -S 100M wikidata-20170927-truthy-BETA.nt.gz ~/wikidata-20170927-truthy-BETA.nt.sort.gz
 line count: 1924967162
 presort: 219.15 minutes
 merge 396083: 186.55 minutes
 merge 792167: 183.47 minutes
 merge 1584335: 182.98 minutes
 merge 3166064: 183.37 minutes
 merge 6332128: 183.77 minutes
 merge 12664257: 183.28 minutes
 merge 25328515: 183.90 minutes
 merge 50657030: 185.42 minutes
 merge 101314061: 217.00 minutes
 merge 192496716: 219.67 minutes
 merge 384993432: 218.62 minutes
 merge 641655720: 217.23 minutes
 merge 962483581: 224.97 minutes
removed 303419 non-unique lines

real    2789m32.668s
user    2598m34.233s
sys     18m21.880s

The resulting gzipped file was about 4% larger, but that was probably because it wasn't compressed with -9. Sadly, I accidentally deleted the sorted dump, so I can't check how large it would be with gzip -9 or other compression settings… but I kind of doubt it would be worth it.
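For future reference, the size under maximum compression could have been checked without writing a second file, e.g. (a sketch; the file name matches the run above, and xz is just one example of an alternative compressor):

$ zcat wikidata-20170927-truthy-BETA.nt.sort.gz | gzip -9 | wc -c
$ zcat wikidata-20170927-truthy-BETA.nt.sort.gz | xz -9 | wc -c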

hoo closed this task as Declined. Oct 25 2017, 6:31 PM

Removing a measly 300k lines (about 0.016% of the dump's ~1.9 billion) is probably not worth it. Declining.