
Add HDT dump of Wikidata
Open, Stalled, Low, Public

Description

HDT is a compact binary format for RDF that can also support efficient querying. On the mailing list, people have requested that we offer an HDT dump in addition to the TTL dumps, allowing them to run queries on their own systems that would take too long to run on the Wikidata Query Service.

There is an rdf2hdt tool (link; LGPLv2.1+) that can convert TTL dumps to HDT files. Unfortunately, it doesn’t run in a streaming fashion (it doesn’t even open the output file until it’s done converting) and seems to require almost as much memory as the uncompressed TTL dump to run. I tried to run it on the latest Wikidata dump, but the program was OOM-killed after having consumed 2.32 GiB of the gzipped input dump (according to pv), which corresponds to 15.63 GiB of uncompressed input data; the last VmSize before it was killed was 13.04 GiB. As the full uncompressed TTL dump is 187 GiB (201 GB), it looks like we would need a machine with at least ~200 GB of memory to do the conversion. (Perhaps we could get away with using lots of swap space instead of actual RAM – I have no idea what kind of memory access patterns the tool has.)
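
A minimal sketch of the kind of invocation used for the measurement above (file names are illustrative, and the exact sandboxed command used later in this task is quoted further down):

# pv reports how much of the gzipped dump has been consumed so far;
# /usr/bin/time -v prints the peak resident set size once the process
# exits (VmSize can also be watched in /proc/<pid>/status while it runs).
pv latest-all.ttl.gz | gunzip |
    /usr/bin/time -v rdf2hdt -f ttl /dev/stdin latest-all.hdt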

As for the processing time, on my system 9% of the dump was processed in 23 minutes, so the full conversion would probably take some hours, but not days. The CPU time as reported by Bash’s time builtin was actually less than the wall-clock time, so it doesn’t look like the tool is multi-threaded. But of course it’s possible that there is some additional phase of processing after the tool is done reading the file, and I have no idea how long that could take.
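
For reference, a quick back-of-the-envelope extrapolation of that figure (pure arithmetic; it ignores any post-processing phase after the input has been read):

# 9 % in 23 minutes extrapolates to roughly 23 / 0.09 ≈ 256 minutes of
# single-threaded parsing, i.e. a bit over four hours.
echo 'scale=2; 23 / 0.09 / 60' | bc
# 4.25 (hours)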

See also rdfhdt/hdt-cpp#119 for some discussion on converting large datasets. For now, it seems that the large memory requirement is expected. The discussion also points to a MapReduce-based implementation, but there haven’t been any commits to it for a year, and I have no idea if it’s currently possible to use it (there seems to be some build failure, at least).

Event Timeline

it doesn’t even open the output file until it’s done converting

That might be a problem when we have 4bn triples... I think "load the whole thing in memory" is a doomed approach - even if we find a way to get past the memory limits for the current dump, what would happen when it doubles in size?

The idea that you need to keep everything in memory to compress/optimize is of course not true - you can still do pretty fine with disk-based storage; that's what Blazegraph does, for example, and probably nearly every other graph DB. Yes, it would be a bit slower and require some careful programming, but it's not something that should be impossible. Unfortunately, https://github.com/rdfhdt/hdt-cpp/issues/119 sounds like the people behind HDT are not interested in doing this work. Without it, the idea of converting the Wikidata data set is a no-go, unfortunately - I do not see how the Wikidata data set can be served with a "load up everything in memory" paradigm. If we find somebody who wants to/can do the work that allows HDT to process large datasets, then I think it is a good idea to have it in dumps, but not before that.

FWIW, I've just tried to convert the ttl dump of the 1st of November 2017 on a machine with 378 GiB of RAM and 0 GiB of swap and… well… it failed with std::bad_alloc after more than 21 hours of runtime. Granted, there was another process eating ~100 GiB of memory, but I thought it would be okay — so I was proven wrong.

As I was optimistic, I ran the conversion directly from the ttl.gz file, which may have prevented some memory-mapping optimization, and I also added the -i flag to generate the index at the same time. I'll re-run the conversion without these in the hope of finally getting the hdt file.
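
For reference, the planned re-run would look roughly like this (a sketch only; decompressing to a plain .ttl file first is my assumption of what running "without these" means):

# Decompress first so rdf2hdt reads (and can possibly memory-map) a plain
# Turtle file, and drop -i so no index is generated in the same run.
zcat wikidata-20171101-all.ttl.gz > wikidata-20171101-all.ttl
rdf2hdt -f ttl -p wikidata-20171101-all.ttl wikidata-20171101-all.hdt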

So, here are the statistics I got:

$ /usr/bin/time -v rdf2hdt -f ttl -i -p wikidata-20171101-all.ttl.gz  wikidata-20171101-all.hdt
Catch exception load: std::bad_alloc
ERROR: std::bad_alloc
Command exited with non-zero status 1
        Command being timed: "rdf2hdt -f ttl -i -p wikidata-20171101-all.ttl.gz wikidata-20171101-all.hdt"
        User time (seconds): 64999.77
        System time (seconds): 10906.79
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 21:13:25
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 200475524
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 703
        Minor (reclaiming a frame) page faults: 8821385485
        Voluntary context switches: 36774
        Involuntary context switches: 4514261
        Swaps: 0
        File system inputs: 81915000
        File system outputs: 2767696
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1
/usr/bin/time -v rdf2hdt -f ttl -i -p wikidata-20171101-all.ttl.gz   64999,77s user 10906,80s system 99% cpu 21:13:25,50 total

NB: the exceptionally long runtime is the result of the conversion being single-threaded while the machine has a lot of threads but relatively low per-thread performance (2.3 GHz). The process wasn't under memory pressure until it crashed (there is no swap anyway) and wasn't waiting much for I/O — so it was all CPU-bound.

@Smalyshev we discussed dumping the JNL files used by Blazegraph directly at points during WikidataCon.
I'm aware that isn't an HDT dump, but I'm wondering if this would help in any way.

I ran the conversion directly from the ttl.gz file

Interesting, I couldn’t get that to work and had to pipe gunzip output into the program.

I also tried converting the latest dump, and since I don’t have access to any system with that much RAM, I thought I could perhaps trade some execution time for swap space. Bad idea :) the process got through 20% of the input file and then slowed to a crawl, at data rates of single-digit kilobytes per second. It would’ve taken half a year to finish at that rate.

But FWIW, here’s the command I used, with a healthy dose of systemd sandboxing since it’s a completely unknown program I’m running:

time pv latest-all.ttl.gz |
    gunzip |
    sudo systemd-run --wait --pipe --unit rdf2hdt \
        -p CapabilityBoundingSet=CAP_DAC_OVERRIDE \
        -p ProtectSystem=strict -p PrivateNetwork=yes -p ProtectHome=yes -p PrivateDevices=yes \
        -p ProtectKernelTunables=yes -p ProtectControlGroups=yes \
        -p NoNewPrivileges=yes -p RestrictNamespaces=yes \
        -p MemoryAccounting=yes -p CPUAccounting=yes -p BlockIOAccounting=yes -p IOAccounting=yes -p TasksAccounting=yes \
        /usr/local/bin/rdf2hdt -i -f ttl -B 'http://wikiba.se/ontology-beta#Dump' /dev/stdin /dev/stdout \
    >| wikidata-2017-11-01.hdt

I had to make install the program because the libtoolized dev build doesn’t really support being run like that. (See systemd/systemd#7254 for the CapabilityBoundingSet part – knowing what I know now, -p $USER would’ve been the better choice.)

@Smalyshev we discussed dumping the JNL files used by Blazegraph directly at points during WikidataCon.
I'm aware that isn't an HDT dump, but I'm wondering if this would help in any way.

Can we reliably get a consistent snapshot of those files when BlazeGraph is constantly writing updates to them?

@Smalyshev we discussed dumping the JNL files used by Blazegraph directly at points during WikidataCon.
I'm aware that isn't an HDT dump, but I'm wondering if this would help in any way.

Can we reliably get a consistent snapshot of those files when BlazeGraph is constantly writing updates to them?

It would probably be easy enough to pause updating on a host, turn off Blazegraph, rsync the file to somewhere, and turn Blazegraph and the updater back on.
The real question is whether this would be useful for people.

With the docker images that I have created, it would mean that on a docker host / in a docker container with enough resources, people could spin up a matching version of Blazegraph and query the data with no timeout.
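
For what it's worth, the snapshot procedure described above could look roughly like this (a sketch only; the service names, journal path and destination host are assumptions, not verified against the actual WDQS setup):

# Pause the updater so no new triples arrive, stop Blazegraph so the
# journal file is quiescent, copy it, then bring everything back up.
sudo systemctl stop wdqs-updater
sudo systemctl stop wdqs-blazegraph
rsync -av /srv/wdqs/wikidata.jnl dumps-host:/data/wikidata-$(date +%Y%m%d).jnl
sudo systemctl start wdqs-blazegraph
sudo systemctl start wdqs-updater   # the updater then catches up on missed changes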

I ran the conversion directly from the ttl.gz file

Interesting, I couldn’t get that to work and had to pipe gunzip output into the program.

Interesting, indeed… Could it be that you added the -f ttl flag afterwards? I couldn't get it to accept a gzip file as input without this flag (I assume it does file format detection based on the file extension).

Also, I had to install zlib-devel to get rdfhdt to compile on a CentOS 6 container — there might be some non-zlib-enabled build on Debian that isn't available on RedHat.

I also tried converting the latest dump, and since I don’t have access to any system with that much RAM, I thought I could perhaps trade some execution time for swap space. Bad idea :) the process got through 20% of the input file and then slowed to a crawl, at data rates of single-digit kilobytes per second. It would’ve taken half a year to finish at that rate.

Thanks for testing! That would have required a hell of a lot of swap space anyway. Easy to set up for whoever does this on a regular basis, but for casual needs, I've never seen a machine with 200+ GiB of swap space.
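
For completeness, allocating that much swap is indeed simple on a filesystem such as ext4, even though the test above shows it is far too slow for this workload (size and path are illustrative):

# Create, protect and enable a 256 GiB swap file.
sudo fallocate -l 256G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile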

But FWIW, here’s the command I used, with a healthy dose of systemd sandboxing since it’s a completely unknown program I’m running:
<snip>

Thanks for sharing the sandboxing bits! :-)

I'm afraid the current implementation of HDT is not ready to handle more than 4 billion triples, as it is limited to 32-bit indexes. I've opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135

Until this is addressed, don't waste your time trying to convert the entire Wikidata to HDT: it can't work.

Smalyshev changed the task status from Open to Stalled. Nov 7 2017, 5:55 PM
Smalyshev triaged this task as Low priority.
Smalyshev changed the task status from Stalled to Open. Dec 16 2017, 12:44 AM

This: https://lists.wikimedia.org/pipermail/wikidata/2017-December/011607.html suggests that the 64-bit version can process the whole dataset successfully; see also https://lists.wikimedia.org/pipermail/wikidata/2017-December/011588.html. Thus, I imagine, it should be possible to create HDT from dumps, but I'm not sure if we can/want to do this on Wikimedia infrastructure or just refer to an external resource like the one kindly provided by Wouter Beek.

Smalyshev changed the task status from Open to Stalled. May 28 2019, 11:51 PM
Smalyshev added a project: patch-welcome.

The 32-bit issue at https://github.com/rdfhdt/hdt-cpp/issues/135 that was mentioned above seems to be resolved, so perhaps this can be revisited now?

As I was having some issues with compiling the code, I used a Docker image directly for the conversion. Unfortunately, it failed due to RDF syntax errors while using the latest dump. As I didn't time it, I cannot give any details about the performance yet.

sudo docker run -v `pwd`:/wikidata rdfhdt/hdt-cpp:v1.3.3 rdf2hdt -p -i wikidata/latest-all.nt.gz wikidata/latest-all.hdt
error: wikidata/latest-all.nt.gz:604276348:139: bad IRI scheme char `2F'
Catch exception load: Error parsing input.
ERROR: Error parsing input.
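
To see which triple the parser choked on, the reported position can be looked up directly in the dump (assuming the line number in the error message refers to the decompressed N-Triples stream):

# Print line 604276348 of the decompressed dump and stop reading.
zcat latest-all.nt.gz | sed -n '604276348{p;q}'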

Small update from my side. After downloading the latest ttl file from Wikidata, I receive no errors but also no output. I tried the exact same command with a small dataset and that worked.

time sudo docker run -v `pwd`:/wikidata rdfhdt/hdt-cpp:v1.3.3 rdf2hdt -f turtle -p -i wikidata/latest-all.ttl.gz wikidata/latest-all.hdt

sudo docker run -v `pwd`:/wikidata rdfhdt/hdt-cpp:v1.3.3 rdf2hdt -f turtle -p  19.75s user 13.90s system 0% cpu 50:21:55.81 total

So I am not exactly sure what is happening. As a test, I ran the same conversion on a temporary file containing just the first 103 lines of the turtle file.
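
Such a test file can be produced along these lines (a sketch; the comment above does not say exactly how it was made):

# Take the first 103 lines of the decompressed dump and recompress them.
zcat latest-all.ttl.gz | head -n 103 | gzip > tmp.ttl.gz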

time sudo docker run -v `pwd`:/wikidata rdfhdt/hdt-cpp:v1.3.3 rdf2hdt -f turtle -p -i wikidata/tmp.ttl.gz wikidata/tmp.hdt       
Predicate Bitmap in 21 us
Count predicates in 17 us
Count Objects in 8 us  Max was: 8
Bitmap in 9 us
Bitmap bits: 56  Ones: 38
Object references in 23 us
Sort lists in 17 us
Index generated in 119 us
sudo docker run -v `pwd`:/wikidata rdfhdt/hdt-cpp:v1.3.3 rdf2hdt -f turtle -p  0.04s user 0.03s system 1% cpu 4.868 total

and then I can access the resulting hdt file on the local drive.