HDT is a compact binary format for RDF that can also support efficient querying. On the mailing list, people have requested that we offer an HDT dump in addition to the TTL dumps, allowing them to run queries on their own systems that would take too long to run on the Wikidata Query Service.
There is an rdf2hdt tool (link; LGPLv2.1+) that can convert TTL dumps to HDT files. Unfortunately, it doesn’t run in a streaming fashion (it doesn’t even open the output file until it’s done converting) and seems to require almost as much memory as the uncompressed TTL dump to run. I tried to run it on the latest Wikidata dump, but the program was OOM-killed after having consumed 2.32 GiB of the gzipped input dump (according to pv), which corresponds to 15.63 GiB of uncompressed input data; the last VmSize before it was killed was 13.04 GiB. As the full uncompressed TTL dump is 187 GiB (201 GB), it looks like we would need a machine with at least ~200 GB of memory to do the conversion.
(Perhaps we could get away with using lots of swap space instead of actual RAM – I have no idea what kind of memory access patterns the tool has.)
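As a rough sanity check on that memory figure, here is a minimal sketch that extrapolates from the numbers above. It assumes the tool's memory use grows linearly with the amount of uncompressed input it has read, which I have not verified; the function name is just for illustration.

```python
# Linear extrapolation of rdf2hdt memory use from the OOM-killed run.
# Assumption (not verified): memory grows linearly with input consumed.

GIB = 2**30  # bytes per GiB
GB = 10**9   # bytes per GB

uncompressed_read_gib = 15.63  # uncompressed input consumed before the kill
vmsize_at_kill_gib = 13.04     # last observed VmSize
full_dump_gib = 187.0          # size of the full uncompressed TTL dump


def extrapolate_memory_gib(read_gib, vmsize_gib, total_gib):
    """Scale the observed memory/input ratio up to the full dump size."""
    if read_gib <= 0:
        raise ValueError("read_gib must be positive")
    return vmsize_gib / read_gib * total_gib


estimate_gib = extrapolate_memory_gib(
    uncompressed_read_gib, vmsize_at_kill_gib, full_dump_gib
)
estimate_gb = estimate_gib * GIB / GB

print(f"memory per unit of input: {vmsize_at_kill_gib / uncompressed_read_gib:.2f}")
print(f"extrapolated need: {estimate_gib:.0f} GiB ({estimate_gb:.0f} GB)")
```

The naive extrapolation comes out somewhat below the size of the uncompressed dump, so a machine with ~200 GB would leave some headroom if the growth really is linear. If memory use grows faster than that later in the run, the estimate is too low.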
As for the processing time, on my system 9% of the dump was processed in 23 minutes, so the full conversion would probably take a few hours (roughly four, if the rate stays constant), not days. The CPU time reported by Bash’s time builtin was actually less than the wall-clock time, so the tool doesn’t appear to be multi-threaded. Of course, there may be an additional processing phase after the tool has finished reading the file, and I have no idea how long that could take.
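The time estimate can be reproduced the same way. This sketch assumes a constant read rate and covers only the input-reading phase; any later processing phase is not accounted for. The function name is hypothetical.

```python
# Linear extrapolation of rdf2hdt run time from the partial run.
# Assumption: the processing rate stays constant over the whole dump.

elapsed_minutes = 23.0  # wall-clock time until the tool was killed
fraction_done = 0.09    # share of the dump processed in that time


def extrapolate_total_minutes(elapsed, fraction):
    """Estimate the total run time from the elapsed time and the fraction done."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return elapsed / fraction


total_minutes = extrapolate_total_minutes(elapsed_minutes, fraction_done)
print(f"estimated reading phase: {total_minutes:.0f} min "
      f"({total_minutes / 60:.1f} h)")
```

That puts the reading phase at a bit over four hours on my system, which is in line with "hours, not days".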
See also rdfhdt/hdt-cpp#119 for some discussion on converting large datasets. For now, it seems that the large memory requirement is expected. The discussion also points to a MapReduce-based implementation, but there haven’t been any commits to it for a year, and I have no idea if it’s currently possible to use it (there seems to be some build failure, at least).