Stream input file from a tarball
Closed, Resolved · Public

Description

The HTML dumps are in .tar.gz format, which is inconvenient because most languages don't offer a simple way to stream data out of a compressed tarball. Ideally we're streaming from an NFS volume or, less ideally, over HTTP, but the files are too large to save and decompress locally (e.g. 100 GB compressed). We need to read the file as a stream.

tar itself provides a solution, so at a minimum we can run this command line and pipe its output into our process's stdin:

tar xzf hawiki-NS0-20230301-ENTERPRISE-HTML.json.tar.gz --to-stdout
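
If we go that route, the piped output can be consumed from stdin on the Elixir side. A minimal sketch, assuming each NDJSON line holds one article and that Jason (or any other JSON decoder) is available; the "name" field and the script setup are illustrative:

  # Consume the piped output of `tar … --to-stdout` line by line from stdin,
  # decoding one article per line.
  :stdio
  |> IO.stream(:line)
  |> Stream.map(&Jason.decode!/1)
  |> Stream.each(fn article -> IO.puts(article["name"]) end)
  |> Stream.run()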

However, we should explore the erl_tar source to see whether it's possible to stream entirely from the BEAM environment. If it's necessary to shell out to the command line, here's an interesting attempt to wrap the logic in Elixir using "pipes": https://elixirforum.com/t/streaming-tar-files/20246
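
For the shell-out variant, one way to wrap the command is to spawn tar as a Port and expose its stdout as an Elixir Stream. A rough sketch, not what's in the merge request; the module and function names are made up:

  defmodule DumpStream do
    # Spawns `tar xzf <tarball> --to-stdout` and emits its stdout as binary chunks.
    def stream_tar_contents(tarball) do
      Stream.resource(
        fn ->
          Port.open({:spawn_executable, System.find_executable("tar")}, [
            :binary,
            :exit_status,
            args: ["xzf", tarball, "--to-stdout"]
          ])
        end,
        fn port ->
          receive do
            {^port, {:data, chunk}} -> {[chunk], port}
            {^port, {:exit_status, 0}} -> {:halt, port}
            {^port, {:exit_status, status}} -> raise "tar exited with status #{status}"
          end
        end,
        fn port -> if Port.info(port), do: Port.close(port) end
      )
    end
  end

The chunks come out as raw bytes rather than lines, so a line-splitting step would still be needed before JSON decoding.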

Work in progress: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/1

Event Timeline

I'm begging the maintainers to drop the "tar" layer, here: T298436#8704523. Any changes might come too late for our project, but at least we'll be able to simplify the code if that day comes.

Mitar subscribed.

The tar format is really made for streaming, so I am surprised that this is hard to do in your programming language. Seeking is the problem with tar, but streaming is really easy. It is really just a concatenation of files, so it is similar to any other buffered stream.
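
To illustrate the point: a tar archive is a sequence of entries, each one a 512-byte header followed by the file's bytes padded to the next 512-byte boundary, so a streaming reader only has to parse headers as they arrive. A rough sketch of the header parse (module name made up, ustar/pax extensions ignored):

  defmodule TarHeader do
    # Two consecutive all-zero 512-byte blocks mark the end of the archive.
    def parse(<<0::4096>>), do: :end_of_archive

    def parse(<<header::binary-size(512)>>) do
      # name: bytes 0-99, size: bytes 124-135 as octal ASCII; the entry's
      # content follows the header, padded up to a 512-byte boundary.
      <<name::binary-size(100), _mode_uid_gid::binary-size(24),
        size_octal::binary-size(12), _rest::binary>> = header

      size =
        size_octal
        |> String.replace(<<0>>, "")
        |> String.trim()
        |> String.to_integer(8)

      {:ok, String.trim_trailing(name, <<0>>), size}
    end
  end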

Interesting point! However, the use cases for tar are streaming from a filesystem to an abstract, linear storage medium (tape) and then streaming from tape back onto the filesystem. It's not designed for streaming the *contents* of an individual file through memory for processing without writing to disk first. Although --to-stdout exists and sort of undermines my point here :-)

I see that Go's tar library and Java's have elegant support for streaming single files from a tarball, so I tend to agree with what you're saying. I guess the question is whether there's any important property that we gain by using tar for these dumps?

In large dumps there are multiple files inside one archive. So tar serves as a standard way to combine those multiple files into one, and then compression is applied over all of that.

Oh, that's really helpful to know, thank you! And if the names are collated, then tar's --to-stdout even does the right thing by default, by concatenating the files in order. I'm starting to think your understanding of tar might be the correct one :-)

Is this splitting behavior (chunk size and naming) documented somewhere I can find? I don't see anything in the internal filename that indicates a split, e.g. "hawiki_0.ndjson".

On a tangent, I couldn't try the https://enterprise.wikimedia.com/docs/snapshot/ API without an account, but does it redirect to the same tarballs, also containing split files?

Yes, I made a library for processing those dumps in Go.

I think I complained somewhere as well that the filenames are not documented, but I cannot find where. :-) I think they go like hawiki_0.ndjson, hawiki_1.ndjson, and so on.

I am not familiar with the Snapshot API; I just use the publicly available dumps.

@Mitar I was able to wire up a streaming decoder, so the file format no longer bothers me :-) Thanks again for the input!
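
For anyone landing here later: the gzip layer can also be inflated incrementally on the BEAM with :zlib, so the whole pipeline (inflate, walk tar entries, split NDJSON lines) can run without touching the disk. A minimal sketch of just the inflate step, assuming the compressed bytes arrive as a stream of binary chunks (for example from File.stream!/3 or an HTTP client); this is not the code in the merge request:

  defmodule GzipStream do
    # Turns a stream of gzip-compressed binary chunks into decompressed chunks.
    def inflate(compressed_chunks) do
      Stream.transform(
        compressed_chunks,
        fn ->
          z = :zlib.open()
          # windowBits 15 + 16 tells zlib to expect a gzip header
          :ok = :zlib.inflateInit(z, 15 + 16)
          z
        end,
        fn chunk, z -> {[IO.iodata_to_binary(:zlib.inflate(z, chunk))], z} end,
        fn z -> :zlib.close(z) end
      )
    end
  end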

awight moved this task from Tech Review to Done on the WMDE-TechWish-Sprint-2023-03-14 board.