Stream input file from a tarball
Closed, Resolved · Public

Description

The HTML dumps are in .tar.gz format, which is inconvenient because most languages don't offer a simple way to stream data out of a compressed tarball. Ideally we're streaming from an NFS volume or, less ideally, over HTTP, but the files are too large to save and decompress locally (e.g. 100 GB compressed). We need to read the file as a stream.

tar itself provides a solution, so at a minimum we can run this command line and pipe its output into our process's stdin:

tar xzf hawiki-NS0-20230301-ENTERPRISE-HTML.json.tar.gz --to-stdout
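
If we go that route, the piped output can be consumed from stdin on the Elixir side. A minimal sketch, assuming each NDJSON line holds one article and that Jason (or any other JSON decoder) is available; the "name" field and the script setup are illustrative:

  # Consume the piped output of `tar … --to-stdout` line by line from stdin,
  # decoding one article per line.
  :stdio
  |> IO.stream(:line)
  |> Stream.map(&Jason.decode!/1)
  |> Stream.each(fn article -> IO.puts(article["name"]) end)
  |> Stream.run()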

However, we should explore the erl_tar source to see whether it's possible to stream entirely from the BEAM environment. If it's necessary to shell out to the command line, here's an interesting attempt to wrap the logic in Elixir using "pipes": https://elixirforum.com/t/streaming-tar-files/20246
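
For the shell-out variant, one way to wrap the command is to spawn tar as a Port and expose its stdout as an Elixir Stream. A rough sketch, not what's in the merge request; the module and function names are made up:

  defmodule DumpStream do
    # Spawns `tar xzf <tarball> --to-stdout` and emits its stdout as binary chunks.
    def stream_tar_contents(tarball) do
      Stream.resource(
        fn ->
          Port.open({:spawn_executable, System.find_executable("tar")}, [
            :binary,
            :exit_status,
            args: ["xzf", tarball, "--to-stdout"]
          ])
        end,
        fn port ->
          receive do
            {^port, {:data, chunk}} -> {[chunk], port}
            {^port, {:exit_status, 0}} -> {:halt, port}
            {^port, {:exit_status, status}} -> raise "tar exited with status #{status}"
          end
        end,
        fn port -> if Port.info(port), do: Port.close(port) end
      )
    end
  end

The chunks come out as raw bytes rather than lines, so a line-splitting step would still be needed before JSON decoding.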

Work in progress: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/1

Event Timeline

I'm begging the maintainers to drop the "tar" layer, here: T298436#8704523. Any changes might come too late for our project, but at least we'll be able to simplify the code if that day comes.

Mitar subscribed.

The tar format is really made for streaming, so I am surprised that this is hard to do in your programming language. Seeking is the problem with tar, but streaming is really easy. It is really just a concatenation of files, so it is similar to any other buffered stream.
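
To illustrate the point: a tar archive is a sequence of entries, each one a 512-byte header followed by the file's bytes padded to the next 512-byte boundary, so a streaming reader only has to parse headers as they arrive. A rough sketch of the header parse (module name made up, ustar/pax extensions ignored):

  defmodule TarHeader do
    # Two consecutive all-zero 512-byte blocks mark the end of the archive.
    def parse(<<0::4096>>), do: :end_of_archive

    def parse(<<header::binary-size(512)>>) do
      # name: bytes 0-99, size: bytes 124-135 as octal ASCII; the entry's
      # content follows the header, padded up to a 512-byte boundary.
      <<name::binary-size(100), _mode_uid_gid::binary-size(24),
        size_octal::binary-size(12), _rest::binary>> = header

      size =
        size_octal
        |> String.replace(<<0>>, "")
        |> String.trim()
        |> String.to_integer(8)

      {:ok, String.trim_trailing(name, <<0>>), size}
    end
  end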

Interesting point! However, the use cases for tar are streaming from a filesystem to an abstract, linear storage medium (tape) and then streaming from tape back onto the filesystem. It's not designed for streaming the *contents* of an individual file through memory for processing without writing to disk first. Although --to-stdout exists and sort of undermines my point here :-)

I see that Go's tar library and Java's have elegant support for streaming single files from a tarball, so I tend to agree with what you're saying. I guess the question is whether there's any important property that we gain by using tar for these dumps?

In large dumps there are multiple files inside one archive. So tar serves as a standard way to combine those multiple files into one, and then compression is applied over all of that.

Oh, that's really helpful to know, thank you! And if the names are collated, then tar's --to-stdout even does the right thing by default, by concatenating the files in order. I'm starting to think your understanding of tar might be the correct one :-)

Is this splitting behavior (chunk size and naming) documented somewhere I can find? I don't see anything in the internal filename that indicates a split, e.g. "hawiki_0.ndjson".

On a tangent, I couldn't try the https://enterprise.wikimedia.com/docs/snapshot/ API without an account, but does it redirect to the same tarballs, also containing split files?

Yes, I made a library for processing those dumps in Go.

I think I complained somewhere as well that the filenames are not documented, but I cannot find where. :-) I think they go like hawiki_0.ndjson, hawiki_1.ndjson, and so on.

I am not familiar with the Snapshot API; I just use the publicly available dumps.

@Mitar I was able to wire up a streaming decoder, so the file format no longer bothers me :-) Thanks again for the input!
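
For anyone landing here later: the gzip layer can also be inflated incrementally on the BEAM with :zlib, so the whole pipeline (inflate, walk tar entries, split NDJSON lines) can run without touching the disk. A minimal sketch of just the inflate step, assuming the compressed bytes arrive as a stream of binary chunks (for example from File.stream!/3 or an HTTP client); this is not the code in the merge request:

  defmodule GzipStream do
    # Turns a stream of gzip-compressed binary chunks into decompressed chunks.
    def inflate(compressed_chunks) do
      Stream.transform(
        compressed_chunks,
        fn ->
          z = :zlib.open()
          # windowBits 15 + 16 tells zlib to expect a gzip header
          :ok = :zlib.inflateInit(z, 15 + 16)
          z
        end,
        fn chunk, z -> {[IO.iodata_to_binary(:zlib.inflate(z, chunk))], z} end,
        fn z -> :zlib.close(z) end
      )
    end
  end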

awight moved this task from Tech Review to Done on the WMDE-TechWish-Sprint-2023-03-14 board.