The HTML dumps are distributed as .tar.gz archives, which is awkward because most languages don't offer a simple way to stream data out of a compressed tarball. Ideally we'd stream from an NFS volume, or failing that over HTTP, but the files are too large to save and decompress locally (e.g. 100GB compressed), so we need to read them as a stream.
tar itself provides a solution, so at a minimum we can run this command line, pipe its output into our process, and read it from stdin:
```
tar xzf hawiki-NS0-20230301-ENTERPRISE-HTML.json.tar.gz --to-stdout
```
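For illustration, here's a minimal sketch of consuming that piped output from Elixir, assuming the extracted files are newline-delimited JSON (one article record per line) and that a JSON library such as Jason is available; the module name and the `process_article/1` helper are placeholders:

```elixir
# Run as: tar xzf <dump>.tar.gz --to-stdout | mix run stream_dump.exs
defmodule StreamDump do
  def run do
    :stdio
    |> IO.binstream(:line)           # lazily read one line at a time from stdin
    |> Stream.map(&Jason.decode!/1)  # assumes each line is one JSON article record
    |> Stream.each(&process_article/1)
    |> Stream.run()
  end

  # Placeholder: just inspect the keys of each record instead of real processing.
  defp process_article(article) do
    IO.inspect(Map.keys(article))
  end
end

StreamDump.run()
```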
However, we should explore the erl_tar source to see whether it's possible to stream entirely from within the BEAM. If the command line turns out to be necessary, here's an interesting attempt to wrap that logic in Elixir using "pipes" (see the sketch below): https://elixirforum.com/t/streaming-tar-files/20246
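If we do end up shelling out, a hedged sketch of the fallback: wrap the tar command in an Erlang port so the consuming code stays on the BEAM even though decompression happens in an external process. The module name and error handling are assumptions, and the chunks coming off the port are arbitrary binary slices that would still need to be reassembled into complete lines before JSON decoding:

```elixir
defmodule TarPort do
  # Returns a lazy stream of raw binary chunks from `tar xzf <path> --to-stdout`.
  def stream(path) do
    Stream.resource(
      fn ->
        Port.open({:spawn_executable, System.find_executable("tar")}, [
          :binary,
          :exit_status,
          args: ["xzf", path, "--to-stdout"]
        ])
      end,
      fn port ->
        receive do
          {^port, {:data, chunk}} -> {[chunk], port}
          {^port, {:exit_status, 0}} -> {:halt, port}
          {^port, {:exit_status, status}} -> raise "tar exited with status #{status}"
        end
      end,
      fn port -> if Port.info(port), do: Port.close(port) end
    )
  end
end

# Usage sketch:
#   TarPort.stream("hawiki-NS0-20230301-ENTERPRISE-HTML.json.tar.gz")
#   |> Stream.each(&IO.binwrite/1)
#   |> Stream.run()
```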
Work in progress: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/1