We need to settle on a good distribution format for our HTML dumps. This format should:
- offer compact downloads
- be supported in most consumer environments
- ideally, support incremental builds / updates
- ideally, support random access to individual articles / revisions
Our current HTML dumper simply creates a directory per title, with a single file named after the revision number inside it. While simple, this does not scale well on some file systems: a tar file that unpacks into 10 million subdirectories inside a single directory would not work well for many users.
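For concreteness, a minimal sketch of how the current layout maps a title and revision to a path (the output root and the .html extension are illustrative assumptions, not what the dumper actually uses):

```python
import os

def flat_dump_path(out_dir: str, title: str, rev_id: int) -> str:
    """Current layout: one directory per title, one file per revision."""
    # Output root and file extension are assumptions for illustration.
    return os.path.join(out_dir, title, f"{rev_id}.html")

# flat_dump_path("dumps/enwiki", "Foo", 12345) -> "dumps/enwiki/Foo/12345.html"
```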
One option to avoid this is to use a subdirectory tree based on a prefix of the title (like /F/Fo/Foo/12345) or on a hash of the title. However, working with such a tree is not straightforward and requires a significant amount of custom client-side code.
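Both variants are easy to produce, but every client then has to reproduce the same path logic to locate an article. A minimal sketch of the two (function names, shard depth, and the md5 choice are assumptions):

```python
import hashlib
import os

def prefix_dump_path(out_dir: str, title: str, rev_id: int, depth: int = 2) -> str:
    """Title-prefix tree, e.g. F/Fo/Foo/12345 for title "Foo", revision 12345."""
    prefixes = [title[:i + 1] for i in range(min(depth, len(title)))]
    return os.path.join(out_dir, *prefixes, title, str(rev_id))

def hashed_dump_path(out_dir: str, title: str, rev_id: int, depth: int = 2) -> str:
    """Hash-based variant: shard on a digest prefix so directory sizes stay
    balanced regardless of how the titles themselves are distributed."""
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    prefixes = [digest[:i + 1] for i in range(depth)]
    return os.path.join(out_dir, *prefixes, title, str(rev_id))

# prefix_dump_path("dumps", "Foo", 12345) -> "dumps/F/Fo/Foo/12345"
# hashed_dump_path("dumps", "Foo", 12345) -> "dumps/<x>/<xy>/Foo/12345"
```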
Another option is to distribute an SQLite database keyed on title and revision, shipped as an lzma-compressed file (e.g. en.wikipedia.org_articles.sqlite.xz). Major advantages of this option are wide client support, random-access support out of the box, and the lack of special file system requirements. It is also easy to extend this format with additional metadata, and users can directly build indexes on that metadata. The biggest question mark for this option is performance at large database sizes, although posts like this one describe settings that seem to work at the sizes we need.
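To make the shape of this option concrete, here is a minimal sketch of such a database and its lookup path using Python's sqlite3 module; the table name, column names, and the WITHOUT ROWID clustered layout are assumptions, not a decided schema:

```python
import sqlite3

# Assumed schema: one row per (title, revision), HTML stored as a blob.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    title    TEXT    NOT NULL,
    revision INTEGER NOT NULL,
    html     BLOB    NOT NULL,
    PRIMARY KEY (title, revision)
) WITHOUT ROWID;
"""

def open_dump(path):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def put_revision(conn, title, revision, html):
    with conn:  # commits on success
        conn.execute(
            "INSERT OR REPLACE INTO articles (title, revision, html) VALUES (?, ?, ?)",
            (title, revision, html),
        )

def get_revision(conn, title, revision):
    row = conn.execute(
        "SELECT html FROM articles WHERE title = ? AND revision = ?",
        (title, revision),
    ).fetchone()
    return row[0] if row else None
```

For distribution, the database file as a whole would be xz-compressed as in the name above; users decompress it once and then get local random access, and extra metadata indexes can be added with plain CREATE INDEX statements.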
A format we'll likely offer in any case is ZIM, which is used by the Kiwix offline reader. The main issue with using ZIM as our primary format would be less-than-ubiquitous tooling support on various OS/language combinations. There is ongoing work to support incremental updates and diffing between ZIM files in T49406: Incremental update: zimdiff & zimpatch.
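For illustration, reading an article out of a ZIM file today means going through libzim or a tool built on it, which is exactly the tooling dependency mentioned above. A minimal sketch assuming the python-libzim reader bindings, with the file name and entry path purely illustrative:

```python
from libzim.reader import Archive  # python-libzim bindings around libzim

zim = Archive("wikipedia_en_all.zim")    # file name is illustrative
entry = zim.get_entry_by_path("A/Foo")   # entry path layout is an assumption
html = bytes(entry.get_item().content).decode("utf-8")
print(entry.title, len(html))
```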