
Decide on format options for HTML and possibly other dumps
Closed, Resolved · Public

Description

We need to figure out a good distribution format for our HTML dumps. Ideally, this format should:

  • offer compact downloads
  • be supported in most consumer environments
  • ideally, support incremental builds / updates
  • ideally, support random access to individual articles / revisions

Our current HTML dumper simply creates a directory per title, with a single file named after the revision number inside it. While simple, this does not scale too well on some file systems; a tar file containing 10 million subdirectories in a single directory would not work well for many users.

One option to avoid this is to use a subdirectory tree based on the actual title (like /F/Fo/Foo/12345) or based on a hash of the title. However, working with such a tree is not very straightforward and requires significant custom client-side code.
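For illustration, a minimal sketch of the client-side path computation such a layout would require (the directory depth and hash scheme here are assumptions, not settled conventions):

import hashlib
import os

def title_path(title, revision):
    # Title-prefix layout, e.g. title 'Foo', revision 12345 -> F/Fo/Foo/12345
    safe = title.replace('/', '%2F')
    return os.path.join(safe[:1], safe[:2], safe, str(revision))

def hash_path(title, revision, depth=2):
    # Hash-based layout: spread titles evenly across directories using
    # hex-pair levels derived from a hash of the title (md5 assumed here).
    safe = title.replace('/', '%2F')
    digest = hashlib.md5(title.encode('utf-8')).hexdigest()
    levels = [digest[i * 2:i * 2 + 2] for i in range(depth)]
    return os.path.join(*levels, safe, str(revision))

print(title_path('Foo', 12345))  # F/Fo/Foo/12345
print(hash_path('Foo', 12345))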

Another option is to distribute a sqlite database keyed on title and revision (lzma-compressed, e.g. en.wikipedia.org_articles.sqlite.xz). A major advantage of this option is wide client support and random access support out of the box, as well as lack of special requirements on the file system. It is also easy to extend this format with additional metadata, and users can directly construct indexes on this metadata. The biggest question mark for this option is performance for large db sizes, although posts like this one describe settings that seem to work with the database sizes we need.

A format we'll likely offer in any case is ZIM, which is used by the Kiwix offline reader. The main issue with using ZIM as our primary format would be less-than-ubiquitous tooling support on various OS/language combinations. There is ongoing work to support incremental updates and diffing between ZIM files in T49406: Incremental update: zimdiff & zimpatch.

Event Timeline

GWicke raised the priority of this task from to Medium.
GWicke updated the task description.
GWicke added subscribers: Hydriz, scfc, Ricordisamoa and 13 others.

A few notes from IRC discussion.

First, http://mattmahoney.net/dc/text.html is the canonical source for comparison of compression alternatives. There are some really good ones these days. Some of our biggest wins might come from preprocessing the HTML with an XML-aware tool, like xml-wrt http://mattmahoney.net/dc/text.html#1512

So, the current bzip2 stream + index of articles lets you do random access inside the dump without decompressing the entire thing; see https://github.com/cscott/seek-bzip for a recent implementation, although the technique has been well known for a while. OLPC used this to do real-time decompression from a compressed dump on the XO-1, with its very slow processor and very slow flash storage. You can also do HTTP Byte-Range requests to fetch specific articles from the WMF dumps without downloading the entire dump; nell-wikipedia was going to do this (which is why I wrote seek-bzip), but I didn't have time to implement it.
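For concreteness, here is a rough sketch of the byte-range approach against a multistream-style dump (the URL and offsets are hypothetical; in practice they would come from the dump's index file):

import bz2
import urllib.request

# Hypothetical values: the real URL and byte offsets would come from the dump
# listing and the multistream index file (offset of the bz2 stream containing
# the wanted article, and the offset of the next stream).
DUMP_URL = ('https://dumps.wikimedia.org/enwiki/latest/'
            'enwiki-latest-pages-articles-multistream.xml.bz2')
start, end = 123456789, 124000000  # hypothetical stream boundaries

req = urllib.request.Request(
    DUMP_URL, headers={'Range': 'bytes=%d-%d' % (start, end - 1)})
with urllib.request.urlopen(req) as resp:
    block = resp.read()

# Each stream in a multistream dump is an independently decompressable
# bzip2 stream, so no other part of the dump is needed.
print(bz2.decompress(block).decode('utf-8')[:500])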

In my opinion (and gwicke disagrees), it's probably worth doing a bzip2-stream format initially, since that's what everyone's got tools for, and then handling the upgrade to a better format as an incremental feature.

On the other hand, distributing an sqlite3 format database is not a terrible option, if you're going to make a clean break with compatibility. It's a shame to have to store the articles uncompressed and then compress the entire .db file, since that means you can't do anything with the dump without decompressing the whole thing first. On the other hand, that probably gives better compression and more flexibility in choosing a compression format.

One benefit of the sqlite3 format is that you can distribute a master database with article contents keyed by revision, and then a few smaller tables mapping titles to revisions and other article metadata (redirects, categories, revision timestamps, image information, etc). These smaller tables can be distributed in separate files.

If we did distribute a filesystem tree, one benefit would be that you could probably run rsync in order to efficiently update a dump. That might be very useful for some purposes.

I'm not sure what the purpose of a DB-like format is. Ideally, dumps should be dumps: merely a directory, perhaps as a tarball or 7z file.

If the discussion on formats keeps consuming energy as it has above, the obvious thing to do is to avoid reinventing the wheel and adopt ZIM.

I think a NoSQL solution like a key-value store is ideal for providing HTML dumps -- it has a natural mapping: key:title => value:HTML. Any additional metadata can easily be modeled in the key (e.g. title\0timestamp\0revision => HTML); see the sketch after the references below.

kvstores also seem to satisfy all of the requirements listed:

  1. offer compact downloads: a kvstore like LevelDB/RocksDB [1][2] would produce thousands of files, but those files are already compressed, so just tar'ing them would be sufficient for providing a single-file download.
  2. be supported in most consumer environments: bindings for kvstores like LevelDB are available in nearly all languages, and reading/writing data takes just a few lines of code. A simple command-line utility for reading/writing/iterating could also be provided.
  3. ideally, support incremental builds / updates: almost any kvstore supports adding and deleting entries.
  4. ideally, support random access to individual articles / revisions: kvstores provide random lookup by key as well as iteration over key-value pairs. Data is decompressed on the fly.

[1] https://github.com/google/leveldb
[2] https://github.com/facebook/rocksdb
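A minimal sketch of this key scheme, assuming the plyvel LevelDB binding (the database path, key layout details and values are illustrative):

import plyvel

# LevelDB stores its data as a directory of already-compressed table files,
# which could simply be tar'ed up for distribution. Path is illustrative.
db = plyvel.DB('enwiki-html.ldb', create_if_missing=True)

def make_key(title, timestamp, revision):
    # Key scheme proposed above: title\0timestamp\0revision
    return '\0'.join([title, timestamp, str(revision)]).encode('utf-8')

# Write an entry; values are the article HTML as bytes.
db.put(make_key('Foo', '2015-04-22T08:45:00Z', 12345), b'<html>...</html>')

# Random access by exact key ...
html = db.get(make_key('Foo', '2015-04-22T08:45:00Z', 12345))

# ... or iterate over all revisions of a title via a key prefix.
for key, value in db.iterator(prefix=b'Foo\0'):
    print(key, len(value))

db.close()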

The main issues with using ZIM as our primary format would be the need for specialized and not very widely available libraries for read access

What do you mean? They're even packaged in Ubuntu, AFAICS.

a lack of incremental update support.

According to T49406#492033, incremental update already works. Testing and merging that patch may be a wiser use of time.

The main issues with using ZIM as our primary format would be the need for specialized and not very widely available libraries for read access

What do you mean? They're even packaged in Ubuntu, AFAICS.

That's great to hear. I think you'd probably agree though that there is less out-of-the-box support for working with OpenZIM in various OS/distro x programming language combinations than there is for XML, tar files or SQLite dbs.

I think it's important that we offer at least one lowest-common-denominator format that is easy to work with on any platform. ZIM does not seem to be that at this point, so I don't think it would be wise to offer only ZIM.

a lack of incremental update support.

According to T49406#492033, incremental update already works. Testing and merging that patch may be a wiser use of time.

I wasn't aware of that work, so thanks for the pointer. Updated the description to reflect this.

That's great to hear. I think you'd probably agree though that there is less out-of-the-box support for working with OpenZIM in various OS/distro x programming language combinations than there is for XML, tar files or SQLite dbs.

I can't tell. tar is certainly easy for users, but SQLite and XML are not so much, judging from the mass of confusion and support requests we constantly see on mediawiki.org from people trying to set them up for MediaWiki.

Here is my analysis of the current situation, based on a concrete use case.

The latest WPFR (French Wikipedia) text-only ZIM file is here. It contains all main-namespace articles in HTML & all redirects, both revamped for a nice end-user UX:
http://download.kiwix.org/zim/wikipedia/wikipedia_fr_all_nopic_2015-03.zim.torrent

  • Offer compact downloads? It is 2.8 GB.
  • Be supported in most consumer environments? It can be read on Windows, OS X, Linux and Symbian, on x86 & ARM devices with at least 32 MB RAM; iOS support is coming soon. It can be used through the command line & with end-user-ready readers.
  • Ideally, support incremental builds / updates? Yes, almost (but this topic is a pretty complicated one whatever format you choose - it's not only about the format, it's also a real question for the server-side infrastructure and the client-side software - I would tend to put this aside for now).
  • Ideally, support random access to individual articles / revisions? It was conceived exactly for that purpose: to strike a balance between compression ratio & random-access speed.

I encourage people following this ticket to bench other solutions against this content (WPFR), so we can make comparisons. Otherwise it might be pretty challenging to make a good decision.

About the ZIM toolbox: ZIM does not have a large toolbox like bzip2, but you already have the tools to do almost whatever you want on all desktop OSes:

  • The zimlib is now in Debian/Ubuntu; it would be easy to make an RPM.
  • zimdump works well on POSIX systems and allows inspecting ZIM files on the command line.
  • We have a stub kiwix-read command-line tool distributed for Windows/Linux/OS X which is intended to offer a high-level command-line reader and would be really easy to improve (but we need feature requests for this).
  • Have a look at http://www.openzim.org/wiki/Bindings and http://www.openzim.org/wiki/Readers for a bit more choice.

That said, one of the most important points is not in the ticket description: which audience are we talking about? The largest audience is end users, and for them you need to ask additional questions: Do we have an end-user-ready solution to work with these dumps? Are we able to provide dumps with pictures? Are we able to provide different compression levels? With ZIM files we can. Then you have the developers and researchers, who might be more interested in having more structured data and a large variety of tools able to deal with the snapshot, and care less about the compression level or about having pictures. In that case, other more widely used solutions might be a good approach too. An additional remark: we also have to understand that the HTML end users are interested in is not the same as the HTML stored in the Parsoid cache!

To conclude: ZIM (generated with an advanced solution like mwoffliner) is IMO, without any doubt, the best solution for end users. For developers/researchers, text-only ZIM files are a good option too, but considering the level of awareness of the format, existing habits, etc., it might make sense to offer a more well-known/bare-bones solution for raw Parsoid cache files (bzip2, lzma), and if we want to offer something more sophisticated (more than the title->HTML mapping), then a solution like SQLite might be suitable too.

Thank you for your summary, @Kelson. I agree with your assessment on ZIM being the best format for end users.

To me the main question at this point is more about which lowest-common-denominator format we should offer for developers. I just added SQLite support in https://github.com/gwicke/htmldumper, and am now testing both tar & sqlite backends with enwiki.

First results:

  • creating a sqlite3 dump of dewiki ns0 HTML takes about 5 hours in labs, with a concurrent tar using IO bandwidth in the background
  • updating a dewiki dump takes about 30 minutes

Sizes, uncompressed vs. xz -2:

42G  de.wikipedia.org.html.current.ns0.sqlite3
6.3G de.wikipedia.org.html.current.ns0.sqlite3.xz
6.7G de.wikipedia.org.html.current.ns0.tar.xz
7.1G fr.wikipedia.org.html.current.ns0.sqlite3.xz
181G en.wikipedia.org.html.current.ns0.sqlite3
25G  en.wikipedia.org.html.current.ns0.sqlite3.xz

Tar-ing up a filesystem-based dump takes a very long time if the IO subsystem is slow. For example, in labs creating a tar of dewiki took about 20 hours, with a lot of random IO wait caused by the tar process. The size of the tar.xz also turns out to be slightly larger than that of the corresponding compressed sqlite database, which is somewhat surprising.

From a dump creation standpoint sqlite seems to be far superior to filesystem & tar based dumps. Decompressing & compressing a sqlite database is relatively fast using recent multi-threaded xz or other parallel lzma compressors like pixz. The IO pattern is predominantly sequential. Incremental updates are fast, with a relatively modest amount of random reads during index traversal.

Using a database also makes it relatively easy to add revision metadata equivalent to what the current XML dumps provide. The current (WIP) schema looks like this:

CREATE TABLE data(
    title TEXT,
    revision INTEGER,
    tid TEXT,
    body TEXT,
    page_id INTEGER,
    namespace INTEGER,
    timestamp TEXT,
    comment TEXT,
    user_name TEXT,
    user_id INTEGER,
    PRIMARY KEY(title ASC, revision DESC)
);
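For illustration, a client could fetch the latest stored revision of a title, or apply an incremental update, along these lines (the file name and all values here are just placeholders):

import sqlite3

# Placeholder file name; assumes an existing dump with the WIP schema above.
db = sqlite3.connect('de.wikipedia.org.html.current.ns0.sqlite3')

# Random access: latest stored revision of a title.
row = db.execute(
    'SELECT revision, tid, body FROM data '
    'WHERE title = ? ORDER BY revision DESC LIMIT 1',
    ('Berlin',)).fetchone()

# Incremental update: upsert keyed on (title, revision); all values below
# are illustrative.
db.execute(
    'INSERT OR REPLACE INTO data '
    '(title, revision, tid, body, page_id, namespace, timestamp, '
    'comment, user_name, user_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
    ('Berlin', 152063322, 'example-tid', '<html>...</html>',
     64, 0, '2015-04-22T08:45:00Z', 'example edit', 'ExampleUser', 12345))
db.commit()
db.close()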

What do you think?

Nemo_bis lowered the priority of this task from Medium to Low. · Apr 9 2015, 7:14 AM
GWicke raised the priority of this task from Low to Medium. · Apr 10 2015, 12:06 AM

FWIW, I would vote for compressed sqlite db over custom formats like zim.
Is there any enwiki HTML dump ready for download?

WIP sqlite test dumps are available at http://209.141.56.60/. Enwiki is just being re-generated after a schema change, and should become available in a few hours.

@GWicke Is it possible to use the naming conventions used on dumps.wikimedia.org? The HTML dumps should be named abwiki-$DATE-articles.ns0.sqlite3.xz instead of ab.wikipedia.org.articles.ns0.sqlite3.xz to ensure consistency.

@Hydriz, that's definitely possible. Since these are a different kind of dump without existing users it might be worth revisiting some older decisions on the naming though:

  • DB name (enwiki) vs. domain (en.wikipedia.org): The domain is universally known, while the db name can be fairly obscure for casual users, especially for projects like mediawiki.org.
  • Timestamping: We should probably offer both a stable link to the latest version and a timestamped 'permalink' to a specific dump run. For timestamping, we could use either the file name (as is done for wikitext dumps), or the directory. One thing to consider is that SQLite dumps can be updated incrementally by the user, so a timestamped filename might become inaccurate on update.

@Hydriz, @Kelson, @ArielGlenn: What is your take on this? Do you have / know of a particular workflow that favors continuing the use of the exact same pattern for sqlite-based HTML dumps?

@GWicke Here are some of my comments on the points that you have raised above:

1. dbname/domain
For the dbname/domain part, I am okay with either. However, consider the current situation: all our dumps (whether wikitext or other miscellaneous ones) use the format "$dbname-$date". While there aren't any current users of these HTML dumps, I am in favor of maintaining consistency to avoid additional confusion.

2. Time stamp
I think this is the more important point. Including the date a dump was produced makes it easier for people to know when a given version was produced. It helps when comparing two versions from different dates, and it helps me archive the dumps with confirmation of when each was produced.

If the HTML dumps are made incrementally, with an older dump immediately replaced by a newer one, it will be hard for data archivists and researchers to archive/analyse the older dumps.

If there is only one file produced per wiki, I would rather we just store all the versions of the dump in the same directory, with a separate directory per wiki. Creating a new directory for each dump date is unnecessary in my opinion.

+1 on Hydriz (and thanks for looking into storage/archiving already).

@GWicke Great work, but I think en.wikipedia.org.articles.ns0.sqlite3 is incomplete; the current article count is 4,857,103:

$ ls -la en.wikipedia.org.articles.ns0.sqlite3 
-r--r--r--. 1 xx xx 155047293952 Apr 22 08:45 en.wikipedia.org.articles.ns0.sqlite3
$ sqlite3 en.wikipedia.org.articles.ns0.sqlite3 
SQLite version 3.6.20
sqlite> select title from data order by title desc limit 5;
Sam_Tung_Uk_Museum
Sam_Tuivailala
Sam_Tuitupou
Sam_Tucker
Sam_Tsui
sqlite> select count(revision) from data;         
3826478

@Kubina, I suspect the {{NUMBEROFARTICLES}} return of 4,857,103 includes redirects, while the dump currently excludes them. We could consider adding an 'is_redirect' flag (or even just the destination title), and then include redirects in the dump.

@GWicke While 4,857,103 includes redirects, sorting by title shows that the last article title in the database is Sam_Tung_Uk_Museum, and querying for the titles "Water" or "U2" does not return anything. I think all the articles after Sam_Tung_Uk_Museum are missing.

$ sqlite3 en.wikipedia.org.articles.ns0.sqlite3 
SQLite version 3.6.20
sqlite> select count(title) from data where (title="Water");
0
sqlite> select count(title) from data where (title="U2");
0
sqlite> select count(title) from data where (title="Sam_Tung_Uk_Museum");
1

@Kubina: Ah, that is indeed an issue. Good catch! There was no error reported, so clearly we need to check if errors in title paging are properly handled.

Some comments after looking a bit at the dumps produced.

Only ns 0 is included; while there is no need to include templates and lua modules and so forth given that the pages are already rendered, adding in the project and portal pages might be nice.

Redirects should be included in the dump as was mentioned above; users may want those for search purposes if nothing else.

One of the things the 7z and bz2 files give dump users is the ability to work with small enough files, streaming them to pull out the data they desire or, in the case of the bz2 dumps, random access. In fact I have simple tools for use with the multistream bz2 files for that very purpose. The compressed sqlite db format makes this impossible; the file must be decompressed before it can be used.

We should bear in mind that the current compressed sqlite format is probably unworkable for full revision history dumps. The uncompressed file generated first would be truly ginormous (10T of space 2 years ago for the corresponding XML files, though that does include all namespaces), and I guess most users wouldn't want to be working with files that large.

For all but the large wikipedias, however, and for current revisions only, this format seems good to me for people who want a database format for running queries.

Some comments after looking a bit at the dumps produced.

Only ns 0 is included; while there is no need to include templates and lua modules and so forth given that the pages are already rendered, adding in the project and portal pages might be nice.

Since a lot of users will probably want the articles only, I think it makes sense to offer the other namespaces separately.

One of the things the 7z and bz2 files give dump users is the ability to work with small enough files, streaming them to pull out the data they desire or, in the case of the bz2 dumps, random access. In fact I have simple tools for use with the multistream bz2 files for that very purpose. The compressed sqlite db format makes this impossible; the file must be decompressed before it can be used.

With SQLite we trade streaming against convenient random access. I believe most users will find this more useful, as it avoids the need to convert XML dumps into a random-access format first. The ZIM format should be a good complement for users who can't afford 200G of disk space, as it enables random access without decompressing the full dump.

We should bear in mind that the current compressed sqlite format is probably unworkable for full revision history dumps. The uncompressed file generated first would be truly ginormous (10T of space 2 years ago for the corresponding XML files, though that does include all namespaces), and I guess most users wouldn't want to be working with files that large.

We could keep the size of individual dbs manageable by splitting full-history dumps by page id. With a more efficient way to retrieve changes, it should be fairly efficient to update those dumps in place.

In fact, we currently split the history dumps into chunks by page id range, which with some fiddling works pretty well. I guess you may want more pieces than we generate, since users will be working with them uncompressed.

The thing about streaming is that a typical thing folks do with these dumps is grab a subset of articles by running the compressed dump through a perl/python/language-of-your-choice script to select what they want. That's going to be pretty inconvenient with the new format, though for queries it is of course much better.

Another thing folks do is set up or update small mirror sites; that means converting the xml files to importable sql, another thing that would be pretty annoying to do with this new format.

One size probably doesn't fit all.

We could create a script for producing an XML stream from a SQLite or ZIM dump, for the benefit of existing tools. With ZIM we should be able to avoid unpacking the dump as well.

Unpacking also doesn't take that long with parallelized xz utils or pixz. Enwiki articles (current revisions) use ~181G of disk space, which isn't too bad with current disk sizes and compressing filesystems.
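A rough sketch of the XML-stream script mentioned above, reading from the sqlite dump (the element names are illustrative, not the schema of the existing XML dumps):

import sqlite3
from xml.sax.saxutils import escape

def sqlite_to_xml(db_path, out):
    # Stream rows out of the sqlite HTML dump as a simple XML document.
    # Element names here are illustrative; a real script would mimic the
    # existing XML dump schema so current tools keep working.
    db = sqlite3.connect(db_path)
    out.write('<pages>\n')
    for title, revision, body in db.execute(
            'SELECT title, revision, body FROM data ORDER BY title'):
        out.write('  <page>\n')
        out.write('    <title>%s</title>\n' % escape(title))
        out.write('    <revision>%d</revision>\n' % revision)
        out.write('    <html>%s</html>\n' % escape(body or ''))
        out.write('  </page>\n')
    out.write('</pages>\n')
    db.close()

if __name__ == '__main__':
    import sys
    sqlite_to_xml(sys.argv[1], sys.stdout)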

@Kubina, enwiki after a re-run, still without redirects:

sqlite> select count(revision) from data;
4860316

I just pushed a change to include redirects as well. Preliminary dumps have now moved to http://dumps.wikimedia.org/htmldumps/dumps/. Overhauling the naming will be one of the next steps.

I admit to having already planned for names of the format wikiname-yyyymmdd-stuff.ext where stuff has whatever you want to put in it. I often have multiple downloads of a certain small wiki lying around in the same directory so I like to have the date right in the name. I don't care if the name is the fqdn or the wiki name, whatever's convenient.

@ArielGlenn, a more precise timestamp might be useful in case we do multiple dumps per day / re-run a dump. Otherwise that format works for me.

@GWicke I think this is great. I took a look at a few articles in http://dumps.wikimedia.org/htmldumps/dumps/simple.wikipedia.org.articles.ns0.sqlite3.xz and it looked fine. I particularly like how these dumps include other small details like category, indicator and alt text. Very nice work.

I noticed one issue with the table's data. Only the title, revision, and body columns have data; the rest are all blank. You can run the following SQL and see that only one empty row gets returned.

SELECT tid, page_id, namespace, timestamp, comment, user_name, user_id FROM data GROUP BY tid, page_id, namespace, timestamp, comment, user_name, user_id;

Also, I'd vote to offer redirects and other namespaces as well. As Ariel mentioned, Project and Portal, but also Category, File and Help. I even think that dumping Template, Module, and MediaWiki could have some utility value.

Perhaps for big dumps, a version of the sqlite3 file could be offered where the 'body' is replaced by a zlib-compressed BLOB (say, 'zbody')? That'd allow scripted random-access without requiring 200GB+ expansion all at once.
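A minimal sketch of that variant, assuming a 'zbody' BLOB column in place of 'body' (the file and column names are illustrative):

import sqlite3
import zlib

db = sqlite3.connect('en.wikipedia.org.articles.ns0.zbody.sqlite3')  # assumed name
db.execute('CREATE TABLE IF NOT EXISTS data('
           'title TEXT, revision INTEGER, zbody BLOB, '
           'PRIMARY KEY(title ASC, revision DESC))')

# Write: compress each article body individually.
html = '<html>...</html>'
db.execute('INSERT OR REPLACE INTO data (title, revision, zbody) VALUES (?, ?, ?)',
           ('Foo', 12345, sqlite3.Binary(zlib.compress(html.encode('utf-8')))))
db.commit()

# Read: random access with per-row decompression, no full-dump expansion needed.
(zbody,) = db.execute('SELECT zbody FROM data WHERE title = ? '
                      'ORDER BY revision DESC LIMIT 1', ('Foo',)).fetchone()
print(zlib.decompress(zbody).decode('utf-8'))
db.close()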

ArielGlenn claimed this task.

I'm going to close this as long since settled; we can always add new formats or new content as needed.