What format conversion and final outputs should we provide for dumps?
Open, Medium, Public

Description

Some folks stream dump output into a script for processing, others process it in parallel by feeding it to Hadoop, and still others load it into a database. What output formats and converters can we provide to support all these cases?

Event Timeline

ArielGlenn triaged this task as Medium priority. Mar 6 2016, 1:03 PM

I always convert an XML dump into a stream of denormalized revisions when processing.

E.g.

  1. The mwxml library provides a nice interface for iterating over revisions.
  2. I generally convert to JSON Lines format ("revdocs") when processing XML dumps in Hadoop.
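
To make the "stream of denormalized revisions" idea concrete, here is a rough sketch of such a converter. It deliberately uses only the Python standard library (`xml.etree.ElementTree.iterparse`) instead of mwxml, so it is not how mwxml does it. The function names (`iter_revdocs`, `convert`) and the output field names are made up for illustration and are not the actual revdocs schema. It also buffers one page's revisions at a time, which may be too much memory for pages with very long histories.

```python
import io
import json
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Return the text of the first direct child with the given local name."""
    for child in elem:
        if local(child.tag) == name:
            return child.text
    return None


def iter_revdocs(xml_file):
    """Yield one flat dict per <revision>, repeating the page fields on each.

    Field names are hypothetical, chosen for this sketch only.
    """
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local(elem.tag) != "page":
            continue
        page_id = int(child_text(elem, "id"))
        page_title = child_text(elem, "title")
        for child in elem:
            if local(child.tag) != "revision":
                continue
            yield {
                "page_id": page_id,
                "page_title": page_title,
                "rev_id": int(child_text(child, "id")),
                "timestamp": child_text(child, "timestamp"),
                "text": child_text(child, "text") or "",
            }
        elem.clear()  # drop the finished page's children to free memory


def convert(xml_file, out):
    """Write one JSON document per line, one line per revision."""
    for doc in iter_revdocs(xml_file):
        out.write(json.dumps(doc) + "\n")
```

Something like `convert(sys.stdin.buffer, sys.stdout)` would then let the output be piped into a script or split by line for parallel processing.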

I sometimes rely on the ordering inherent in the XML dumps to process revision histories, but that's hard to reproduce independently, since ordering by rev ID and ordering by timestamp are not the same. Whenever predictable ordering is critical, I re-order the whole dataset on (page_id, timestamp, rev_id).
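
A minimal sketch of that re-ordering, assuming each revision is a dict with `page_id`, `timestamp` and `rev_id` keys (hypothetical field names) and that timestamps are ISO 8601 UTC strings, which sort correctly as plain strings. This sorts in memory; on a full dump the same key would more likely be handed to whatever sort the processing framework provides.

```python
from operator import itemgetter


def reorder(revdocs):
    """Return revisions sorted by (page_id, timestamp, rev_id).

    rev_id only breaks ties between revisions of the same page that
    share a timestamp, which makes the result deterministic.
    """
    return sorted(revdocs, key=itemgetter("page_id", "timestamp", "rev_id"))


# Example: rev 12 has the higher ID but the earlier timestamp,
# so it sorts before rev 11 within page 1.
revs = [
    {"page_id": 2, "timestamp": "2016-01-01T00:00:00Z", "rev_id": 7},
    {"page_id": 1, "timestamp": "2016-01-03T00:00:00Z", "rev_id": 11},
    {"page_id": 1, "timestamp": "2016-01-02T00:00:00Z", "rev_id": 12},
]
ordered = reorder(revs)
```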