Some folks stream their output into a script for processing, others process it in parallel by feeding it to Hadoop, and still others shovel it into a database; what output formats and converters can we provide to support all these cases?
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | None | | T128519 Dumps 2.0 Flexibility design questions |
| Open | None | | T129022 What format conversion and final outputs should we provide for dumps? |
Event Timeline
I always convert an XML dump into a stream of denormalized revisions when processing.
E.g.
- the mwxml library provides a nice interface for iterating over revisions
- I generally convert to JSON-lines format ("revdocs") when processing XML dumps in Hadoop
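The conversion described above can be sketched with only the standard library. This is a minimal, hypothetical stand-in for what mwxml does, not its actual implementation: it streams an XML dump with `xml.etree.ElementTree.iterparse` and emits one denormalized JSON-serializable dict ("revdoc") per revision. The tag names follow the MediaWiki export schema, though real dumps also carry an XML namespace that is omitted here for brevity.

```python
import io
import json
import xml.etree.ElementTree as ET

# Tiny stand-in for a page history from an XML dump (export namespace
# omitted for brevity; real dumps qualify every tag with it).
DUMP_XML = """<mediawiki>
  <page>
    <title>Example</title>
    <id>1</id>
    <revision>
      <id>10</id>
      <timestamp>2016-01-01T00:00:00Z</timestamp>
      <text>First revision</text>
    </revision>
    <revision>
      <id>11</id>
      <timestamp>2016-01-02T00:00:00Z</timestamp>
      <text>Second revision</text>
    </revision>
  </page>
</mediawiki>"""

def revdocs(xml_file):
    """Yield one denormalized dict per revision in the dump."""
    page = None
    # iterparse streams the input, so a whole dump never sits in memory
    for event, elem in ET.iterparse(xml_file, events=("start", "end")):
        if event == "start" and elem.tag == "page":
            page = elem
        elif event == "end" and elem.tag == "revision":
            # <title> and <id> precede the revisions, so they are
            # already parsed by the time the first revision closes
            yield {
                "page_id": int(page.find("id").text),
                "page_title": page.find("title").text,
                "rev_id": int(elem.find("id").text),
                "timestamp": elem.find("timestamp").text,
                "text": elem.find("text").text,
            }
            elem.clear()  # free the finished revision subtree

docs = list(revdocs(io.StringIO(DUMP_XML)))
lines = "\n".join(json.dumps(d) for d in docs)  # JSON-lines output
```

Each output line is an independent JSON object, which is what makes this format convenient to split across Hadoop workers or pipe into a script one record at a time.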
I sometimes use the ordering inherent in the XML dumps to process revision histories, but that's hard to reproduce independently since ID ordering and timestamp ordering don't always agree. Whenever predictable ordering is critical, I re-order the whole dataset on (page_id, timestamp, rev_id).
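A minimal sketch of that re-ordering, using hypothetical revdoc records to show why the compound key matters: the sample deliberately contains a revision whose ID order disagrees with its timestamp order.

```python
# Hypothetical out-of-order revdocs. Note that on page 1, rev_id 21
# carries an earlier timestamp than rev_id 20, so sorting by ID alone
# would not replay the history in time order.
revdocs = [
    {"page_id": 2, "rev_id": 30, "timestamp": "2016-03-01T00:00:00Z"},
    {"page_id": 1, "rev_id": 21, "timestamp": "2016-01-01T00:00:00Z"},
    {"page_id": 1, "rev_id": 20, "timestamp": "2016-02-01T00:00:00Z"},
]

def history_order(doc):
    """Group revisions by page, replay each page's history in time
    order, and break timestamp ties deterministically by rev_id."""
    return (doc["page_id"], doc["timestamp"], doc["rev_id"])

# ISO 8601 UTC timestamps sort correctly as plain strings, so no
# datetime parsing is needed for the sort key.
ordered = sorted(revdocs, key=history_order)
```

An in-memory `sorted` only illustrates the key; at dump scale the same (page_id, timestamp, rev_id) key would drive an external sort or a Hadoop shuffle instead.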