Use case: Page-order text processing
Implement a sorted XML-to-JSON ETL (probably Crunch-based). Each output file should contain whole pages (partition key = "page_id") in sorted order (sort by "timestamp", then by "rev_id"). We'll use this dataset for page-level metrics extraction, of which we do many different types (e.g. diffing, extraction of <ref> tags). Because the sort happens once, during the ETL, none of the many subsequent passes over the dataset will need to sort.
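To make the intended ordering concrete, here is a minimal in-memory sketch in Python. It is not the Crunch job itself, only an illustration of the partition-then-sort contract. The function names (group_and_sort, to_json_lines) are hypothetical. It also assumes each record is already parsed into a dict carrying "page_id", "timestamp", and "rev_id", and that timestamps are ISO 8601 strings in a single time zone, so that string order matches chronological order.

```python
import json
from collections import defaultdict


def group_and_sort(revisions):
    """Yield (page_id, records) with each page's records sorted.

    Assumes every record is a dict with "page_id", "timestamp" and
    "rev_id" keys (hypothetical input shape, not the real XML schema).
    """
    pages = defaultdict(list)
    for rev in revisions:
        # Partition: all records of a page end up in the same bucket.
        pages[rev["page_id"]].append(rev)

    for page_id in sorted(pages):
        # Sort within the page by timestamp, breaking ties on rev_id.
        ordered = sorted(pages[page_id],
                         key=lambda r: (r["timestamp"], r["rev_id"]))
        yield page_id, ordered


def to_json_lines(revisions):
    """Serialize records as JSON lines, keeping whole pages contiguous."""
    lines = []
    for _, records in group_and_sort(revisions):
        lines.extend(json.dumps(r, sort_keys=True) for r in records)
    return lines
```

A downstream pass can then read one page at a time with a single sequential scan, since all records for a given "page_id" are adjacent and already in order. A distributed implementation would replace the in-memory dict with a shuffle keyed on "page_id" and a secondary sort on ("timestamp", "rev_id").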