Transform to XML-->JSON in sorted file format [8 pts]
Closed, Resolved, Public


Use-case: Page-order text processing

Implement a (probably Crunch-based) sorted XML-->JSON ETL. Each output file should contain whole pages (partition key = "page_id"), with each page's revisions in sorted order (sort by "timestamp" and then "rev_id"). We'll use this dataset for page-level metrics extraction, of which we do many different types (e.g. diffing, extraction of <ref> tags, etc.). Because the sort happens once, in the ETL, none of the many subsequent passes over the dataset will need to sort.
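
To make the intended output shape concrete, here is a minimal single-machine sketch in Python. It is not the Crunch job itself, only an illustration of the partition-and-sort contract described above. The simplified, un-namespaced XML layout, the JSON field names other than "page_id", "timestamp" and "rev_id", and the helper names `revisions_from_xml` and `sorted_page_partitions` are all assumptions made for this example. It also assumes timestamps are ISO 8601 strings in UTC, so that string order matches time order.

```python
import json
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified dump: real dumps carry an XML namespace and more fields.
SAMPLE = """<mediawiki>
  <page><id>2</id>
    <revision><id>11</id><timestamp>2015-08-11T13:55:00Z</timestamp><text>b</text></revision>
    <revision><id>10</id><timestamp>2015-08-11T13:55:00Z</timestamp><text>a</text></revision>
  </page>
  <page><id>1</id>
    <revision><id>7</id><timestamp>2015-09-09T21:51:00Z</timestamp><text>d</text></revision>
    <revision><id>5</id><timestamp>2015-08-25T15:37:00Z</timestamp><text>c</text></revision>
  </page>
</mediawiki>"""


def revisions_from_xml(xml_text):
    """Yield one flat dict per <revision>, tagged with its page's id."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        page_id = int(page.findtext("id"))  # direct child <id> of <page>
        for rev in page.iter("revision"):
            yield {
                "page_id": page_id,
                "rev_id": int(rev.findtext("id")),
                "timestamp": rev.findtext("timestamp"),
                "text": rev.findtext("text") or "",
            }


def sorted_page_partitions(revisions):
    """Group revisions by page_id, then sort each group by (timestamp, rev_id).

    Yields (page_id, [json_line, ...]) so that a page is never split
    across two partitions.
    """
    by_page = defaultdict(list)
    for rev in revisions:
        by_page[rev["page_id"]].append(rev)
    for page_id in sorted(by_page):
        revs = sorted(by_page[page_id], key=lambda r: (r["timestamp"], r["rev_id"]))
        yield page_id, [json.dumps(r, sort_keys=True) for r in revs]


if __name__ == "__main__":
    for page_id, lines in sorted_page_partitions(revisions_from_xml(SAMPLE)):
        for line in lines:
            print(line)
```

A downstream pass (diffing, <ref> extraction) could then stream each page's lines in order without sorting again. In a distributed job the in-memory grouping would presumably be replaced by a partition on "page_id" with a secondary sort on ("timestamp", "rev_id").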

Event Timeline

Halfak created this task. Aug 11 2015, 1:55 PM
Halfak raised the priority of this task to Needs Triage.
Halfak updated the task description.
Halfak added a project: Analytics.
Halfak added subscribers: Halfak, JAllemandou.
Restricted Application added a subscriber: Aklapper. Aug 11 2015, 1:55 PM
JAllemandou added a project: Analytics-Backlog.
JAllemandou set Security to None.
JAllemandou triaged this task as Normal priority. Aug 25 2015, 3:37 PM
JAllemandou edited projects, added Analytics-Kanban; removed Analytics-Backlog.
JAllemandou renamed this task from Transform to XML-->JSON in sorted file format to Transform to XML-->JSON in sorted file format [8 pts]. Aug 28 2015, 3:39 PM
kevinator closed this task as Resolved. Sep 9 2015, 9:51 PM
kevinator added a subscriber: kevinator.