Webrequest-refine currently shuffles both raw and augmented data to enforce that rows are distinct. Because augmented values are computed deterministically from the raw data, distinctness can be enforced using the raw data alone. This avoids shuffling augmented data between mappers and reducers, which reduces network and disk I/O.
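The reasoning can be illustrated with a minimal Python sketch. It is not the actual refine query: the row shape, the `augment` function, and the sample values are all hypothetical stand-ins. It only shows that when augmentation is a deterministic function of the raw row, deduplicating before augmenting yields the same result as deduplicating after, while the deduplication step handles narrower rows.

```python
# Hypothetical raw row: (hostname, sequence). Not the real webrequest schema.
Raw = tuple[str, str]
# Raw row plus one hypothetical augmented field.
Refined = tuple[str, str, str]


def augment(row: Raw) -> Refined:
    """Deterministic stand-in for the derived columns of the refine step."""
    hostname, sequence = row
    return (hostname, sequence, hostname.upper())


def distinct_after_augment(rows: list[Raw]) -> set[Refined]:
    """Augment first, then deduplicate: the wide rows go through the dedup."""
    return {augment(r) for r in rows}


def distinct_before_augment(rows: list[Raw]) -> set[Refined]:
    """Deduplicate raw rows first, then augment only the survivors."""
    return {augment(r) for r in set(rows)}


# Sample input with one duplicated raw row (assumed data).
rows = [("cp1001", "1"), ("cp1001", "1"), ("cp1002", "7")]
same = distinct_before_augment(rows) == distinct_after_augment(rows)
print(same)
```

In a distributed job the deduplication is the step that requires a shuffle, so keeping augmented columns out of it is what reduces the data moved between mappers and reducers.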
Description
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
analytics/refinery | master | +107 -25 | Improve webrequest-refine query
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T265487 Review recurrent Hadoop worker disk saturation events
Resolved | | JAllemandou | T267008 Improve webrequest-refine shuffle-sort
Event Timeline
Change 638086 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Improve webrequest-refine query shuffle stage
Change 638086 merged by Joal:
[analytics/refinery@master] Improve webrequest-refine query