Objective: Aggregate pageview counts into an intermediate state that is quick and easy to query (Intermediate Aggregates)
Key Result: Intermediate aggregates are used to generate data for the Pageview API and other useful cubes. Data is available starting in May 2015.
Story: Someone in an analyst role runs an SQL query to get pageview numbers for Executives, FR, or Communications.
For example: "What were the monthly pageviews in France, excluding spiders?"
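The example question above can be sketched as a single aggregate query. The snippet below is only an illustration: it uses Python's built-in sqlite3 module and a toy in-memory table in place of the real Hive/Impala table, and the column names (year, month, day, hour, country_code, agent_type, view_count) and the 'spider' label are assumptions made for the sketch, not a confirmed schema. The numbers are invented demo data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for the hourly aggregate table (toy data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE pageview_hourly (
            year INTEGER, month INTEGER, day INTEGER, hour INTEGER,
            country_code TEXT,   -- assumed column name
            agent_type TEXT,     -- assumed values: 'user' or 'spider'
            view_count INTEGER   -- assumed column name
        )
        """
    )
    rows = [
        (2015, 5, 1, 0, "FR", "user", 100),
        (2015, 5, 1, 1, "FR", "spider", 40),   # excluded by the spider filter
        (2015, 5, 2, 0, "FR", "user", 60),
        (2015, 6, 1, 0, "FR", "user", 30),
        (2015, 5, 1, 0, "DE", "user", 500),    # excluded by the country filter
    ]
    conn.executemany(
        "INSERT INTO pageview_hourly VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    return conn


def monthly_pageviews(conn, country_code):
    """Sum hourly view counts per month for one country, leaving out spiders."""
    query = """
        SELECT year, month, SUM(view_count) AS views
        FROM pageview_hourly
        WHERE country_code = ?
          AND agent_type != 'spider'
        GROUP BY year, month
        ORDER BY year, month
    """
    return conn.execute(query, (country_code,)).fetchall()
```

Calling monthly_pageviews(build_demo_db(), "FR") on the toy data returns one (year, month, views) row per month. Against the real dataset, the same SELECT would presumably be run through Hive or Impala with the actual table and column names substituted.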
Two datasets are produced:
Dataset | Documentation | Availability |
Intermediate Aggregates | https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly | starting May 2015 |
Project Level Aggregates | https://wikitech.wikimedia.org/wiki/Analytics/Data/Projectview_hourly | starting April 2015 |
Backfill Data for April
T96067: Compute pageviews aggregates daily and monthly from April {wren}
Tasks to set up Impala (a serving layer on the cluster that can be queried)
T96328: setup 'testing' dataset on hive for Impala {wren} [13 pts]
T96329: Install Impala on cluster {wren}
T96330: test performance of Impala {wren} [8 pts]
T96331: Productionize Impala {hawk}