Create job that backfills Pagecounts-EZ (2011 - 2016) data via hadoop correcting issues
Closed, Resolved | Public | 8 Estimated Story Points

Description

The idea here is to ingest old pagecounts-ez files into hive and generate new and shiny files that solve the following problems:

Event Timeline

Change 596605 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery/source@master] Add special explode UDTF that turns EZ-style hourly strings into rows

https://gerrit.wikimedia.org/r/596605

fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Change 597541 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery@master] Add Pageviews Complete dumps backfilling job

https://gerrit.wikimedia.org/r/597541

Is our plan to maintain the data in hadoop tables as well? So the data ingested will remain?

I think unless we have a good reason to keep it, data should only be kept as long as it's useful to generate the dumps. We can't immediately drop it because each day needs the previous day's data in order to obtain hour 0.

Although now I'm thinking that the job could delete the day partition and leave the hour 0 in the next day's partition untouched... yesyesyes
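
To make the retention idea above concrete, here is a minimal sketch using an in-memory dict as a stand-in for the Hive table. Everything in it is an assumption for illustration: the helper names `ingest_file` and `dump_and_drop` are hypothetical, days are plain integers, and it assumes that a daily file carries hours 1-23 of its own day plus hour 0 of the following day. It is not the actual job.

```python
from collections import defaultdict


def ingest_file(table, day, rows):
    """Store rows from one daily file; each row is (day_offset, hour, views).

    Assumption: the file for `day` carries hours 1-23 of `day` (offset 0)
    and hour 0 of the following day (offset 1).
    """
    for offset, hour, views in rows:
        table[day + offset].append((hour, views))


def dump_and_drop(table, day):
    """Return the rows for `day` sorted by hour and drop only that partition."""
    return sorted(table.pop(day, []))


# Hypothetical table: partition (day) -> list of (hour, views) rows.
table = defaultdict(list)

# Day 1 file: hours 1-23 of day 1 plus hour 0 of day 2.
ingest_file(table, 1, [(0, h, 10 + h) for h in range(1, 24)] + [(1, 0, 99)])

day1 = dump_and_drop(table, 1)   # rows used to generate the day 1 dump
remaining = dict(table)          # only the hour 0 row in day 2's partition is left
```

Under these assumptions, dropping day 1's partition right after its dump is generated loses nothing that day 2 still needs, because the hour 0 row already sits in day 2's partition.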

Change 597740 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery/source@master] Add UDF that transforms Pagecounts-EZ projects into standard

https://gerrit.wikimedia.org/r/597740

Change 596605 merged by jenkins-bot:
[analytics/refinery/source@master] Add special explode UDTF that turns EZ-style hourly strings into rows

https://gerrit.wikimedia.org/r/596605

Change 597740 merged by jenkins-bot:
[analytics/refinery/source@master] Add UDF that transforms Pagecounts-EZ projects into standard

https://gerrit.wikimedia.org/r/597740

Change 597541 merged by Joal:
[analytics/refinery@master] Add pageview historical dumps backfilling job

https://gerrit.wikimedia.org/r/597541

Nuria set the point value for this task to 8.

Just curious: is this data supposed to be in the pageview_historical table now? I only see data for 2014, filled by this coordinator: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0010055-200608135941564-oozie-oozi-C/. I don't see anything from before that.