
Set up edit_hourly data set in Hive
Closed, Resolved | Public | 13 Estimated Story Points

Description

Add a new Hive data set named edit_hourly following this spec:
https://docs.google.com/document/d/1jzrE3xdyEHed4Ek5ORRedOlEeH-i111hdmG3tBTF8QU
This task includes:

  • An Oozie coordinator that runs every month after mediawiki_history_denormalized has succeeded and populates the edit_hourly table with data since the beginning of time, adding a new snapshot partition
  • Add the table to the deletion list in refinery-drop-mediawiki-snapshots.
  • Documentation for the data set in Wikitech

Event Timeline

Change 501197 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add edit_hourly oozie job

https://gerrit.wikimedia.org/r/501197

Change 501328 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add edit_hourly to list of tables to be purged of old snapshots

https://gerrit.wikimedia.org/r/501328

fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Thanks @Neil_P._Quinn_WMF! Forgot to do that.
As you see, we had a slight change of plans in the implementation.
We encountered an issue in Druid: for Hive tables stored in Parquet format, it does not allow applying transforms to fields that are not listed as dimensions.
So we decided to create this intermediate table in Hive called edit_hourly (or maybe edit_daily, if we find that hourly granularity performs poorly).
This way we won't need to use druid transforms (transforms will happen in Hadoop via HiveSQL).
Also, we can take advantage of having the Hive version of the data set for more detailed querying.
Druid developers are fixing this issue in the new version, but it will still take some time until we upgrade to it.
In any case, it won't hurt to have that intermediate table in Hive.
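
For illustration only, here is a minimal Python sketch of the kind of transform that moves out of Druid and into the Hadoop side: truncating each edit's timestamp to the hour and counting edits per bucket. The field names (wiki_db, event_timestamp), the timestamp format, and the function names are assumptions made for this example; the real job does this in HiveSQL and follows the linked spec, which may differ.

```python
from collections import Counter
from datetime import datetime


def truncate_to_hour(ts: str) -> str:
    """Truncate a 'YYYY-MM-DD HH:MM:SS' timestamp to the start of its hour."""
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%d %H:00:00")


def edits_per_hour(events):
    """Count edits per (wiki, hour) bucket.

    `events` is an iterable of dicts with hypothetical keys
    'wiki_db' and 'event_timestamp' (assumed for this sketch).
    """
    counts = Counter()
    for event in events:
        bucket = (event["wiki_db"], truncate_to_hour(event["event_timestamp"]))
        counts[bucket] += 1
    return dict(counts)


# Made-up sample rows, only to exercise the functions above.
sample = [
    {"wiki_db": "enwiki", "event_timestamp": "2019-04-01 10:15:42"},
    {"wiki_db": "enwiki", "event_timestamp": "2019-04-01 10:59:59"},
    {"wiki_db": "dewiki", "event_timestamp": "2019-04-01 11:00:00"},
]
hourly = edits_per_hour(sample)
```

Doing the truncation before loading means Druid only sees ready-made hourly rows and needs no transform of its own.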

Change 501197 merged by Mforns:
[analytics/refinery@master] Add edit_hourly oozie job

https://gerrit.wikimedia.org/r/501197

Change 501328 merged by Mforns:
[analytics/refinery@master] Add edit_hourly to list of tables to be purged of old snapshots

https://gerrit.wikimedia.org/r/501328

Nuria set the point value for this task to 5.
mforns changed the point value for this task from 5 to 13.