
Set up edit_hourly data set in Hive
Closed, Resolved · Public · 13 Story Points

Description

Add a new Hive data set named edit_hourly following this spec:
https://docs.google.com/document/d/1jzrE3xdyEHed4Ek5ORRedOlEeH-i111hdmG3tBTF8QU
This task includes:

  • An Oozie coordinator that runs every month after mediawiki_history_denormalized has succeeded and populates the edit_hourly table from the beginning of time, adding a new snapshot partition
  • Addition of the table to refinery-drop-mediawiki-snapshots, so that old snapshots are purged
  • Documentation for the data set on Wikitech
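
As an illustration of the bullet points above, here is a minimal sketch of what an edit_hourly DDL could look like. The column names and types are hypothetical (the actual schema lives in the linked spec document); what matters is the snapshot partition, which is what the monthly coordinator adds and refinery-drop-mediawiki-snapshots later purges:

```sql
-- Hypothetical sketch, not the actual refinery DDL.
-- Each monthly run recomputes the full history under a new snapshot
-- partition; refinery-drop-mediawiki-snapshots deletes old snapshots.
CREATE TABLE IF NOT EXISTS wmf.edit_hourly (
    ts          TIMESTAMP COMMENT 'Hour bucket of the edits',
    project     STRING    COMMENT 'Wiki project, e.g. en.wikipedia',
    user_is_bot BOOLEAN   COMMENT 'Whether the editing user is a bot',
    edit_count  BIGINT    COMMENT 'Number of edits in this hour bucket'
)
PARTITIONED BY (snapshot STRING COMMENT 'e.g. 2019-04')
STORED AS PARQUET;
```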

Event Timeline

mforns created this task. Apr 4 2019, 11:52 AM
Restricted Application added a subscriber: Aklapper. Apr 4 2019, 11:52 AM

Change 501197 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add edit_hourly oozie job

https://gerrit.wikimedia.org/r/501197

mforns updated the task description. Apr 4 2019, 2:07 PM

Change 501328 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add edit_hourly to list of tables to be purged of old snapshots

https://gerrit.wikimedia.org/r/501328

mforns moved this task from Next Up to In Progress on the Analytics-Kanban board. Apr 4 2019, 4:11 PM
fdans triaged this task as High priority. Apr 4 2019, 5:00 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
mforns added a comment. Apr 5 2019, 2:32 PM

Thanks @Neil_P._Quinn_WMF! I forgot to do that.
As you can see, we had a slight change of plans in the implementation.
We encountered an issue in Druid: for Hive tables stored in Parquet format, it does not allow applying transforms to fields that are not listed as dimensions.
So we decided to create an intermediate table in Hive called edit_hourly (or perhaps edit_daily, if hourly granularity turns out to perform poorly).
This way we won't need Druid transforms; the transforms will happen in Hadoop via HiveQL.
We can also take advantage of having the Hive version of the data set for more detailed querying.
The Druid developers are fixing this issue in a newer version, but it will still take some time until we upgrade to it.
In any case, it won't hurt to have that intermediate table in Hive.
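
To make the change of plans concrete, here is a hedged sketch of the kind of transform that would otherwise have required a Druid ingestion-time transform, done instead in HiveQL while populating the intermediate table. The source field names are borrowed from the public mediawiki_history schema, but the query itself is illustrative, not the real refinery job:

```sql
-- Illustrative only: derive dimensions (hour bucket, bot flag) in Hive,
-- so Druid can ingest edit_hourly directly, with no transforms at all.
INSERT OVERWRITE TABLE edit_hourly PARTITION (snapshot = '2019-03')
SELECT
    DATE_FORMAT(event_timestamp, 'yyyy-MM-dd HH:00:00')  AS ts,
    wiki_db                                              AS project,
    ARRAY_CONTAINS(event_user_is_bot_by, 'group')        AS user_is_bot,
    COUNT(*)                                             AS edit_count
FROM wmf.mediawiki_history
WHERE snapshot = '2019-03'
  AND event_entity = 'revision'
GROUP BY
    DATE_FORMAT(event_timestamp, 'yyyy-MM-dd HH:00:00'),
    wiki_db,
    ARRAY_CONTAINS(event_user_is_bot_by, 'group');
```

Since the heavy lifting happens in Hadoop, the Druid ingestion spec only has to list the precomputed columns as dimensions and metrics.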

Change 501197 merged by Mforns:
[analytics/refinery@master] Add edit_hourly oozie job

https://gerrit.wikimedia.org/r/501197

Change 501328 merged by Mforns:
[analytics/refinery@master] Add edit_hourly to list of tables to be purged of old snapshots

https://gerrit.wikimedia.org/r/501328

Nuria closed this task as Resolved. May 14 2019, 8:43 PM
Nuria set the point value for this task to 5.
mforns updated the task description. May 15 2019, 2:44 PM
mforns changed the point value for this task from 5 to 13.