
Spike - Slowly Changing Dimensions on Druid
Closed, Declined · Public · 13 Story Points

Description

The schemas the analytics team foresees for edit data involve managing slowly changing dimensions (https://en.wikipedia.org/wiki/Slowly_changing_dimension).
This spike is about understanding how Druid handles those and what our best shot is from a schema perspective.

This task should be time-bound.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. May 9 2016, 5:34 PM
Nuria updated the task description. May 12 2016, 5:04 PM
Nuria updated the task description. May 12 2016, 5:07 PM
Nuria set the point value for this task to 13.
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.

Two approaches:

  • Load the data from Altiscale that @JAllemandou already has in a single flat, denormalized schema. We'll query it in Hive and in Druid once the cluster is up.
  • Load the data into separate Druid datasets (splitting up the Altiscale data). Querying this would be more complicated, but we could compare the performance.

Once we have performance numbers from denormalized, normalized, and Hive, we can think about next steps.
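To make the difference between the two approaches concrete, here is a minimal Python sketch, not Druid or Hive code. It assumes a hypothetical page-title history (a slowly changing dimension) kept separately from revision rows, and shows how the flat, denormalized form would be produced by resolving the title that was valid at each revision's timestamp. The sample values and the names `PAGE_TITLE_HISTORY`, `title_as_of`, and `denormalize` are illustrative assumptions, not part of the test data.

```python
from bisect import bisect_right

# Hypothetical dimension history: page_id -> list of (valid_from, page_title),
# sorted by valid_from. ISO 8601 strings in one format sort chronologically.
PAGE_TITLE_HISTORY = {
    1: [
        ("2016-01-01T00:00:00", "Old_title"),
        ("2016-03-01T00:00:00", "New_title"),
    ],
}

# Hypothetical normalized fact rows: revisions carry only the page key.
REVISIONS = [
    {"id": 10, "timestamp": "2016-02-15T12:00:00", "page_id": 1, "bytes": 120},
    {"id": 11, "timestamp": "2016-03-02T08:30:00", "page_id": 1, "bytes": 340},
]


def title_as_of(page_id, timestamp):
    """Return the title in effect at `timestamp`, or None if none is known."""
    history = PAGE_TITLE_HISTORY.get(page_id, [])
    starts = [valid_from for valid_from, _ in history]
    # Index of the last history entry whose valid_from is <= timestamp.
    idx = bisect_right(starts, timestamp) - 1
    return history[idx][1] if idx >= 0 else None


def denormalize(revisions):
    """Build flat rows by copying the as-of title onto each revision."""
    return [
        dict(rev, page_title=title_as_of(rev["page_id"], rev["timestamp"]))
        for rev in revisions
    ]


flat_rows = denormalize(REVISIONS)
```

In the denormalized approach this resolution happens once, before loading; in the split approach the equivalent as-of match has to happen at query time, which is where the extra query complexity comes from.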

Test data is revision oriented and based on this schema:

id BIGINT,
timestamp STRING,
page_id BIGINT,
page_title STRING,
page_namespace BIGINT,
page_redirect STRING,
page_restrictions ARRAY<STRING>,
user_id BIGINT,
user_user_text STRING,
minor BOOLEAN,
comment STRING,
bytes BIGINT,
sha1 STRING,
parent_id BIGINT,
model STRING,
format STRING
Nuria moved this task from Paused to Next Up on the Analytics-Kanban board. Jul 21 2016, 4:50 PM
Milimetric triaged this task as Normal priority. Aug 8 2016, 4:52 PM
Milimetric added a subscriber: Milimetric.

Idea: test this on the pageview datasource already loaded, making a lookup table for the Chrome 41 bug or something else.
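As a rough illustration of the lookup-table idea, the Python sketch below labels already-loaded rows through a small mapping applied at read time instead of re-ingesting the data. It is not Druid's lookup API; the field names `browser` and `browser_major`, the label `chrome_41_bug`, and the sample rows are all hypothetical.

```python
# Hypothetical lookup table: (browser family, major version) -> annotation.
UA_LOOKUP = {("Chrome", "41"): "chrome_41_bug"}


def annotate(rows, lookup, default="unaffected"):
    """Return copies of the rows with an `annotation` taken from the lookup."""
    return [
        dict(
            row,
            annotation=lookup.get((row["browser"], row["browser_major"]), default),
        )
        for row in rows
    ]


# Hypothetical rows standing in for the pageview datasource.
rows = [
    {"browser": "Chrome", "browser_major": "41", "view_count": 5},
    {"browser": "Firefox", "browser_major": "45", "view_count": 3},
]
annotated = annotate(rows, UA_LOOKUP)
```

Because the mapping lives outside the stored rows, changing it changes how every historical row is labeled, which is the property that makes a lookup interesting for a dimension that changes slowly.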

Milimetric moved this task from Operational Excellence Future to Dashiki on the Analytics board.
Nuria moved this task from Dashiki to Backlog (Later) on the Analytics board. Sep 26 2016, 3:40 PM
Nuria closed this task as Resolved. Mar 20 2017, 3:32 PM
Milimetric reopened this task as Open. Mar 22 2017, 5:19 PM

This was not resolved; we never loaded slowly changing dimensions the way we imagined here. It's fine if we decide we no longer want to do that, but then we should set it to "Declined".

Nuria added a comment. Mar 23 2017, 6:27 PM

Setting to "declined".

Nuria closed this task as Declined. Mar 23 2017, 6:28 PM