
Spike - Slowly Changing Dimensions on Druid
Closed, Declined · Public · 13 Estimated Story Points

Description

The schemas the analytics team foresees for edit data involve managing slowly changing dimensions (https://en.wikipedia.org/wiki/Slowly_changing_dimension).
This spike is about understanding how Druid handles them and what our best shot is from a schema perspective.
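For context, a slowly changing dimension is one whose attributes (say, a page's title or namespace) change occasionally over time. The usual way to track this is "Type 2": keep one row per version of the dimension, each with a validity interval, and close the open row when a change arrives. A minimal Python sketch of the idea (the page-rename example and the `valid_from`/`valid_to` field names are illustrative, not from this task):

```python
# Minimal SCD Type 2 sketch: each dimension row carries a validity
# interval; a change closes the current row and opens a new one.

OPEN_END = "9999-12-31"  # sentinel meaning "still current"

def apply_change(history, change_ts, new_attrs):
    """Close the currently open row at change_ts and append a new row."""
    current = history[-1]
    assert current["valid_to"] == OPEN_END
    current["valid_to"] = change_ts
    history.append({**current, **new_attrs,
                    "valid_from": change_ts, "valid_to": OPEN_END})
    return history

def attrs_at(history, ts):
    """Return the dimension row that was valid at timestamp ts."""
    for row in history:
        if row["valid_from"] <= ts < row["valid_to"]:
            return row
    return None

# Example: a page rename tracked as two versions of the page dimension.
page = [{"page_id": 42, "page_title": "Foo",
         "valid_from": "2010-01-01", "valid_to": OPEN_END}]
apply_change(page, "2015-06-01", {"page_title": "Bar"})

print(attrs_at(page, "2012-01-01")["page_title"])  # Foo
print(attrs_at(page, "2016-01-01")["page_title"])  # Bar
```

The spike's question is essentially whether Druid can express this "which version was valid at event time" join efficiently, or whether we must pre-flatten it.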

This task should be time-bound.

Event Timeline

Nuria set the point value for this task to 13.

Two approaches:

  • try to load data from Altiscale that @JAllemandou already has in a single flat denormalized schema. We'll query this in Hive and Druid when the cluster is up.
  • try to load data in separate Druid datasources (the Altiscale data split up). Querying this would be more complicated, but we can compare the performance.

Once we have performance numbers from denormalized, normalized, and Hive, we can think about next steps.
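To make the comparison concrete, the second approach amounts to splitting each flat revision row into a small fact row plus a dimension row keyed by `page_id`, and re-joining at query time. A hedged Python sketch of that split (the field subset is chosen for illustration from the schema below):

```python
# Sketch: split a flat, denormalized revision row (first approach) into
# a fact row plus a page-dimension row (second approach), then re-join.

PAGE_FIELDS = ("page_title", "page_namespace")

def split(row):
    """Separate page attributes out of a flat revision row."""
    fact = {k: v for k, v in row.items() if k not in PAGE_FIELDS}
    page_dim = {"page_id": row["page_id"],
                **{k: row[k] for k in PAGE_FIELDS}}
    return fact, page_dim

def join(fact, page_dims):
    """Re-join a fact row against a {page_id: attrs} dimension table."""
    attrs = page_dims[fact["page_id"]]
    return {**fact, **{k: v for k, v in attrs.items() if k != "page_id"}}

flat = {"id": 1001, "timestamp": "2016-01-01T00:00:00Z",
        "page_id": 42, "page_title": "Foo", "page_namespace": 0,
        "user_id": 7, "bytes": 1234}

fact, page_dim = split(flat)
rebuilt = join(fact, {page_dim["page_id"]: page_dim})
assert rebuilt == flat  # the join reconstructs the denormalized row
```

The performance question is what this join costs when Druid (or Hive) does it at scale, versus storing the flat row directly.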

Test data is revision oriented and based on this schema:

id BIGINT,
timestamp STRING,
page_id BIGINT,
page_title STRING,
page_namespace BIGINT,
page_redirect STRING,
page_restrictions ARRAY<STRING>,
user_id BIGINT,
user_user_text STRING,
minor BOOLEAN,
comment STRING,
bytes BIGINT,
sha1 STRING,
parent_id BIGINT,
model STRING,
format STRING
Milimetric triaged this task as Medium priority. Aug 8 2016, 4:52 PM
Milimetric subscribed.

Idea: test this on the pageview datasource already loaded, making a lookup table for the Chrome 41 bug or something else.
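A Druid lookup maps a dimension value to a replacement at query time, which is what the idea above would exercise on the pageview datasource. A toy Python sketch of the mechanism (the mapping contents are made up for illustration; a real fix-up for the Chrome 41 bug would live in a registered Druid lookup, not in application code):

```python
# Toy model of a Druid query-time lookup: rewrite a dimension value
# through a small key -> value map while scanning rows.

# Hypothetical mapping; the real contents would come from the
# Chrome 41 investigation, not from this sketch.
browser_fixup = {"Chrome 41": "Chrome 41 (suspected bot)"}

def apply_lookup(rows, dim, lookup):
    """Replace row[dim] via lookup, keeping unmapped values as-is."""
    return [{**r, dim: lookup.get(r[dim], r[dim])} for r in rows]

rows = [{"browser": "Chrome 41", "views": 10},
        {"browser": "Firefox 45", "views": 3}]

for r in apply_lookup(rows, "browser", browser_fixup):
    print(r["browser"], r["views"])
```

Testing this on an already-loaded datasource would tell us how lookups behave without waiting for the edit data to land.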

This was never resolved: we never loaded slowly changing dimensions the way we imagined here. It's fine if we decide we no longer want to do that, but in that case we should set the task to "Declined".