Add editors_monthly data cube to Druid
Closed, Duplicate · Public

Description

We have now added an edits_hourly cube to Druid (T211173), but while that makes it possible to count edits, it doesn't make it possible to count distinct editors. We can't simply add a user_name column to that cube, because Druid is not well suited to columns with many distinct values.

Instead, we should create a separate cube where each row corresponds to the aggregate behavior of a single editor on a single wiki during a single month (essentially, an editor-month dataset, but with a much richer schema than described on that page).
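
To illustrate the row granularity proposed here, the following is a minimal Python sketch that collapses individual edit records into one row per (month, wiki, editor). The field names (`wiki`, `user_name`, `timestamp`) and the single `edit_count` measure are assumptions for illustration only; they are not the actual schema of edits_hourly or of the proposed cube, which would carry more columns.

```python
from collections import Counter
from datetime import datetime

# Hypothetical edit records; field names are illustrative, not a real schema.
edits = [
    {"wiki": "enwiki", "user_name": "Alice", "timestamp": "2019-06-03T10:15:00"},
    {"wiki": "enwiki", "user_name": "Alice", "timestamp": "2019-06-20T08:00:00"},
    {"wiki": "dewiki", "user_name": "Alice", "timestamp": "2019-06-21T09:30:00"},
    {"wiki": "enwiki", "user_name": "Bob", "timestamp": "2019-07-01T12:00:00"},
]


def editor_month_rows(edits):
    """Aggregate edits into one row per (month, wiki, editor)."""
    counts = Counter()
    for edit in edits:
        # Truncate the timestamp to its calendar month, e.g. "2019-06".
        month = datetime.fromisoformat(edit["timestamp"]).strftime("%Y-%m")
        counts[(month, edit["wiki"], edit["user_name"])] += 1
    return [
        {"month": month, "wiki": wiki, "user_name": user, "edit_count": n}
        for (month, wiki, user), n in sorted(counts.items())
    ]


rows = editor_month_rows(edits)
for row in rows:
    print(row)
```

In this sketch Alice's two June edits on enwiki become a single row, so the number of rows grows with the number of editor-wiki-month combinations rather than with the number of edits.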

Some initial thoughts:

  • I currently use an editor-month dataset (neilpquinn.editors_monthly) to calculate active editors for movement metrics. We should make sure this dataset is available in Hive and use it for the movement metrics.
  • With the cube set up in this fashion, it would not be possible for Turnilo users to calculate the global number of active editors, because the rows will be split by wiki and won't actually identify the users concerned (just as edits_hourly doesn't actually identify the edits concerned). If this is a serious concern, we can add a separate cube where a single row corresponds to a single editor across all wikis during a single month.
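
The limitation in the last point can be sketched in Python under some assumptions: the rows below are made-up editor-month data, and the tuple layout (month, wiki, user, edit count) is hypothetical. Once the user identity is dropped, only per-wiki editor counts remain, and summing them counts an editor once for every wiki they edited. A cube keyed by (month, editor) across all wikis avoids this.

```python
from collections import defaultdict

# Hypothetical editor-month rows: (month, wiki, user, edit_count).
# Each (month, wiki, user) combination appears at most once.
rows = [
    ("2019-06", "enwiki", "Alice", 7),
    ("2019-06", "dewiki", "Alice", 5),
    ("2019-06", "enwiki", "Bob", 6),
]


def editors_per_wiki(rows):
    """What a per-wiki cube without user identity can still report."""
    counts = defaultdict(int)
    for month, wiki, _user, _edits in rows:
        counts[(month, wiki)] += 1
    return dict(counts)


def global_editor_rows(rows):
    """One row per editor per month across all wikis (the second cube)."""
    totals = defaultdict(int)
    for month, _wiki, user, edits in rows:
        totals[(month, user)] += edits
    return dict(totals)


per_wiki = editors_per_wiki(rows)
# Summing per-wiki counts counts Alice twice (enwiki and dewiki).
summed = sum(n for (month, _wiki), n in per_wiki.items() if month == "2019-06")

global_rows = global_editor_rows(rows)
# Counting rows of the cross-wiki dataset counts each editor once.
distinct = sum(1 for (month, _user) in global_rows if month == "2019-06")

print(summed, distinct)
```

Here the per-wiki sum is larger than the number of distinct editors because Alice edited two wikis in the same month.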

Event Timeline

Neil_P._Quinn_WMF created this task.

@MNeisler, assigning you because this seems like a natural continuation of your work; unassign yourself if I'm missing something 😊

kzimmerman removed MNeisler as the assignee of this task. Jul 9 2019, 8:38 PM
kzimmerman triaged this task as Normal priority.
kzimmerman moved this task from Triage to Backlog on the Product-Analytics board.

I think this will go to Connie when she joins; unassigning Megan