
Add editors_monthly data cube to Druid
Closed, Duplicate | Public

Description

We have now added an edits_hourly cube to Druid (T211173). That makes it possible to count edits, but not to count distinct editors. We can't simply add a user_name column to that cube, because Druid is not well suited to high-cardinality columns (columns with many distinct values).

Instead, we should create a separate cube where each row corresponds to the aggregate behavior of a single editor on a single wiki during a single month (essentially, an editor-month dataset, but with a much richer schema than described on that page).
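
To make the proposed row granularity concrete, here is a minimal Python sketch that rolls raw edit events up into one row per (wiki, month, editor). The input tuples, the `editor_months` helper, and the output field names (`wiki`, `month`, `user_name`, `edit_count`) are all hypothetical and chosen for illustration; the real cube would presumably carry many more dimensions and measures.

```python
from collections import Counter
from datetime import datetime

# Hypothetical edit events: (wiki, user_name, timestamp).
edits = [
    ("enwiki", "Alice", datetime(2019, 1, 3, 10)),
    ("enwiki", "Alice", datetime(2019, 1, 9, 12)),
    ("dewiki", "Alice", datetime(2019, 1, 15, 8)),
    ("enwiki", "Bob", datetime(2019, 1, 20, 17)),
    ("enwiki", "Bob", datetime(2019, 2, 1, 9)),
]


def editor_months(events):
    """Aggregate edit events into one row per (wiki, month, editor)."""
    counts = Counter()
    for wiki, user, ts in events:
        counts[(wiki, ts.strftime("%Y-%m"), user)] += 1
    # Sort the keys so the output order is deterministic.
    return [
        {"wiki": w, "month": m, "user_name": u, "edit_count": c}
        for (w, m, u), c in sorted(counts.items())
    ]


rows = editor_months(edits)
for row in rows:
    print(row)
```

Five edit events collapse into four editor-month rows here, because Alice's two enwiki edits in January share a row. At ingestion time the editor identity would be rolled up rather than kept as a dimension, which is the point of the separate cube.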

Some initial thoughts:

  • I currently use an editor-month dataset (neilpquinn.editors_monthly) to calculate active editors for movement metrics. We should make sure this dataset is available in Hive and use it for the movement metrics.
  • With the cube set up in this fashion, it would not be possible for Turnilo users to calculate the global number of active editors, because the rows will be split by wiki and won't actually identify the users concerned (just as edits_hourly doesn't actually identify the edits concerned). If this is a serious concern, we can add a separate cube where a single row corresponds to a single editor across all wikis during a single month.
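
The limitation in the last point can be illustrated with a small Python sketch. It is not Druid or Turnilo code; the sample rows and the helper names `per_wiki_counts` and `global_count` are hypothetical. It compares the sum of per-wiki editor counts, which is all a cube split by wiki could offer, with the deduplicated count across wikis, which needs the editor identities.

```python
from collections import defaultdict

# Hypothetical editor-month rows: (wiki, month, user_name).
# Alice is active on two wikis in the same month.
editor_month_rows = [
    ("enwiki", "2019-01", "Alice"),
    ("dewiki", "2019-01", "Alice"),
    ("enwiki", "2019-01", "Bob"),
    ("frwiki", "2019-01", "Carol"),
]


def per_wiki_counts(rows, month):
    """Distinct editors per wiki for one month."""
    editors = defaultdict(set)
    for wiki, m, user in rows:
        if m == month:
            editors[wiki].add(user)
    return {wiki: len(users) for wiki, users in editors.items()}


def global_count(rows, month):
    """Distinct editors across all wikis; requires the user identity."""
    return len({user for _, m, user in rows if m == month})


per_wiki = per_wiki_counts(editor_month_rows, "2019-01")
summed = sum(per_wiki.values())
true_global = global_count(editor_month_rows, "2019-01")

print(per_wiki)      # per-wiki counts
print(summed)        # sum of per-wiki counts
print(true_global)   # deduplicated count across wikis
```

With this sample data the per-wiki counts sum to 4 while only 3 distinct editors exist, because Alice is counted once per wiki. A cube with one row per editor per month across all wikis, as suggested above, would avoid this double counting.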