For the moment, the mediawiki_history_reduced dataset that we load into Druid to serve the WKS2 API is built using Hive, and uses revision events, page events, user events, and page and user daily and monthly digest events.
If we want to go for full digests (no more revision, page, or user events, only page and user daily and monthly digests), it means a big denormalization across many dimensions. This also means a lot of event duplication and therefore a huge dataset, not to mention the cost of the Hive query.
The idea behind this task is to take advantage of our data being fairly denormalized by default: in many cases, the event itself is also its own aggregated value for a day across many dimensions. In Spark, we could bundle into an array of strings all the aggregated values covered by an event, without duplicating the rest of the event.
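A minimal pure-Python sketch of the idea (field names and labels are hypothetical, and in Spark the `aggregates` field would be an `array<string>` column): instead of emitting one duplicated copy of an event per digest it contributes to, we keep a single row and attach an array naming the aggregates it covers.

```python
def tag_event(event, aggregate_levels=("daily", "monthly")):
    """Keep one row per event, bundling the aggregates it covers
    into an array-of-strings field instead of duplicating the row."""
    tagged = dict(event)
    tagged["aggregates"] = [f"{level}_digest" for level in aggregate_levels]
    return tagged


def denormalize(event, aggregate_levels=("daily", "monthly")):
    """The full-digest alternative: duplicate the whole event once per
    aggregate, which is what makes the dataset blow up in size."""
    return [dict(event, aggregate=f"{level}_digest")
            for level in aggregate_levels]


event = {"page_id": 42, "event_type": "revision", "day": "2019-01-01"}
tagged = tag_event(event)    # one row, aggregate labels bundled in an array
rows = denormalize(event)    # two rows, all event fields duplicated
```

With the bundled representation, the row count stays equal to the number of events, and the serving layer can explode the array only where a given aggregate is actually queried.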