For the moment, the `mediawiki_history_reduced` dataset that we load into Druid to serve the WKS2 API is built using Hive, and uses revision events, page events, user events, and page and user daily and monthly digest events. If we want to go for full digests (no more revision, page, and user events, only page and user daily and monthly digests), it means denormalization over many dimensions. This also means a lot of data duplication and therefore a huge dataset, not even thinking of the Hive query.
The idea is to use Spark and take advantage of our data being fairly denormalized by default: in many cases, the event itself is also its aggregated value for a day over many dimensions.
In Spark, the idea would be to bundle into an array of strings all the aggregated values that an event covers, without duplicating the event itself, therefore preventing a lot of data duplication.
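A minimal sketch of that bundling idea, in plain Python rather than Spark for brevity. The function name and the level labels (`event`, `daily_digest`, `monthly_digest`) are hypothetical; the counts are assumed to come from a group-by over the event's dimensions in Spark. The point is that when a single event is the only one in its day (or month) for a given dimension combination, it already IS the digest value, so we tag the row with the aggregation levels it covers instead of emitting duplicated digest rows:

```python
def tag_aggregation_levels(daily_event_count, monthly_event_count):
    """Return the list of aggregation levels a single event covers.

    If the event is the only event for its dimensions on that day,
    it also serves as the daily digest value; likewise for the month.
    The counts are assumed to be computed upstream (e.g. a group-by
    in Spark over the event's dimension columns).
    """
    levels = ["event"]
    if daily_event_count == 1:
        levels.append("daily_digest")
    if monthly_event_count == 1:
        levels.append("monthly_digest")
    return levels


# A lone event in its day and month covers all three levels at once,
# so one stored row can answer event-, daily-, and monthly-level queries:
print(tag_aggregation_levels(daily_event_count=1, monthly_event_count=1))

# An event sharing its day and month with others only covers the event level;
# the digest values for that bucket must be aggregated separately:
print(tag_aggregation_levels(daily_event_count=3, monthly_event_count=7))
```

In an actual Spark job the array would become a column on each row, and the query layer would filter on the level label instead of scanning duplicated digest rows.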