For the moment, the `mediawiki_history_reduced` dataset that we load into Druid to serve the WKS2 API is built using Hive, and uses revision events, page events, user events, and page and user daily and monthly digest events. If we want to go for full digests (no more revision, page, and user events, only page and user daily and monthly digests), it means denormalization over many dimensions. This also means a lot of data duplication and therefore a huge dataset, not even thinking of the Hive query.
The idea is to use Spark and take advantage of our data being fairly denormalized by default: in many cases, the event itself is also its aggregated value for a day over many dimensions.
In Spark, the idea would be to bundle into an array of strings all the aggregated values that an event covers, without duplicating the event itself, therefore preventing a lot of data duplication.
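A minimal sketch of that bundling idea, in plain Python rather than Spark for brevity. The function name and the level labels (`event`, `daily_digest`, `monthly_digest`) are hypothetical; the counts are assumed to come from a group-by over the event's dimensions in Spark. The point is that when a single event is the only one in its day (or month) for a given dimension combination, it already IS the digest value, so we tag the row with the aggregation levels it covers instead of emitting duplicated digest rows:

```python
def tag_aggregation_levels(daily_event_count, monthly_event_count):
    """Return the list of aggregation levels a single event covers.

    If the event is the only event for its dimensions on that day,
    it also serves as the daily digest value; likewise for the month.
    The counts are assumed to be computed upstream (e.g. a group-by
    in Spark over the event's dimension columns).
    """
    levels = ["event"]
    if daily_event_count == 1:
        levels.append("daily_digest")
    if monthly_event_count == 1:
        levels.append("monthly_digest")
    return levels


# A lone event in its day and month covers all three levels at once,
# so one stored row can answer event-, daily-, and monthly-level queries:
print(tag_aggregation_levels(daily_event_count=1, monthly_event_count=1))

# An event sharing its day and month with others only covers the event level;
# the digest values for that bucket must be aggregated separately:
print(tag_aggregation_levels(daily_event_count=3, monthly_event_count=7))
```

In an actual Spark job the array would become a column on each row, and the query layer would filter on the level label instead of scanning duplicated digest rows.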