
Implement digest-only mediawiki_history_reduced dataset in spark
Open, High, Public

Description

For the moment, the mediawiki_history_reduced dataset that we load into Druid to serve the WKS2 API is built using Hive, and contains revision events, page events, user events, and page and user daily and monthly digest events.

If we want to go for full digests (no more revision, page, and user events; only page and user daily and monthly digests), it means a big denormalization over many dimensions. This also means a lot of event duplication and therefore a huge dataset, to say nothing of the complexity of the Hive query.

The idea behind this task is to take advantage of our data being pretty denormalized by default: in many cases, the event itself is also its aggregated value for a day over many dimensions. In Spark, we could bundle into an array of strings all the aggregated values that are covered by an event, without duplicating the rest of the event.
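A minimal Spark (Scala) sketch of that idea, assuming a hypothetical schema: the paths, the event_timestamp column, and the digest-tag format below are all illustrative, not the real mediawiki_history_reduced layout:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DigestBundleSketch {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("digest-only mediawiki_history_reduced sketch")
      .getOrCreate()

    // Hypothetical input path and schema (assumes an event_timestamp column).
    val events = spark.read.parquet("/path/to/mediawiki_history_reduced")

    // Each event row is kept once; a multi-valued string column lists the
    // daily and monthly digest cells that this single event already accounts
    // for, instead of duplicating the row per digest.
    val withDigests = events.withColumn(
      "digests",
      array(
        concat(lit("daily_digest:"),
               date_format(col("event_timestamp"), "yyyy-MM-dd")),
        concat(lit("monthly_digest:"),
               date_format(col("event_timestamp"), "yyyy-MM"))
      )
    )

    // Loaded into Druid as a multi-value dimension, a filter such as
    // digests = "daily_digest:2017-11-30" would select exactly the events
    // that make up that day's digest, with no extra rows stored.
    withDigests.write.parquet("/path/to/mediawiki_history_reduced_digests")

    spark.stop()
  }
}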

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 30 2017, 11:14 AM
fdans moved this task from Incoming to Wikistats on the Analytics board. · Dec 4 2017, 5:06 PM
JAllemandou updated the task description. · Jan 3 2018, 12:44 PM
fdans moved this task from Wikistats to Backlog (Later) on the Analytics board. · Mar 29 2018, 5:16 PM
Milimetric moved this task from Backlog (Later) to Incoming on the Analytics board. · Mar 9 2020, 4:27 PM
Milimetric added a subscriber: Milimetric.

When we triage this next, let's define/task our solution from Offsite Mallorca and assign/prioritize.

fdans triaged this task as High priority. · Mar 30 2020, 4:34 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.