
Implement digest-only mediawiki_history_reduced dataset in Spark
Open, Low, Public

Description

For the moment, the mediawiki_history_reduced dataset that we load into Druid to serve the WKS2 API is built using Hive, and combines revision events, page events, user events, and page and user daily and monthly digest events.
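
For illustration, here is a rough sketch of that shape in Spark (table and column names below are assumptions, not the actual refinery job): the Druid-bound dataset is essentially a union of the per-entity events with pre-computed daily and monthly digest rows.

```scala
// Illustrative sketch only: table/column names are hypothetical,
// not the real refinery Hive job.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reduced-union-sketch").getOrCreate()

// Shared set of columns kept for Druid ingestion (hypothetical).
val cols = Seq("project", "event_entity", "event_type", "event_timestamp", "user_type")

val history = spark.table("wmf.mediawiki_history")
val revisionEvents = history.where("event_entity = 'revision'").selectExpr(cols: _*)
val pageEvents     = history.where("event_entity = 'page'").selectExpr(cols: _*)
val userEvents     = history.where("event_entity = 'user'").selectExpr(cols: _*)

// Pre-aggregated digests, computed separately (hypothetical tables).
val dailyDigests   = spark.table("tmp.page_user_daily_digests").selectExpr(cols: _*)
val monthlyDigests = spark.table("tmp.page_user_monthly_digests").selectExpr(cols: _*)

// The dataset loaded into Druid is the union of all of the above.
val reduced = revisionEvents
  .unionByName(pageEvents)
  .unionByName(userEvents)
  .unionByName(dailyDigests)
  .unionByName(monthlyDigests)
```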

If we want to go for full digests (no more revision, page, and user events, only page and user daily and monthly digests), it means a big denormalization over many dimensions. This also means a lot of event duplication, and therefore a huge dataset, to say nothing of the Hive query needed to build it. A sketch of where the blow-up comes from follows.
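
To picture the duplication, here is a hedged sketch (dimension names are assumptions): serving arbitrary slice-and-dice from digests alone means pre-aggregating over every combination of dimensions, for example with a CUBE, so each event feeds many digest rows.

```scala
// Hedged sketch of the "full digest" blow-up; dimension names are assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("full-digest-sketch").getOrCreate()
import spark.implicits._

val events = spark.table("wmf.mediawiki_history")  // assumed source

// Pre-aggregating over every dimension combination: with n dimensions,
// each event contributes to up to 2^n digest rows (times daily vs monthly),
// hence the huge output dataset.
val fullDigests = events
  .cube($"project", $"user_type", $"page_namespace_is_content",
        to_date($"event_timestamp").as("day"))
  .agg(count(lit(1)).as("events"))
```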

The idea behind this task is to take advantage of our data being fairly denormalized by default: in many cases, an event is also its own aggregated value for a day over many dimensions. In Spark, we could bundle into an array of strings all the aggregated values covered by an event, without duplicating the rest of the event.
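
A minimal sketch of that idea, assuming hypothetical column names and bucket labels rather than the real mediawiki_history schema: keep each event row once and attach an array-of-strings column listing the daily and monthly aggregates it counts toward, which Druid can then treat as a multi-value dimension.

```scala
// Minimal sketch; column names and bucket labels are assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("digest-array-sketch").getOrCreate()
import spark.implicits._

val events = spark.table("wmf.mediawiki_history")  // assumed source

// One row per event, plus an array column of the digest "slices" it covers.
val withDigestBuckets = events.withColumn(
  "digest_buckets",
  array(
    concat(lit("daily|"),   to_date($"event_timestamp").cast("string")),
    concat(lit("monthly|"), date_format($"event_timestamp", "yyyy-MM"))
  )
)
// Only the array grows with the number of covered aggregates;
// the other event columns are never duplicated.
```

At query time, Druid would then filter and aggregate on membership in that array column instead of scanning duplicated digest rows.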

Event Timeline

Milimetric subscribed.

When we triage this next, let's define and create tasks for the solution from the Mallorca offsite, then assign and prioritize.

fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
odimitrijevic lowered the priority of this task from High to Low. Jan 6 2022, 5:09 AM
odimitrijevic subscribed.

@JAllemandou Is this still relevant? Is this something that can be marked as a good starter task provided there is benefit to doing it?

The reasoning behind this task is still relevant, but it isn't clear from the parent tasks: we wish/need to define, agree on, and implement a technical solution that lets us change the dimensions by which we slice and dice our edit data (Druid vs Cassandra, digests vs flat, tooling). This task is about implementing one such solution, which is rather complicated, so I wouldn't flag it as a good starter task in any case.