
Flatten columns on hourly target table. Tune partitioning strategy.
Closed, Duplicate · Public

Description

In T335860, we implemented a PySpark job that runs a MERGE INTO statement to transform event data into a table that will eventually hold the full MediaWiki revision history.
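
For context, a minimal sketch of the kind of MERGE INTO upsert such a job runs. All table and column names below (event.revision_events, wmf.revision_history, rev_id, and the hourly partition filter) are hypothetical placeholders, not the actual schema from T335860:

```python
# Minimal sketch of an hourly MERGE INTO upsert in PySpark, assuming an
# Iceberg/Delta-style target table. Table and column names are
# hypothetical placeholders, not the real schema from T335860.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("revision_history_merge").getOrCreate()

spark.sql("""
    MERGE INTO wmf.revision_history AS target
    USING (
        SELECT rev_id, page_id, rev_timestamp, performer
        FROM event.revision_events
        WHERE hour = '2023-06-01T00'      -- hypothetical hourly partition
    ) AS source
    ON target.rev_id = source.rev_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```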

Since we don't yet fully understand the downstream consumers of that table, we deferred optimizing its schema.

In this task we should identify those downstream consumers and flatten the columns and/or tune the schema and partitioning for their benefit; a sketch of what that could look like follows.
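
As a starting point, a hedged sketch of flattening a nested struct column and choosing a partition column on write. The nested `performer` struct, the `wiki_db` partition column, and the output table name are assumptions for illustration, not decisions this task has made:

```python
# Hedged sketch: promote nested struct fields to top-level columns and
# repartition the output. The `performer` struct and the `wiki_db`
# partition column are assumptions about the schema and query patterns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("flatten_revision_history").getOrCreate()

df = spark.table("wmf.revision_history")  # hypothetical source table

flattened = (
    df
    # Flatten nested fields so consumers can query them without
    # struct-dot notation.
    .withColumn("performer_user_id", F.col("performer.user_id"))
    .withColumn("performer_user_text", F.col("performer.user_text"))
    .drop("performer")
)

(
    flattened.write
    .mode("overwrite")
    # Partition by a low-cardinality column consumers commonly filter on;
    # wiki_db is an assumption about typical access patterns.
    .partitionBy("wiki_db")
    .saveAsTable("wmf.revision_history_flat")
)
```

Which fields to flatten and which column(s) to partition by should follow from the actual downstream query patterns once they are understood.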