On T335860, we implemented a PySpark job that runs a MERGE INTO statement to transform event data into a table that will eventually hold the full MediaWiki revision history.
Since we don't yet fully understand the downstream consumers of that table, we deferred optimizing its schema.
In this task we should identify those downstream consumers and flatten and/or tune the table's schema and partitioning to suit their access patterns. A rough sketch of the kind of change this could mean is below.
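As a starting point for discussion, here is a minimal sketch of what flattening and repartitioning might look like, assuming a catalog that supports the DataFrameWriterV2 API (e.g. Iceberg). All table and column names below (`event_sanitized.mediawiki_revision_history_raw`, `wmf.mediawiki_revision_history_flat`, the nested `revision.*` / `performer.*` fields) are hypothetical placeholders, not the actual schema, and the partitioning choice would need to be validated against real consumer queries.

```python
# Sketch only: names and partitioning are assumptions to be confirmed with
# downstream consumers, not the final design.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten_revision_history").getOrCreate()

# Hypothetical source table produced by the existing MERGE INTO job.
events = spark.table("event_sanitized.mediawiki_revision_history_raw")

# Flatten nested structs (e.g. revision.*, performer.*) into top-level columns
# so consumers can query them without navigating struct fields.
flattened = events.select(
    F.col("wiki_db"),
    F.col("revision.id").alias("revision_id"),
    F.col("revision.parent_id").alias("revision_parent_id"),
    F.col("performer.user_id").alias("user_id"),
    F.col("performer.user_text").alias("user_text"),
    F.col("revision.timestamp").cast("timestamp").alias("revision_timestamp"),
)

# Partition by wiki and by month of the revision timestamp, on the assumption
# that per-wiki, time-bounded scans are the common access pattern.
(
    flattened
    .withColumn("revision_month", F.date_format("revision_timestamp", "yyyy-MM"))
    .writeTo("wmf.mediawiki_revision_history_flat")
    .partitionedBy(F.col("wiki_db"), F.col("revision_month"))
    .createOrReplace()
)
```

Whether we flatten fully, partially, or instead rely on partition transforms and column pruning is exactly the trade-off this task should settle once we know who reads the table and how.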