
Use hive metastore when registering views
Open, Needs Triage, Public

Description

MediaWiki History jobs register their views in a way that bypasses the Hive metastore: they read the underlying data directly and superimpose their own schema. If the schema of the underlying data changes, this can cause problems. See, for example, the patches on the parent task, T350489.

The solution would be to register the views with select statements that explicitly read the data through the schema defined in the Hive metastore. The jobs would then keep working when backwards-compatible schema changes are made to the underlying tables.
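As a rough illustration of the idea, here is a minimal Python sketch, not the job's actual code. It builds a Spark SQL statement that registers a temporary view by selecting an explicit column list from a Hive table, so the metastore schema is the source of truth and columns added later are simply ignored. The table name, column names, and the `snapshot` partition filter are hypothetical and chosen only for the example; `register_view` assumes an existing Hive-enabled SparkSession is passed in.

```python
from typing import Sequence


def metastore_view_sql(
    view_name: str, table: str, columns: Sequence[str], snapshot: str
) -> str:
    """Build a statement that registers a temporary view over a Hive table.

    The column list is explicit, so columns added to the table later
    are not picked up and cannot break the job. Inputs are not sanitized;
    this is a sketch, not production code.
    """
    column_list = ", ".join(f"`{c}`" for c in columns)
    return (
        f"CREATE OR REPLACE TEMPORARY VIEW {view_name} AS "
        f"SELECT {column_list} FROM {table} "
        f"WHERE snapshot = '{snapshot}'"
    )


def register_view(
    spark, view_name: str, table: str, columns: Sequence[str], snapshot: str
) -> None:
    """Run the statement on an existing, Hive-enabled SparkSession."""
    spark.sql(metastore_view_sql(view_name, table, columns, snapshot))


# Hypothetical table and column names, for illustration only.
print(
    metastore_view_sql(
        "revision_view", "example_db.revision", ["rev_id", "rev_page"], "2023-10"
    )
)
```

Because the view is defined against the table rather than against file paths, Spark resolves column names and types from the metastore when the view is used, instead of relying on a schema hard-coded in the job.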

Event Timeline

The mediawiki-history job was built at a time when our Spark and Hive integration was poor because of a Hive version mismatch. We worked around the issue by reading the Parquet files directly instead of going through their Hive tables. We should now be able to read from Hive and avoid the problems caused by schema changes.