Goals
- Table management (creation, evolution) should be its own Airflow task
- Refactor the Refine code to support both Hive and Iceberg
- Dynamic Airflow DAG to launch Refine jobs
Table Management Story
As a software engineer, I would like the capability to programmatically manage schemas of Refined tables.
As part of the Refine refactor, we should extract table management into a dedicated tool.
The tool should be able to:
- read from config the databases and schemas that should exist in the metastore (jsonschema vs calcite?)
  - db -> schema URI
- perform several operations:
  - validate table existence
  - validate coherence
  - migrate the schema
  - update table properties
- be executable at the CLI and via Airflow
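The operations listed above can be sketched as a planning step that compares the configured state of a table with what the metastore reports. This is a minimal illustration only, not the actual Refine implementation: `TableSpec`, `TableState`, `plan_actions`, and `IncoherentSchemaError` are hypothetical names, the action strings are placeholders rather than real Hive or Iceberg DDL, and "coherence" is assumed here to mean that existing column types match the configured ones.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class IncoherentSchemaError(Exception):
    """Raised when an existing column's type differs from the configured one."""


@dataclass(frozen=True)
class TableSpec:
    """Desired state of one table, as read from config (hypothetical shape)."""
    database: str
    table: str
    schema_uri: str              # the "db -> schema URI" mapping from config
    columns: Dict[str, str]      # column name -> type, resolved from schema_uri
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class TableState:
    """Current state of a table as reported by the metastore (hypothetical shape)."""
    columns: Dict[str, str]
    properties: Dict[str, str] = field(default_factory=dict)


def plan_actions(spec: TableSpec, current: Optional[TableState]) -> List[str]:
    """Return the ordered actions needed to bring a table in line with its spec."""
    name = f"{spec.database}.{spec.table}"

    # 1. Validate table existence: a missing table only needs to be created.
    if current is None:
        return [f"CREATE TABLE {name}"]

    # 2. Validate coherence (assumption: existing column types must not change).
    for col, typ in current.columns.items():
        if col in spec.columns and spec.columns[col] != typ:
            raise IncoherentSchemaError(
                f"{name}.{col}: metastore has {typ}, config wants {spec.columns[col]}"
            )

    actions: List[str] = []

    # 3. Migrate the schema: add columns present in the spec but not in the table.
    for col, typ in spec.columns.items():
        if col not in current.columns:
            actions.append(f"ADD COLUMN {name}.{col} {typ}")

    # 4. Update table properties that are missing or differ from the spec.
    for key, value in sorted(spec.properties.items()):
        if current.properties.get(key) != value:
            actions.append(f"SET PROPERTY {name} {key}={value}")

    return actions
```

Returning a plan instead of executing it directly would make a dry run straightforward: a CLI entry point or an Airflow task could print the plan, or pass it to an executor that talks to the metastore.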
Success criteria
- Refine Scala code refactor & tests
- Airflow DAG to schedule Refine jobs
- Manual dry run on currently refined tables, comparing outputs: T361502: [Refine Refactoring] Define and implement a automated testing / comparison tool for config store configured datasets
- Optimize parallelization from Airflow