
[Refine Simplification] Remove Schema Merging in Refine Process by Enforcing Backward Compatibility
Open, Needs Triage, Public

Description

Should be done after T381072.

The current Refine process involves merging the Hive table schema with the schema from the schema repository. While this was historically necessary to accommodate schema evolution and non-backward-compatible changes, it introduces complexity into the code and potential edge cases.

With the new enforcement of schema backward compatibility, this merging step is now redundant. By aligning the table schema strictly with the schema repository, we can simplify the Refine process, improve maintainability, and reduce the potential for schema mismatches.

Proposed Changes:

  1. Remove the convertToSchema/merge step from the Refine process.
  2. Ensure the Hive table schema strictly matches the schema repository definition, including the order of subfields.
  3. Gradually evolve and align table schemas to match the schema repository, for a smooth transition and testing.
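
To make "strictly matches, including the order of subfields" concrete, here is a minimal sketch of an order-sensitive comparison. It is not Refine code: the schema model (a struct as an ordered list of `(name, type)` pairs) and the function name `find_mismatches` are assumptions made for illustration only.

```python
from typing import List, Tuple, Union

# Hypothetical schema model: a struct is an ordered list of (name, type) pairs,
# where a type is either a primitive type name or a nested struct.
FieldType = Union[str, "Struct"]
Struct = List[Tuple[str, FieldType]]


def find_mismatches(table: Struct, repo: Struct, path: str = "") -> List[str]:
    """Return human-readable differences; an empty list means a strict match."""
    problems: List[str] = []
    table_names = [name for name, _ in table]
    repo_names = [name for name, _ in repo]
    where = path.rstrip(".") or "<root>"

    for name in repo_names:
        if name not in table_names:
            problems.append(f"{path}{name}: missing from table")
    for name in table_names:
        if name not in repo_names:
            problems.append(f"{path}{name}: not in schema repository")

    # Same fields but in a different sequence: an order-only mismatch.
    if sorted(table_names) == sorted(repo_names) and table_names != repo_names:
        problems.append(
            f"{where}: field order differs, table={table_names} repo={repo_names}"
        )

    # Compare the types of fields present on both sides, recursing into structs.
    table_types = dict(table)
    for name, repo_type in repo:
        if name not in table_types:
            continue
        table_type = table_types[name]
        if isinstance(table_type, list) and isinstance(repo_type, list):
            problems.extend(find_mismatches(table_type, repo_type, f"{path}{name}."))
        elif table_type != repo_type:
            problems.append(f"{path}{name}: type {table_type!r} != {repo_type!r}")
    return problems


# Example: the nested "meta" struct has the same subfields in a different order.
table = [("meta", [("id", "string"), ("dt", "string")]), ("page_id", "bigint")]
repo = [("meta", [("dt", "string"), ("id", "string")]), ("page_id", "bigint")]
for problem in find_mismatches(table, repo):
    print(problem)
```

A merge-based approach would tolerate the reordered `meta` subfields; a strict check like this one reports them, which is the behavior item 2 asks for.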

Benefits:

• Reduced complexity in the Refine codebase.
• Increased reliability by eliminating the potential for mismatched schemas.

Risks and Mitigation:

• Risk: Legacy tables with schema mismatches could fail during the transition.
• Mitigation: Incrementally align existing table schemas with the schema repository before rolling out the change.

Next Steps:

  1. Reuse previously developed testing scripts to validate the alignment of table schemas with the schema repository.
  2. Perform a dry run to identify and address any existing mismatches.
  3. Roll out the updated Refine process with cautious monitoring.
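
The dry run in step 2 could, for example, group tables by the kind of alignment work they need before anything is changed. The sketch below is only an illustration: the input shape (top-level column names per table) and the category names `aligned`, `add_columns`, `reorder`, and `manual_review` are hypothetical and not part of the existing testing scripts.

```python
from typing import Dict, List, Tuple

# Hypothetical input: for each table, the top-level column names as found in
# the Hive table and as declared in the schema repository.
Columns = List[str]


def classify(table_cols: Columns, repo_cols: Columns) -> str:
    """Name the kind of work needed to align one table with the repository."""
    if table_cols == repo_cols:
        return "aligned"
    # The table is a strict prefix of the repository schema: appending the
    # missing columns at the end would align it.
    if repo_cols[: len(table_cols)] == table_cols:
        return "add_columns"
    # Same columns, different sequence.
    if sorted(table_cols) == sorted(repo_cols):
        return "reorder"
    return "manual_review"


def dry_run(tables: Dict[str, Tuple[Columns, Columns]]) -> Dict[str, List[str]]:
    """Group table names by required alignment work. Nothing is modified."""
    report: Dict[str, List[str]] = {
        "aligned": [],
        "add_columns": [],
        "reorder": [],
        "manual_review": [],
    }
    for name in sorted(tables):
        table_cols, repo_cols = tables[name]
        report[classify(table_cols, repo_cols)].append(name)
    return report


# Example with made-up table names.
example = {
    "table_a": (["x", "y"], ["x", "y"]),
    "table_b": (["y", "x"], ["x", "y"]),
    "table_c": (["x"], ["x", "y"]),
    "table_d": (["x", "z"], ["x", "y"]),
}
for category, names in dry_run(example).items():
    print(category, names)
```

Tables that land in `manual_review` are the ones most likely to hit the legacy-table risk described above and would need attention before the rollout.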
