
[Data Quality] Update data_quality schemas to be compatible with Iceberg tables
Open, Needs Triage · Public · 3 Estimated Story Points

Description

The data_quality schemas (metrics and alerts) contain a partition_id field representing the partition being analyzed.
In Iceberg, partitions still exist, but they no longer represent regular updates the way they do in the current tables.
We need to update the data_quality schemas to work with Iceberg.

Event Timeline

Ahoelzl renamed this task from Update data_quality schemas to be compatible with Iceberg tables to [Data Quality] Update data_quality schemas to be compatible with Iceberg tables. Feb 7 2024, 3:31 PM
Ahoelzl set the point value for this task to 1.
Ahoelzl edited projects, added Data-Engineering (Sprint 8); removed Data-Engineering.
Ahoelzl changed the point value for this task from 1 to 3. Feb 7 2024, 4:00 PM

Spoke a bit about this with @xcollazo.

There's an API available for accessing partition metadata, which can be utilized to generate IDs compatible with the current partition_id format:
https://iceberg.apache.org/docs/latest/spark-queries/#snapshots

However, if I understood correctly, programmatic access to this API requires Spark 3.3 or higher.

My understanding is that Iceberg encourages a shift in perspective towards snapshots and merges, rather than focusing solely on filesystem partitions. IMO the challenge lies in integrating Iceberg sensors into airflow-dags. The use of partition metadata in data quality (DQ) jobs serves as a reference point for the processed data (currently forwarded by the partition sensor).

> Spoke a bit about this with @xcollazo.
>
> There's an API available for accessing partition metadata, which can be utilized to generate IDs compatible with the current partition_id format:
> https://iceberg.apache.org/docs/latest/spark-queries/#snapshots

This one https://iceberg.apache.org/docs/latest/spark-queries/#partitions
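
Something along these lines is what I have in mind (untested sketch; the table name, partition columns, and the derived partition_id format are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

val spark = SparkSession.builder().appName("dq-iceberg-partition-id-sketch").getOrCreate()

// Each row of <table>.partitions describes one partition of the table's current snapshot.
val partitions = spark.sql(
  "SELECT partition, record_count, file_count FROM wmf.example_table.partitions")

// Derive a Hive-style identifier such as "year=2024/month=2" from the partition struct.
// The fields inside `partition` depend on the table's partition spec (assumed here).
val withId = partitions.withColumn(
  "partition_id",
  concat(
    lit("year="), col("partition.year").cast("string"),
    lit("/month="), col("partition.month").cast("string")))

withId.show(false)
```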

> However, if I understood correctly, programmatic access to this API requires Spark 3.3 or higher.

Spark SQL access is only available in 3.3+. Programmatic access via the Iceberg Java API, IIRC, can be done with our current Spark 3.1.
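
Roughly, the Java API route would look like this (untested; Spark3Util and the table name are assumptions about how we'd load the table from our catalog):

```scala
import org.apache.iceberg.spark.Spark3Util
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

val spark = SparkSession.builder().appName("dq-iceberg-snapshots-sketch").getOrCreate()

// Load the Iceberg table handle without going through the 3.3+ SQL metadata tables.
val table = Spark3Util.loadIcebergTable(spark, "wmf.example_table")

// Every commit (append / overwrite / merge) produces one snapshot; a snapshot id plus
// timestamp could serve as the reference DQ jobs currently get from partition_id.
table.snapshots().asScala.foreach { snap =>
  println(s"snapshot_id=${snap.snapshotId()} ts=${snap.timestampMillis()} op=${snap.operation()}")
}
```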

> My understanding is that Iceberg encourages a shift in perspective towards snapshots and merges, rather than focusing solely on filesystem partitions. IMO the challenge lies in integrating Iceberg sensors into airflow-dags. The use of partition metadata in data quality (DQ) jobs serves as a reference point for the processed data (currently forwarded by the partition sensor).

Their preferred approach is to leverage Iceberg table branches and do DQ via "Write-Audit-Publish". This article from Tabular.io is a good read. From that article:

Write the data – Commit the changes to a staging branch (instead of main);
Audit the data – Run data quality checks and other validation against the branch;
Publish the data – If checks pass, fast-forward the staging branch onto the main branch.
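
A rough sketch of that flow with Spark SQL against an Iceberg catalog (catalog, table, branch, and source names are made up, and branch writes plus the fast_forward procedure assume a recent Iceberg release):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dq-wap-sketch").getOrCreate()

// Write: commit the new data to a staging branch instead of main.
spark.sql("ALTER TABLE wmf.example_table CREATE BRANCH dq_audit")
spark.sql("ALTER TABLE wmf.example_table SET TBLPROPERTIES ('write.wap.enabled'='true')")
spark.conf.set("spark.wap.branch", "dq_audit") // route subsequent writes to the branch
spark.sql("INSERT INTO wmf.example_table SELECT * FROM new_events")

// Audit: run data quality checks against the branch before it is visible on main.
val rows = spark.sql("SELECT count(*) FROM wmf.example_table.branch_dq_audit")
  .first().getLong(0)
require(rows > 0, "DQ check failed: audited branch is empty")

// Publish: if the checks pass, fast-forward main onto the audited branch.
spark.sql("CALL spark_catalog.system.fast_forward('wmf.example_table', 'main', 'dq_audit')")
```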