The data_quality schemas (metrics and alerts) contain a partition_id field representing the partition being analyzed.
In Iceberg, partitions still exist, but they no longer map to regular updates the way they do in our current Hive tables.
We need to update the data_quality schemas to work with Iceberg.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | VirginiaPoundstone | T345988 [Epic] XML MediaWiki data dumps for right to fork
Open | | xcollazo | T358877 Dumps 2.0 - Production intermediate table milestone
Open | | gmodena | T356866 [Data Quality] Update data_quality schemas to be compatible with Iceberg tables
Event Timeline
Spoke a bit about this with @xcollazo.
There's an API available for accessing partition metadata, which can be utilized to generate IDs compatible with the current partition_id format:
https://iceberg.apache.org/docs/latest/spark-queries/#snapshots
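For example, the snapshots metadata table can be queried directly from Spark SQL. A minimal sketch, assuming Spark 3.3+ with the Iceberg runtime configured; the table name `wmf.example_table` is hypothetical:

```python
# Minimal sketch: list recent snapshots of an Iceberg table via Spark SQL.
# Assumes Spark 3.3+ with the Iceberg runtime on the classpath;
# "wmf.example_table" is a hypothetical table name.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM wmf.example_table.snapshots
    ORDER BY committed_at DESC
""").show(truncate=False)
```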
However, if I understood correctly, programmatic access to this API requires Spark 3.3 or higher.
My understanding is that Iceberg encourages a shift in perspective towards snapshots and merges, rather than focusing solely on filesystem partitions. IMO the challenge lies in integrating Iceberg sensors into airflow-dags. The use of partition metadata in data quality (DQ) jobs serves as a reference point for the processed data (currently forwarded by the partition sensor).
This one: https://iceberg.apache.org/docs/latest/spark-queries/#partitions (the partitions metadata table).
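For illustration, a partition_id-like value could be derived from that metadata table. A minimal sketch, assuming Spark 3.3+ and a hypothetical table `wmf.example_table`; the `k1=v1/k2=v2` ID format is an assumption and would need to match the actual data_quality convention:

```python
# Minimal sketch: derive a partition_id-like string from the Iceberg
# partitions metadata table. Assumes Spark 3.3+ (for SQL metadata table
# access, per the discussion above); "wmf.example_table" is hypothetical.
rows = spark.sql("SELECT partition FROM wmf.example_table.partitions").collect()
for row in rows:
    fields = row["partition"].asDict()  # partition field name -> value
    # Assumed "k1=v1/k2=v2" format; align with the real partition_id convention.
    partition_id = "/".join(f"{k}={v}" for k, v in fields.items())
    print(partition_id)
```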
> However, if I understood correctly, programmatic access to this API requires Spark 3.3 or higher.
Spark SQL access is only available in Spark 3.3+. Programmatic access via the Iceberg Java API can, IIRC, be done with our current Spark 3.1.
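An untested sketch of what that Java API access could look like from PySpark on 3.1, via the Py4J gateway. This assumes a path-based (HadoopTables) table; catalog-backed tables would go through HiveCatalog instead, and the table location is hypothetical:

```python
# Untested sketch: reach the Iceberg Java API through PySpark's JVM gateway
# on Spark 3.1, where the SQL metadata tables are not available.
# Assumes a path-based table; "hdfs:///wmf/data/example_table" is hypothetical.
jvm = spark._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
tables = jvm.org.apache.iceberg.hadoop.HadoopTables(hadoop_conf)
table = tables.load("hdfs:///wmf/data/example_table")
snapshot = table.currentSnapshot()
print(snapshot.snapshotId(), snapshot.timestampMillis())
```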
> My understanding is that Iceberg encourages a shift in perspective towards snapshots and merges, rather than focusing solely on filesystem partitions. IMO the challenge lies in integrating Iceberg sensors into airflow-dags. The use of partition metadata in data quality (DQ) jobs serves as a reference point for the processed data (currently forwarded by the partition sensor).
Their preferred approach is to leverage Iceberg table branches and do DQ via "Write-Audit-Publish". This article from Tabular.io is a good read. From that article:
> 1. Write the data – Commit the changes to a staging branch (instead of main);
> 2. Audit the data – Run data quality checks and other validation against the branch;
> 3. Publish the data – If checks pass, fast-forward the staging branch onto the main branch.
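A minimal sketch of what that flow could look like with Iceberg branches, assuming a Spark/Iceberg combination with branch support (Spark 3.3+ / Iceberg 1.2+, i.e. newer than our current 3.1); table, branch, column, and catalog names below are hypothetical:

```python
# 1. Write: stage changes on an audit branch instead of main.
spark.sql("ALTER TABLE db.example_table CREATE BRANCH IF NOT EXISTS audit_branch")
spark.conf.set("spark.wap.branch", "audit_branch")
new_data.writeTo("db.example_table").append()  # new_data: a DataFrame to land

# 2. Audit: run data quality checks against the branch.
nulls = spark.sql(
    "SELECT count(*) AS c FROM db.example_table.branch_audit_branch "
    "WHERE page_id IS NULL"  # example check; page_id is a hypothetical column
).first()["c"]
assert nulls == 0, "DQ check failed; do not publish"

# 3. Publish: fast-forward main onto the audited branch if checks pass.
spark.sql("CALL spark_catalog.system.fast_forward('db.example_table', 'main', 'audit_branch')")
```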