
[Data Quality] Update data_quality schemas to be compatible with Iceberg tables
Open, Needs Triage · Public · 3 Estimated Story Points

Description

The data_quality schemas (metrics and alerts) contain a partition_id field representing the partition being analyzed.
In Iceberg, partitions still exist, but they no longer represent regular updates the way they do in the current tables.
We need to update the data_quality schemas to work with Iceberg.

Event Timeline

Ahoelzl renamed this task from Update data_quality schemas to be compatible with Iceberg tables to [Data Quality] Update data_quality schemas to be compatible with Iceberg tables. Feb 7 2024, 3:31 PM
Ahoelzl set the point value for this task to 1.
Ahoelzl edited projects, added Data-Engineering (Sprint 8); removed Data-Engineering.
Ahoelzl changed the point value for this task from 1 to 3. Feb 7 2024, 4:00 PM

Spoke a bit about this with @xcollazo.

There's an API available for accessing partition metadata, which can be utilized to generate IDs compatible with the current partition_id format:
https://iceberg.apache.org/docs/latest/spark-queries/#snapshots

However, if I understood correctly, programmatic access to this API requires Spark 3.3 or higher.

My understanding is that Iceberg encourages a shift in perspective towards snapshots and merges, rather than focusing solely on filesystem partitions. IMO the challenge lies in integrating Iceberg sensors into airflow-dags. The use of partition metadata in data quality (DQ) jobs serves as a reference point for the processed data (currently forwarded by the partition sensor).

> Spoke a bit about this with @xcollazo.
>
> There's an API available for accessing partition metadata, which can be utilized to generate IDs compatible with the current partition_id format:
> https://iceberg.apache.org/docs/latest/spark-queries/#snapshots

This one https://iceberg.apache.org/docs/latest/spark-queries/#partitions
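
Something along these lines is what I have in mind (untested sketch; the table name, partition columns, and the derived partition_id format are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

val spark = SparkSession.builder().appName("dq-iceberg-partition-id-sketch").getOrCreate()

// Each row of <table>.partitions describes one partition of the table's current snapshot.
val partitions = spark.sql(
  "SELECT partition, record_count, file_count FROM wmf.example_table.partitions")

// Derive a Hive-style identifier such as "year=2024/month=2" from the partition struct.
// The fields inside `partition` depend on the table's partition spec (assumed here).
val withId = partitions.withColumn(
  "partition_id",
  concat(
    lit("year="), col("partition.year").cast("string"),
    lit("/month="), col("partition.month").cast("string")))

withId.show(false)
```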

> However, if I understood correctly, programmatic access to this API requires Spark 3.3 or higher.

Spark SQL access is only available in 3.3+. Programmatic access via the Iceberg Java API, IIRC, can be done with our current Spark 3.1.
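
Roughly, the Java API route would look like this (untested; Spark3Util and the table name are assumptions about how we'd load the table from our catalog):

```scala
import org.apache.iceberg.spark.Spark3Util
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

val spark = SparkSession.builder().appName("dq-iceberg-snapshots-sketch").getOrCreate()

// Load the Iceberg table handle without going through the 3.3+ SQL metadata tables.
val table = Spark3Util.loadIcebergTable(spark, "wmf.example_table")

// Every commit (append / overwrite / merge) produces one snapshot; a snapshot id plus
// timestamp could serve as the reference DQ jobs currently get from partition_id.
table.snapshots().asScala.foreach { snap =>
  println(s"snapshot_id=${snap.snapshotId()} ts=${snap.timestampMillis()} op=${snap.operation()}")
}
```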

> My understanding is that Iceberg encourages a shift in perspective towards snapshots and merges, rather than focusing solely on filesystem partitions. IMO the challenge lies in integrating Iceberg sensors into airflow-dags. The use of partition metadata in data quality (DQ) jobs serves as a reference point for the processed data (currently forwarded by the partition sensor).

Their preferred approach is to leverage Iceberg table branches and do DQ via "Write-Audit-Publish". This article from Tabular.io is a good read. From that article:

Write the data – Commit the changes to a staging branch (instead of main);
Audit the data – Run data quality checks and other validation against the branch;
Publish the data – If checks pass, fast-forward the staging branch onto the main branch.
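
A rough sketch of that flow with Spark SQL against an Iceberg catalog (catalog, table, branch, and source names are made up, and branch writes plus the fast_forward procedure assume a recent Iceberg release):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dq-wap-sketch").getOrCreate()

// Write: commit the new data to a staging branch instead of main.
spark.sql("ALTER TABLE wmf.example_table CREATE BRANCH dq_audit")
spark.sql("ALTER TABLE wmf.example_table SET TBLPROPERTIES ('write.wap.enabled'='true')")
spark.conf.set("spark.wap.branch", "dq_audit") // route subsequent writes to the branch
spark.sql("INSERT INTO wmf.example_table SELECT * FROM new_events")

// Audit: run data quality checks against the branch before it is visible on main.
val rows = spark.sql("SELECT count(*) FROM wmf.example_table.branch_dq_audit")
  .first().getLong(0)
require(rows > 0, "DQ check failed: audited branch is empty")

// Publish: if the checks pass, fast-forward main onto the audited branch.
spark.sql("CALL spark_catalog.system.fast_forward('wmf.example_table', 'main', 'dq_audit')")
```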