[NEEDS GROOMING] deequ repo should be instantiated from Wikimedia's DQ metrics store
Open, Needs Triage, Public

Description

To compute stateful metrics, that is, metrics that depend on historical information, we need to persist a deequ repository with analysis results to HDFS.

In T349763: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics, we implemented a SerDe that maps deequ repositories to the Wikimedia Data Quality model, which is persisted in Iceberg. No repository information is lost in this mapping; the content is only reformatted to fit our data model.

We should implement the reverse "iceberg to deequ" transformation that instantiates a repository from the Data Quality model.
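As a starting point for discussion, here is a rough sketch of the grouping step such a reverse transformation might perform. It is written in plain Python with no Spark or deequ dependency. The `MetricRow` fields, the `rows_to_repository` helper, and the output dictionary layout are all assumptions made for illustration; they are not the actual Data Quality model schema nor deequ's exact serialization format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRow:
    """Hypothetical flattened row, one per metric, as it might be read from Iceberg."""
    dataset_date: int  # epoch millis identifying the analysis run (assumed field)
    tags: tuple        # sorted (key, value) pairs attached to the run (assumed field)
    entity: str        # e.g. "Dataset" or "Column"
    instance: str      # e.g. "*" or a column name
    name: str          # metric name, e.g. "Size"
    value: float


def rows_to_repository(rows):
    """Group flat metric rows by run key into a list of repository-style results."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row.dataset_date, row.tags)].append(row)

    results = []
    # Sort by run key so the output order is deterministic.
    for (dataset_date, tags), run_rows in sorted(grouped.items()):
        results.append({
            "resultKey": {"dataSetDate": dataset_date, "tags": dict(tags)},
            "metrics": [
                {"entity": r.entity, "instance": r.instance,
                 "name": r.name, "value": r.value}
                for r in run_rows
            ],
        })
    return results


# Example: two runs of the same (hypothetical) table, three metrics in total.
example_rows = [
    MetricRow(1700000000000, (("table", "a"),), "Dataset", "*", "Size", 5.0),
    MetricRow(1700000000000, (("table", "a"),), "Column", "id", "Completeness", 1.0),
    MetricRow(1700086400000, (("table", "a"),), "Dataset", "*", "Size", 7.0),
]
repository = rows_to_repository(example_rows)
```

In a real implementation the grouped results would presumably be handed to deequ's own repository types rather than kept as dictionaries; the point here is only that each (date, tags) pair in the metrics store corresponds to one analysis result in the repository.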

This would remove the need to store deequ repository JSON blobs on HDFS.