[DQ][NEEDS GROOMING] Add support for deequ's RowLevelSchemaValidator in refinery
Open, Needs Triage | Public | 3 Estimated Story Points

Description

In T354566 ([Data Quality] [SPIKE] Can we migrate the anomaly detection job to DeeQu checks), we developed a PoC to report metrics at row level.

This was useful in the context of anomaly detection, where we wanted to report metrics at country granularity.

We have an increasing need for this kind of instrumentation across pipelines (actor signature, MediaWiki history).

Deequ offers a RowLevelSchemaValidator for row-level filtering that could help support these use cases.

We need to add alert and metrics constructors that wrap this API in our Iceberg data quality table exporters.

Event Timeline

gmodena set the point value for this task to 3. (Fri, Apr 19, 6:33 AM)

Based on the MediaWiki History checker use case, the RowLevelSchemaValidator has some limitations that may prevent us from using it:

  1. It only provides String, Integer, Decimal and Timestamp column definitions. We need other column definitions, such as Double and Map.
  2. Even if we decide to use the Decimal column definition, it does not support setting maximum and minimum values for the column. In the MediaWiki History checker, the growth column needs both a max and a min value.

I suggest we try to wrap this deequ API and add the functionality that suits our needs.
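
To make the two gaps above concrete, here is a rough sketch of the behaviour such a wrapper would need: a Double-like column definition with optional inclusive min/max bounds, and a validator that splits rows into valid and invalid sets. It is written in plain Python, not against deequ or Spark, so it only illustrates the semantics. All names (`DoubleColumn`, `validate_rows`), the growth bounds and the sample rows are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class DoubleColumn:
    """Hypothetical column definition: a float with optional inclusive bounds."""
    name: str
    nullable: bool = True
    min_value: Optional[float] = None
    max_value: Optional[float] = None

    def is_valid(self, row: Row) -> bool:
        value = row.get(self.name)
        if value is None:
            return self.nullable
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if value != value:  # NaN never satisfies a bound
            return False
        if self.min_value is not None and value < self.min_value:
            return False
        if self.max_value is not None and value > self.max_value:
            return False
        return True


def validate_rows(
    rows: Iterable[Row], columns: List[DoubleColumn]
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (valid, invalid); a row is valid if every column passes."""
    valid: List[Row] = []
    invalid: List[Row] = []
    for row in rows:
        target = valid if all(c.is_valid(row) for c in columns) else invalid
        target.append(row)
    return valid, invalid


# Made-up sample rows and bounds, for illustration only.
rows = [
    {"wiki": "enwiki", "growth": 0.02},
    {"wiki": "dewiki", "growth": 4.5},    # above the assumed max
    {"wiki": "frwiki", "growth": None},   # null in a non-nullable column
]
growth = DoubleColumn("growth", nullable=False, min_value=-1.0, max_value=1.0)
valid, invalid = validate_rows(rows, [growth])
```

In a real implementation the same rules would presumably be expressed as Spark column predicates, so that the valid and invalid rows stay as DataFrames and the counts can feed the existing metrics and alerts exporters.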