
Implement first set of data quality checks
Closed, ResolvedPublic8 Estimated Story Points

Description

On T354275: [Spike] Figure out a mechanism to be able to do data quality checks, we figured out the details of a mechanism for doing data quality checks via the MariaDB analytics replicas.

In this task, we should:

  • Implement a PySpark job that applies the learnings from T354275.
  • Pick a first set of data quality checks as defined on T345385 and implement them.
  • Figure out the next steps after this work.

For completeness, here are the data quality checks that were implemented:

  • Calculate the last N revisions from a MariaDB replica (say, enwiki) that have had their visibility suppressed. Check on the data lake table whether these suppressions are reflected, and print a summary (example: 99.999% match).
  • Calculate the last N revisions from a MariaDB replica. Check on the data lake table whether these revisions' sha1 and length match. Print a summary.
  • Calculate the revision count of the last N page_ids that have been recently revised. Check on the data lake table whether the revision count matches. Print a summary.
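As a rough illustration of the second check above, the core comparison (independent of how the rows are fetched via Spark or JDBC) boils down to matching sha1/length pairs by rev_id and reporting the match percentage. This is a minimal sketch, not the actual job's code; the function name and row shapes are assumptions:

```python
# Hypothetical sketch of check #2: given the last N revisions from a
# MariaDB replica and the corresponding rows from the data lake table,
# compute the percentage whose sha1 and length both agree.

def match_summary(replica_rows, lake_rows):
    """replica_rows / lake_rows: dicts mapping rev_id -> (sha1, length)."""
    total = len(replica_rows)
    if total == 0:
        return 0.0
    matches = sum(
        1
        for rev_id, pair in replica_rows.items()
        if lake_rows.get(rev_id) == pair
    )
    return 100.0 * matches / total

replica = {1: ("abc", 10), 2: ("def", 20), 3: ("ghi", 30)}
lake = {1: ("abc", 10), 2: ("def", 21), 3: ("ghi", 30)}
print(f"{match_summary(replica, lake):.3f}% match")  # 66.667% match
```

In the real job the two sides would be Spark DataFrames joined on rev_id, but the summary semantics are the same.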

Event Timeline

xcollazo set the point value for this task to 8.Jan 25 2024, 4:58 PM

Copy-pasting from the MR, for completeness:

In this MR we implement a PySpark job that runs 3 data drift checks:

  • Calculate the last N revisions from a MariaDB replica (say, enwiki) that have had their visibility suppressed. Check on the data lake table whether these suppressions are reflected, and print a summary (example: 99.999% match).
  • Calculate the last N revisions from a MariaDB replica. Check on the data lake table whether these revisions' sha1 and length match. Print a summary.
  • Calculate the revision count of the last N page_ids that have been recently revised. Check on the data lake table whether the revision count matches. Print a summary.

The idea is to eventually use these metrics as a gating mechanism for cutting a dump. For now, we don't save these calculations; we just print them. Later, we will see whether it makes sense to keep these metrics on wmf_data_ops.data_quality_metrics.
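If we do end up persisting the summaries to wmf_data_ops.data_quality_metrics rather than just printing them, each check run would presumably produce one row. The column names below are purely illustrative assumptions, not an agreed schema:

```python
# Illustrative only: one possible shape for a row in
# wmf_data_ops.data_quality_metrics. All field names are assumptions.
from datetime import datetime, timezone

def metric_row(check_name, wiki_db, sample_size, match_pct):
    return {
        "check_name": check_name,       # e.g. "revision_sha1_length_match"
        "wiki_db": wiki_db,             # e.g. "enwiki"
        "sample_size": sample_size,     # the N revisions/pages sampled
        "match_percentage": match_pct,  # 0.0 .. 100.0
        "computed_at": datetime.now(timezone.utc).isoformat(),
    }

row = metric_row("revision_sha1_length_match", "enwiki", 1000, 99.999)
print(row["match_percentage"])  # 99.999
```

A gating mechanism could then simply compare match_percentage against a threshold before a dump is cut.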
Bug: T354761

I've added @gmodena and @JAllemandou as reviewers for the MR.

My intention with this task is to reach agreement on whether the metrics being calculated make sense.

Work to get these metrics hooked into the Data Quality Framework will be done separately.