
Implement first set of data quality checks
Closed, ResolvedPublic8 Estimated Story Points

Description

On T354275: [Spike] Figure out a mechanism to be able to do data quality checks, we figured out the details of a mechanism for doing data quality checks via the MariaDB analytics replicas.

In this task, we should:

  • Implement a PySpark job that applies the learnings from T354275.
  • Pick a first set of data quality checks as defined on T345385 and implement them.
  • Figure out the next steps after this work.

For completeness, here are the data quality checks that were implemented:

  • Calculate the last N revisions from a MariaDB replica (say, enwiki) that have had their visibility suppressed. Check on the data lake table whether these suppressions are reflected, and print a summary (example: 99.999% match).
  • Calculate the last N revisions from a MariaDB replica. Check on the data lake table whether these revisions' sha1 and length match. Print a summary.
  • Calculate the revision count of the last N page_ids that have been recently revised. Check on the data lake table whether the revision count matches. Print a summary.
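As a rough illustration of the second check above, the core comparison (independent of how the rows are fetched via Spark or JDBC) boils down to matching sha1/length pairs by rev_id and reporting the match percentage. This is a minimal sketch, not the actual job's code; the function name and row shapes are assumptions:

```python
# Hypothetical sketch of check #2: given the last N revisions from a
# MariaDB replica and the corresponding rows from the data lake table,
# compute the percentage whose sha1 and length both agree.

def match_summary(replica_rows, lake_rows):
    """replica_rows / lake_rows: dicts mapping rev_id -> (sha1, length)."""
    total = len(replica_rows)
    if total == 0:
        return 0.0
    matches = sum(
        1
        for rev_id, pair in replica_rows.items()
        if lake_rows.get(rev_id) == pair
    )
    return 100.0 * matches / total

replica = {1: ("abc", 10), 2: ("def", 20), 3: ("ghi", 30)}
lake = {1: ("abc", 10), 2: ("def", 21), 3: ("ghi", 30)}
print(f"{match_summary(replica, lake):.3f}% match")  # 66.667% match
```

In the real job the two sides would be Spark DataFrames joined on rev_id, but the summary semantics are the same.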

Event Timeline

xcollazo set the point value for this task to 8.Jan 25 2024, 4:58 PM

Copy-pasting from the MR, for completeness:

In this MR we implement a PySpark job that runs 3 data drift checks:

  • Calculate the last N revisions from a MariaDB replica (say, enwiki) that have had their visibility suppressed. Check on the data lake table whether these suppressions are reflected, and print a summary (example: 99.999% match).
  • Calculate the last N revisions from a MariaDB replica. Check on the data lake table whether these revisions' sha1 and length match. Print a summary.
  • Calculate the revision count of the last N page_ids that have been recently revised. Check on the data lake table whether the revision count matches. Print a summary.

The idea is to eventually use these metrics as a gating mechanism for cutting a dump. For now, we don't save these calculations; we just print them. Later, we will see whether it makes sense to keep these metrics on wmf_data_ops.data_quality_metrics.
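If we do end up persisting the summaries to wmf_data_ops.data_quality_metrics rather than just printing them, each check run would presumably produce one row. The column names below are purely illustrative assumptions, not an agreed schema:

```python
# Illustrative only: one possible shape for a row in
# wmf_data_ops.data_quality_metrics. All field names are assumptions.
from datetime import datetime, timezone

def metric_row(check_name, wiki_db, sample_size, match_pct):
    return {
        "check_name": check_name,       # e.g. "revision_sha1_length_match"
        "wiki_db": wiki_db,             # e.g. "enwiki"
        "sample_size": sample_size,     # the N revisions/pages sampled
        "match_percentage": match_pct,  # 0.0 .. 100.0
        "computed_at": datetime.now(timezone.utc).isoformat(),
    }

row = metric_row("revision_sha1_length_match", "enwiki", 1000, 99.999)
print(row["match_percentage"])  # 99.999
```

A gating mechanism could then simply compare match_percentage against a threshold before a dump is cut.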
Bug: T354761

I've added @gmodena and @JAllemandou as reviewers for the MR.

My intention with this task is to reach agreement on whether the metrics being calculated make sense.

Work to get these metrics hooked into the Data Quality Framework will be done separately.