In T354275 ([Spike] Figure out a mechanism to be able to do data quality checks), we worked out the details of a mechanism to run data quality checks against the MariaDB analytics replicas.
In this task, we should:
- Implement a PySpark job that applies the learnings from T354275.
- Pick a first set of data quality checks as defined on T345385 and implement them.
- Figure out the next steps after this work.
For completeness, here are the data quality checks implemented:
- Fetch the last N revisions from a MariaDB replica (say, enwiki) whose visibility has been suppressed. Check whether these suppressions are reflected in the data lake table, and print a summary (example: 99.999% match).
- Fetch the last N revisions from a MariaDB replica. Check whether these revisions' sha1 and length match the data lake table. Print a summary.
- Calculate the revision count of the last N page_ids that have been recently revised. Check whether these revision counts match those in the data lake table. Print a summary.
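The comparison step shared by these checks can be sketched as a small helper, separate from the Spark plumbing. This is a minimal illustration, not the actual job: the function name and the dict-based inputs are hypothetical, and in the real PySpark job the replica and data lake sides would be read as DataFrames (e.g. via JDBC and Spark SQL) before being compared.

```python
def match_summary(replica_revs, lake_revs):
    """Compare per-revision fields from a MariaDB replica against the
    data lake copy and return a human-readable match summary.

    Both arguments map rev_id -> a tuple of fields to compare, e.g.
    (sha1, length) for the second check. All names are illustrative.
    """
    if not replica_revs:
        return "no revisions to check"
    # A revision matches only if the data lake has the same rev_id
    # with identical field values.
    matches = sum(
        1
        for rev_id, fields in replica_revs.items()
        if lake_revs.get(rev_id) == fields
    )
    pct = 100.0 * matches / len(replica_revs)
    return f"{matches}/{len(replica_revs)} revisions match ({pct:.3f}%)"
```

The same helper covers the revision-count check by passing `page_id -> (revision_count,)` mappings instead of per-revision fields.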