In T354761, we came up with a first set of data quality metrics, focused on the issue of data drift between our data lake intermediate table, `wmf_dumps.wikitext_raw_rc2`, and the Analytics replicas as the source of truth.
Then, in T368753, we replaced that mechanism with a job that detects inconsistencies and saves the results to `wmf_dumps.wikitext_inconsistent_rows_rc1`. See table DDL here.
In this task, we want to figure out a way to expose data quality metrics for `wmf_dumps.wikitext_raw`.
- Figure out which metrics are the most interesting to track for `wmf_dumps.wikitext_raw` (a sketch of how some of these could be computed follows this list). Some candidates:
  - What percentage of revisions from the last 24 hours have had inconsistencies for a specific wiki, say, enwiki?
  - What percentage of all revisions (over the full history) have inconsistencies for, say, enwiki?
  - Considering that revision deletes have been shown to be an issue, perhaps we should also include the number of revision deletes from the last 24 hours that have not yet been applied?
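As a strawman, the first two metrics reduce to a join between the main table and the inconsistent-rows table. The sketch below assumes column names (`wiki_db`, `revision_id`, `revision_timestamp`) that may not match the actual DDLs; dropping the timestamp filter would yield the all-revisions variant.

```python
# Hedged sketch: percentage of enwiki revisions from the last 24 hours that
# have a recorded inconsistency. Column names are assumptions, not the DDL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pct_inconsistent_24h = spark.sql("""
    SELECT
        r.wiki_db,
        100.0 * COUNT(DISTINCT i.revision_id) / COUNT(DISTINCT r.revision_id)
            AS pct_inconsistent
    FROM wmf_dumps.wikitext_raw_rc2 r
    LEFT JOIN wmf_dumps.wikitext_inconsistent_rows_rc1 i
        ON  i.wiki_db = r.wiki_db
        AND i.revision_id = r.revision_id
    WHERE r.wiki_db = 'enwiki'
      AND r.revision_timestamp >= current_timestamp() - INTERVAL 24 HOURS
    GROUP BY r.wiki_db
""")
pct_inconsistent_24h.show()
```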
- Does it make sense to expose these metrics through the Data Quality Framework that the folks from Data Engineering are putting together? If yes, use it; if not, document why, and consider whether we can generalize our solution so that we don't end up with N different data quality approaches.
  - Go over the documentation: https://wikitech.wikimedia.org/wiki/Data_Engineering/Data_Quality
  - Play with the Python bindings and see if they work; a strawman of what a declarative check could look like follows this list. [Potential to pair up with @gmodena's T353940]
  - Consider the comments from @gmodena at https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/24#note_70208
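We have not yet verified what the framework's Python bindings actually expose, so purely as a strawman for the ergonomics we would want, here is a declarative check written against pydeequ (the Python bindings for Amazon Deequ). This is not necessarily what the Data Quality Framework's API looks like; the table and column names are assumptions, and running it requires the Deequ jar on the Spark classpath.

```python
# Strawman only: a Deequ-style declarative check via pydeequ. Whether the
# Data Quality Framework's bindings resemble this is exactly what we need
# to find out; table and column names are assumptions.
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = SparkSession.builder.getOrCreate()

# Scope to one wiki so that revision_id uniqueness is well-defined.
df = spark.table("wmf_dumps.wikitext_raw_rc2").where("wiki_db = 'enwiki'")

check = (
    Check(spark, CheckLevel.Warning, "wikitext_raw sanity checks")
    .isComplete("revision_id")  # no NULL revision ids
    .isUnique("revision_id")    # no duplicated revisions within the wiki
)

result = VerificationSuite(spark).onData(df).addCheck(check).run()
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```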
- Implement a dashboard in Superset with the metrics.
  - Maybe with Presto hitting `wmf_data_ops.data_quality_metrics`, or perhaps `wmf_dumps.wikitext_inconsistent_rows_rc1`? A sketch of the kind of query involved follows.
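Whichever table backs it, the Superset chart boils down to a small aggregate query. A hypothetical example over the inconsistent-rows table (the `dt` timestamp column is a guess), shown as the kind of Presto SQL one would paste into Superset's SQL Lab:

```python
# Hypothetical Presto SQL for a Superset chart: inconsistent rows per wiki
# per day. In practice this would live in Superset's SQL Lab / a virtual
# dataset rather than in Python; inlined here as a string for review.
SUPERSET_CHART_SQL = """
    SELECT
        wiki_db,
        date_trunc('day', dt) AS day,  -- 'dt' timestamp column is assumed
        COUNT(*) AS inconsistent_rows
    FROM wmf_dumps.wikitext_inconsistent_rows_rc1
    GROUP BY 1, 2
    ORDER BY day DESC
"""
```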
- Implement alerting when the quality metrics breach a certain threshold.
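As a first iteration, alerting could be a scheduled job that recomputes the headline metric and fails loudly when it crosses the threshold, piggybacking on the existing alerting for failed jobs. A minimal sketch, with a placeholder 1% threshold and assumed column names:

```python
# Minimal alerting sketch: raise when the 24h inconsistency percentage
# breaches a placeholder threshold, so the scheduler (e.g. Airflow) marks
# the task failed and the usual failure alerting fires.
from pyspark.sql import SparkSession

INCONSISTENCY_THRESHOLD_PCT = 1.0  # placeholder until we pick a real value

def check_inconsistency_threshold(spark: SparkSession) -> None:
    pct = spark.sql("""
        SELECT 100.0 * COUNT(DISTINCT i.revision_id)
                     / COUNT(DISTINCT r.revision_id) AS pct
        FROM wmf_dumps.wikitext_raw_rc2 r
        LEFT JOIN wmf_dumps.wikitext_inconsistent_rows_rc1 i
            ON i.wiki_db = r.wiki_db AND i.revision_id = r.revision_id
        WHERE r.revision_timestamp >= current_timestamp() - INTERVAL 24 HOURS
    """).head().pct
    # pct is NULL (None) when no revisions landed in the window.
    if pct is not None and pct > INCONSISTENCY_THRESHOLD_PCT:
        raise RuntimeError(
            f"{pct:.2f}% of revisions in the last 24h are inconsistent "
            f"(threshold: {INCONSISTENCY_THRESHOLD_PCT}%)"
        )
```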