In T354761, we came up with a first set of data quality metrics. We focused on the issue of data drift when comparing our data lake intermediate table, wmf_dumps.wikitext_raw_rc2, with the Analytics replicas as a source of truth.
In this task, we want to expose these metrics through the Data Quality Framework that the folks from Data Engineering are putting together.
- Go over documentation: https://wikitech.wikimedia.org/wiki/Data_Engineering/Data_Quality
- Play with the Python bindings and check whether they work. [Potential to pair up with @gmodena's T353940]
- Consider the comments from @gmodena at https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/24#note_70208
- Implement the solution, including a dashboard where the metrics can be perused.
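
For illustration only, here is a minimal sketch of the kind of drift metric that could be exposed, assuming drift is measured by comparing sets of revision IDs from the source of truth and from the data lake table. The names `DriftMetrics` and `compute_drift` are hypothetical. This is neither the set of metrics defined in T354761 nor the Data Quality Framework API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftMetrics:
    """Hypothetical container for drift counts between two sources."""
    missing_in_lake: int  # IDs in the source of truth but not in the data lake table
    extra_in_lake: int    # IDs in the data lake table but not in the source of truth
    drift_ratio: float    # (missing + extra) / size of the source of truth


def compute_drift(source_ids, lake_ids):
    """Compare two collections of revision IDs and summarize their differences."""
    source, lake = set(source_ids), set(lake_ids)
    missing = len(source - lake)
    extra = len(lake - source)
    # Guard against an empty source of truth to avoid division by zero.
    ratio = (missing + extra) / len(source) if source else 0.0
    return DriftMetrics(missing, extra, ratio)


# Toy input: revision 1 is missing from the lake, revision 5 is unexpected.
metrics = compute_drift([1, 2, 3, 4], [2, 3, 4, 5])
print(metrics)
```

In practice the ID sets would come from queries against the Analytics replicas and wmf_dumps.wikitext_raw_rc2, and the resulting values would be handed to whatever metric-publishing mechanism the framework provides.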