
Dashboard and alerting of data quality metrics for wmf_dumps.wikitext_raw
Open, Medium, Public

Description

In T354761, we came up with a first set of data quality metrics. We focused on the issue of data drift when comparing our data lake intermediate table, wmf_dumps.wikitext_raw_rc2, with the Analytics replicas as a source of truth.

Later, in T368753, we replaced this mechanism with a job that detects inconsistencies and saves the results to wmf_dumps.wikitext_inconsistent_rows_rc1. See table DDL here.

In this task, we want to figure out a way to expose data quality metrics for wmf_dumps.wikitext_raw.

  • Figure out which things are most interesting to know about wmf_dumps.wikitext_raw. Some speculations:
    • What percentage of revisions in the last 24 hours have had inconsistencies for a specific wiki, say, enwiki?
    • What percentage of all revisions have inconsistencies for, say, enwiki?
    • Considering that revision deletes have been shown to be an issue, perhaps we should include the number of revision deletes from the last 24 hours that have not yet been applied?
  • Does it make sense to expose these metrics through the Data Quality Framework that the folks from Data Engineering are putting together? If yes, use it. If not, why not, and can we generalize our solution so that we do not end up with N data quality approaches?
  • Implement a dashboard in Superset with the metrics.
    • Maybe with Presto querying wmf_data_ops.data_quality_metrics, or perhaps wmf_dumps.wikitext_inconsistent_rows_rc1?
  • Implement alerting when the quality metrics breach a certain threshold.
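
To make the candidate metrics and the threshold check above concrete, here is a minimal Python sketch. It is only an illustration: the `WikiCounts` record, its field names, the metric names, and the threshold values are all hypothetical, and it does not reflect the actual schema of wmf_dumps.wikitext_inconsistent_rows_rc1 or the Data Quality Framework. It assumes the per-wiki counts have already been obtained by some query.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class WikiCounts:
    """Hypothetical pre-aggregated counts for one wiki (not the real schema)."""
    wiki_db: str
    revisions_last_24h: int
    inconsistent_last_24h: int
    revisions_total: int
    inconsistent_total: int
    unapplied_revision_deletes_last_24h: int


def percentage(part: int, whole: int) -> Optional[float]:
    """Return part/whole as a percentage, or None when whole is zero."""
    if whole == 0:
        return None
    return 100.0 * part / whole


def quality_metrics(counts: WikiCounts) -> Dict[str, Optional[float]]:
    """Compute the three speculative metrics from the task description."""
    return {
        "pct_inconsistent_last_24h": percentage(
            counts.inconsistent_last_24h, counts.revisions_last_24h
        ),
        "pct_inconsistent_overall": percentage(
            counts.inconsistent_total, counts.revisions_total
        ),
        "unapplied_revision_deletes_last_24h": float(
            counts.unapplied_revision_deletes_last_24h
        ),
    }


def breaches(
    metrics: Dict[str, Optional[float]], thresholds: Dict[str, float]
) -> List[str]:
    """Return the names of metrics whose value is strictly above its threshold.

    Metrics that are missing or None (for example, no revisions in the
    window) are skipped rather than treated as a breach.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if metrics.get(name) is not None and metrics[name] > limit
    )
```

A usage example with made-up numbers:

```python
counts = WikiCounts("enwiki", 1000, 5, 1_000_000, 100, 3)
metrics = quality_metrics(counts)
# Illustrative thresholds only; real values would need to be agreed on.
alerts = breaches(metrics, {"pct_inconsistent_last_24h": 0.1,
                            "pct_inconsistent_overall": 0.05})
```

Keeping the metric computation separate from the threshold check would let the same metrics feed both a Superset dashboard and an alert, whichever storage is eventually chosen.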

Event Timeline

xcollazo renamed this task from Hook up data drift metrics into the Data Quality Framework to Dashboard and alerting of data quality metrics for wmf_dumps.wikitext_raw. Aug 13 2024, 4:18 PM
xcollazo updated the task description.
Milimetric triaged this task as Medium priority.
Milimetric moved this task from Sprint Backlog to In Process on the Dumps 2.0 (Kanban Board) board.
xcollazo moved this task from In Process to Sprint Backlog on the Dumps 2.0 (Kanban Board) board.