
Dashboard and alerting of data quality metrics for wmf_content.mediawiki_content_history_v1
Closed, Resolved · Public

Description

In T354761, we came up with a first set of data quality metrics. We focused on the issue of data drift when comparing our data lake intermediate table, wmf_content.mediawiki_content_history_v1, with the Analytics replicas as a source of truth.

But then in T368753, we replaced that mechanism with a job that detects inconsistencies and saves the results to wmf_content.inconsistencies_of_mediawiki_content_history_v1. See table DDL here.

In this task, we want to figure out a way to expose data quality metrics for wmf_content.mediawiki_content_history_v1.

  • Figure out what the most interesting things to know about wmf_content.mediawiki_content_history_v1 are. Some speculations:
    • What percentage of revisions in the last 24 hours have had inconsistencies for a specific wiki, say, enwiki? (See the query sketch after this list.)
    • What percentage of all revisions have inconsistencies for, say, enwiki?
    • Considering that revision deletes have been shown to be an issue, perhaps we should include the number of revision deletes from the last 24 hours that have not yet been applied?
  • Does it make sense to expose these metrics through the Data Quality Framework that the folks from Data Engineering are putting together? If yes, use it; if not, explain why, and consider whether we can generalize our solution so that we don't end up with N data quality approaches.
  • Implement a dashboard in Superset with the metrics
    • Maybe with Presto querying wmf_data_ops.data_quality_metrics, or perhaps wmf_content.inconsistencies_of_mediawiki_content_history_v1?
  • Implement alerting when the quality metrics breach a certain threshold. Moved to new ticket: T384962
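As a rough illustration of the first metric above, here is a sketch of how it could be computed with Spark SQL. The column names wiki_db, revision_id, and revision_dt are assumptions for illustration only; the actual schemas are linked from this task rather than reproduced here.

# Sketch only: wiki_db, revision_id and revision_dt are assumed column names.
spark.sql("""
    SELECT
        100.0 * COUNT(DISTINCT i.revision_id)
              / NULLIF(COUNT(DISTINCT r.revision_id), 0) AS pct_inconsistent_last_24h
    FROM wmf_content.mediawiki_content_history_v1 r
    LEFT JOIN wmf_content.inconsistencies_of_mediawiki_content_history_v1 i
        ON  i.wiki_db = r.wiki_db
        AND i.revision_id = r.revision_id
    WHERE r.wiki_db = 'enwiki'
      AND r.revision_dt >= current_timestamp() - INTERVAL 24 HOURS
""").show(truncate=False)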

Details

Title | Reference | Author | Source Branch | Dest Branch
Bump mediawiki-content-history artifact to 0.3.0 | repos/data-engineering/airflow-dags!1048 | tchin | bump-mw-content-history-artifact | main
Fix metrics deletion | repos/data-engineering/dumps/mediawiki-content-dump!58 | tchin | fix-metrics-delete | main
Add content-history metrics | repos/data-engineering/airflow-dags!973 | tchin | content-history-metrics | main
Add initial pydeequ metrics script | repos/data-engineering/dumps/mediawiki-content-dump!51 | tchin | add-metrics | main

Event Timeline

xcollazo renamed this task from Hook up data drift metrics into the Data Quality Framework to Dashboard and alerting of data quality metrics for wmf_dumps.wikitext_raw.Aug 13 2024, 4:18 PM
xcollazo updated the task description.
Milimetric triaged this task as Medium priority.
Milimetric moved this task from Sprint Backlog to In Process on the Dumps 2.0 (Kanban Board) board.
xcollazo moved this task from In Process to Sprint Backlog on the Dumps 2.0 (Kanban Board) board.
xcollazo renamed this task from Dashboard and alerting of data quality metrics for wmf_dumps.wikitext_raw to Dashboard and alerting of data quality metrics for wmf_content.mediawiki_content_history_v1.Jan 21 2025, 4:12 PM
xcollazo updated the task description.

Mentioned in SAL (#wikimedia-operations) [2025-01-22T17:58:13Z] <tchin@deploy2002> Started deploy [airflow-dags/analytics@07104ff]: Deploying latest dags for analytics airflow instance T357684

Mentioned in SAL (#wikimedia-operations) [2025-01-22T17:58:50Z] <tchin@deploy2002> Finished deploy [airflow-dags/analytics@07104ff]: Deploying latest dags for analytics airflow instance T357684 (duration: 01m 53s)

(I just fixed an issue with the pipeline, so compute_metrics should run in the next few hours.)

The task is failing with:

sudo -u analytics yarn logs -applicationId application_1734703658237_1050294


venv/bin/python: can't open file '/var/lib/hadoop/data/d/yarn/local/usercache/analytics/appcache/application_1734703658237_1050294/filecache/10/mediawiki-content-dump-0.2.0.dev0-the-big-rename-pt-3.conda.tgz/bin/compute_metrics.py': [Errno 2] No such file or directory

We were not picking up the new jar because we were overriding it via DagProperties.

I've deleted the configs for mw_content_reconcile_mw_content_history_daily and mw_content_reconcile_mw_content_history_monthly.

I just realized we have a bug at https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/42fb3f13220600714b76acc3dd6f9ca969f796a2/mediawiki_content_dump/compute_metrics.py#L187:

# Delete old metrics in case of rerun
spark.sql(f"DELETE FROM {args.metrics_table} WHERE partition_ts = CAST('{args.min_timestamp}' AS TIMESTAMP)")

This DELETE statement should filter on source_table as well, otherwise we could delete metrics belonging to other tables:

spark.sql("""
SELECT DISTINCT source_table
FROM wmf_data_ops.data_quality_metrics
""").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|source_table                                                                                  |
+----------------------------------------------------------------------------------------------+
|`wmf`.`webrequest`                                                                            |
|`wmf`.`mediawiki_history`                                                                     |
|`wmf_content`.`mediawiki_content_history_v1`                                                  |
|`wmf_content`.`inconsistent_rows_of_mediawiki_content_history_v1`                             |
|`wmf_content`.`inconsistent_rows_of_mediawiki_content_history_v1_mediawiki_content_history_v1`|
+----------------------------------------------------------------------------------------------+
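A minimal sketch of the fix, also scoping the delete to this job's own source table. The args.source_table argument is hypothetical (not necessarily the real parameter name) and would hold the fully qualified name, e.g. `wmf_content`.`mediawiki_content_history_v1`:

# Sketch of the fix: also filter on source_table so a rerun only deletes its own metrics.
# args.source_table is a hypothetical argument holding the fully qualified table name.
spark.sql(f"""
    DELETE FROM {args.metrics_table}
    WHERE partition_ts = CAST('{args.min_timestamp}' AS TIMESTAMP)
      AND source_table = '{args.source_table}'
""")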

Nice catch! Putting up a patch

First compute_metrics run is a success! 🎉

Details at https://yarn.wikimedia.org/spark-history/history/application_1734703658237_1050631/jobs/

It did take ~3 hours though, with 8720 SQL statements... so 10 per wiki.

3 hours is long considering the reconcile itself takes ~2 hours. A problem for another day, but this could use some optimization. Perhaps we can run multiple wikis in parallel, or perhaps we can get rid of Deequ 😉 .
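For future reference, a rough sketch of the parallelization idea, assuming a hypothetical per-wiki helper (compute_metrics_for_wiki below is not an existing function). Spark's scheduler can interleave jobs submitted concurrently from multiple driver threads:

from concurrent.futures import ThreadPoolExecutor

def compute_metrics_for_wiki(spark, wiki_db):
    # Hypothetical helper: runs the ~10 per-wiki metric statements for one wiki.
    ...

# Drive several wikis at once from separate threads; the Spark scheduler
# interleaves their jobs on the cluster instead of running them serially.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(compute_metrics_for_wiki, spark, wiki) for wiki in wikis]
    for f in futures:
        f.result()  # re-raise any per-wiki failure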

Mentioned in SAL (#wikimedia-operations) [2025-01-27T18:10:01Z] <tchin@deploy2002> Started deploy [airflow-dags/analytics@c49f40b]: Deploying airflow for T357684

Mentioned in SAL (#wikimedia-operations) [2025-01-27T18:10:39Z] <tchin@deploy2002> Finished deploy [airflow-dags/analytics@c49f40b]: Deploying airflow for T357684 (duration: 01m 01s)

@tchin can we close this? If so, please update this task with a summary of what was accomplished, and with links to tickets with next steps. Thanks!

Summary

  • Created a Python script and Airflow DAG for computing metrics
  • Dogfooded refinery-python and therefore PyDeequ
    • refinery-python doesn't work with the latest version of PyDeequ. We're currently pinning it, but it should be upgraded.
  • Discovered that Deequ has some major quirks, or rather that we're not using it for its intended purpose
    • Can't directly insert metric values: metrics are always computed by, and therefore associated with, an Analyzer (see the sketch after this list).
    • Can't implement custom Analyzers in PyDeequ (GitHub Issue)
    • Can't compute metrics across tables. A workaround had to be used.
    • Doesn't output metrics on empty data (except for Size), i.e. asking for the Completeness of a column on a DataFrame with 0 records results in no metrics.
  • Created a Superset dashboard with the computed metrics; it turns out Superset also has some quirks
    • Superset expects a table to compute metrics over, not a table of already computed metrics. Some workarounds had to be used.
    • A dashboard covering 500+ wikis is not that helpful; perhaps it should be split into multiple smaller dashboards.
    • Superset does not handle time series data that well: no metrics on a specific day results in no data points, and resampling is only available for some charts.
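To illustrate the Analyzer quirk mentioned above, a minimal PyDeequ sketch: there is no API for writing an arbitrary metric value, you can only run Analyzers and collect what they compute. The column name revision_id is an assumption, and the SparkSession is assumed to be configured with the Deequ jar.

from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness, Size

# Metrics only exist as the output of Analyzers run against a DataFrame.
result = (
    AnalysisRunner(spark)
    .onData(df)                                # df: the table/partition being checked
    .addAnalyzer(Size())                       # row count
    .addAnalyzer(Completeness("revision_id"))  # fraction of non-null values (assumed column)
    .run()
)

# Collect the computed metrics as a DataFrame for downstream storage.
AnalyzerContext.successMetricsAsDataFrame(spark, result).show(truncate=False)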

Next steps

  • Decide if Deequ is the right choice
  • T382703 Move refinery-python under the data-engineering group, if it is
  • T384871 Improve dashboarding
  • T384962 Add alerting

Beautiful summary, thanks!