
Dashboard and alerting of data quality metrics for wmf_content.mediawiki_content_history_v1
Closed, Resolved · Public

Description

In T354761, we came up with a first set of data quality metrics. We focused on the issue of data drift when comparing our data lake intermediate table, wmf_content.mediawiki_content_history_v1, with the Analytics replicas as a source of truth.

But then in T368753, we replaced that mechanism with a job that detects inconsistencies and saves the results to wmf_content.inconsistencies_of_mediawiki_content_history_v1. See table DDL here.

In this task, we want to figure out a way to expose data quality metrics for wmf_content.mediawiki_content_history_v1.

  • Figure out what the most interesting things to know about wmf_content.mediawiki_content_history_v1 are. Some speculations:
    • What percentage of revisions in the last 24 hours have had inconsistencies for a specific wiki, say, enwiki? (See the query sketch after this list.)
    • What percentage of all revisions have inconsistencies for, say, enwiki?
    • Considering that revision deletes have been shown to be an issue, perhaps we should include the number of revision deletes from the last 24 hours that have not yet been applied?
  • Does it make sense to expose these metrics through the Data Quality Framework that the folks from Data Engineering are putting together? If yes, use it; if not, explain why, and consider whether we can generalize our solution so that we don't end up with N data quality approaches.
  • Implement a dashboard in Superset with the metrics
    • Maybe with Presto querying wmf_data_ops.data_quality_metrics, or perhaps wmf_content.inconsistencies_of_mediawiki_content_history_v1?
  • Implement alerting when the quality metrics breach a certain threshold. Moved to new ticket: T384962
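As a rough illustration of the first metric above, here is a sketch of how it could be computed with Spark SQL. The column names wiki_db, revision_id, and revision_dt are assumptions for illustration only; the actual schemas are linked from this task rather than reproduced here.

# Sketch only: wiki_db, revision_id and revision_dt are assumed column names.
spark.sql("""
    SELECT
        100.0 * COUNT(DISTINCT i.revision_id)
              / NULLIF(COUNT(DISTINCT r.revision_id), 0) AS pct_inconsistent_last_24h
    FROM wmf_content.mediawiki_content_history_v1 r
    LEFT JOIN wmf_content.inconsistencies_of_mediawiki_content_history_v1 i
        ON  i.wiki_db = r.wiki_db
        AND i.revision_id = r.revision_id
    WHERE r.wiki_db = 'enwiki'
      AND r.revision_dt >= current_timestamp() - INTERVAL 24 HOURS
""").show(truncate=False)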

Details

Title | Reference | Author | Source Branch | Dest Branch
Bump mediawiki-content-history artifact to 0.3.0 | repos/data-engineering/airflow-dags!1048 | tchin | bump-mw-content-history-artifact | main
Fix metrics deletion | repos/data-engineering/dumps/mediawiki-content-dump!58 | tchin | fix-metrics-delete | main
Add content-history metrics | repos/data-engineering/airflow-dags!973 | tchin | content-history-metrics | main
Add initial pydeequ metrics script | repos/data-engineering/dumps/mediawiki-content-dump!51 | tchin | add-metrics | main

Event Timeline

xcollazo renamed this task from Hook up data drift metrics into the Data Quality Framework to Dashboard and alerting of data quality metrics for wmf_dumps.wikitext_raw.Aug 13 2024, 4:18 PM
xcollazo updated the task description.
Milimetric triaged this task as Medium priority.
Milimetric moved this task from Sprint Backlog to In Process on the Dumps 2.0 (Kanban Board) board.
xcollazo moved this task from In Process to Sprint Backlog on the Dumps 2.0 (Kanban Board) board.
xcollazo renamed this task from Dashboard and alerting of data quality metrics for wmf_dumps.wikitext_raw to Dashboard and alerting of data quality metrics for wmf_content.mediawiki_content_history_v1.Jan 21 2025, 4:12 PM
xcollazo updated the task description.

Mentioned in SAL (#wikimedia-operations) [2025-01-22T17:58:13Z] <tchin@deploy2002> Started deploy [airflow-dags/analytics@07104ff]: Deploying latest dags for analytics airflow instance T357684

Mentioned in SAL (#wikimedia-operations) [2025-01-22T17:58:50Z] <tchin@deploy2002> Finished deploy [airflow-dags/analytics@07104ff]: Deploying latest dags for analytics airflow instance T357684 (duration: 01m 53s)

(I just fixed an issue with the pipeline, so compute_metrics should run in the next few hours.)

The task is failing with:

sudo -u analytics yarn logs -applicationId application_1734703658237_1050294


venv/bin/python: can't open file '/var/lib/hadoop/data/d/yarn/local/usercache/analytics/appcache/application_1734703658237_1050294/filecache/10/mediawiki-content-dump-0.2.0.dev0-the-big-rename-pt-3.conda.tgz/bin/compute_metrics.py': [Errno 2] No such file or directory

We were not picking up the new jar because we were overriding it via DagProperties.

I've deleted the configs for mw_content_reconcile_mw_content_history_daily and mw_content_reconcile_mw_content_history_monthly.

I just realized we have a bug at https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/42fb3f13220600714b76acc3dd6f9ca969f796a2/mediawiki_content_dump/compute_metrics.py#L187:

# Delete old metrics in case of rerun
spark.sql(f"DELETE FROM {args.metrics_table} WHERE partition_ts = CAST('{args.min_timestamp}' AS TIMESTAMP)")

This DELETE statement should filter on source_table as well, otherwise we could delete metrics belonging to other tables:

spark.sql("""
SELECT DISTINCT source_table
FROM wmf_data_ops.data_quality_metrics
""").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|source_table                                                                                  |
+----------------------------------------------------------------------------------------------+
|`wmf`.`webrequest`                                                                            |
|`wmf`.`mediawiki_history`                                                                     |
|`wmf_content`.`mediawiki_content_history_v1`                                                  |
|`wmf_content`.`inconsistent_rows_of_mediawiki_content_history_v1`                             |
|`wmf_content`.`inconsistent_rows_of_mediawiki_content_history_v1_mediawiki_content_history_v1`|
+----------------------------------------------------------------------------------------------+
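A minimal sketch of the fix, also scoping the delete to this job's own source table. The args.source_table argument is hypothetical (not necessarily the real parameter name) and would hold the fully qualified name, e.g. `wmf_content`.`mediawiki_content_history_v1`:

# Sketch of the fix: also filter on source_table so a rerun only deletes its own metrics.
# args.source_table is a hypothetical argument holding the fully qualified table name.
spark.sql(f"""
    DELETE FROM {args.metrics_table}
    WHERE partition_ts = CAST('{args.min_timestamp}' AS TIMESTAMP)
      AND source_table = '{args.source_table}'
""")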

Nice catch! Putting up a patch

First compute_metrics run is a success! 🎉

Details at https://yarn.wikimedia.org/spark-history/history/application_1734703658237_1050631/jobs/

It did take ~3 hours though, with 8720 SQL statements... so 10 per wiki.

3 hours is long considering the reconcile itself takes ~2 hours. A problem for another day, but this could use some optimization. Perhaps we can run multiple wikis in parallel, or perhaps we can get rid of Deequ 😉 .
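For future reference, a rough sketch of the parallelization idea, assuming a hypothetical per-wiki helper (compute_metrics_for_wiki below is not an existing function). Spark's scheduler can interleave jobs submitted concurrently from multiple driver threads:

from concurrent.futures import ThreadPoolExecutor

def compute_metrics_for_wiki(spark, wiki_db):
    # Hypothetical helper: runs the ~10 per-wiki metric statements for one wiki.
    ...

# Drive several wikis at once from separate threads; the Spark scheduler
# interleaves their jobs on the cluster instead of running them serially.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(compute_metrics_for_wiki, spark, wiki) for wiki in wikis]
    for f in futures:
        f.result()  # re-raise any per-wiki failure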

Mentioned in SAL (#wikimedia-operations) [2025-01-27T18:10:01Z] <tchin@deploy2002> Started deploy [airflow-dags/analytics@c49f40b]: Deploying airflow for T357684

Mentioned in SAL (#wikimedia-operations) [2025-01-27T18:10:39Z] <tchin@deploy2002> Finished deploy [airflow-dags/analytics@c49f40b]: Deploying airflow for T357684 (duration: 01m 01s)

@tchin can we close this? If so, please update this task with a summary of what was accomplished, and with links to tickets with next steps. Thanks!

Summary

  • Created a Python script and Airflow DAG for computing metrics
  • Dogfooded refinery-python and therefore PyDeequ
    • refinery-python doesn't work with the latest version of PyDeequ. We're currently pinning it, but it should be upgraded.
  • Discovered that Deequ has some major quirks, or rather that we're not using it for its intended purpose
    • Can't directly insert metric values: metrics are always computed by, and therefore associated with, an Analyzer (see the sketch after this list).
    • Can't implement custom Analyzers in PyDeequ (GitHub Issue)
    • Can't compute metrics across tables. A workaround had to be used.
    • Doesn't output metrics on empty data (except for Size), i.e. asking for the Completeness of a column on a DataFrame with 0 records results in no metrics.
  • Created a Superset dashboard with the computed metrics; it turns out Superset also has some quirks
    • Superset expects a table to compute metrics over, not a table of already computed metrics. Some workarounds had to be used.
    • A dashboard covering 500+ wikis is not that helpful; perhaps it should be split into multiple smaller dashboards.
    • Superset does not handle time series data that well: no metrics on a specific day results in no data points, and resampling is only available for some charts.
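To illustrate the Analyzer quirk mentioned above, a minimal PyDeequ sketch: there is no API for writing an arbitrary metric value, you can only run Analyzers and collect what they compute. The column name revision_id is an assumption, and the SparkSession is assumed to be configured with the Deequ jar.

from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness, Size

# Metrics only exist as the output of Analyzers run against a DataFrame.
result = (
    AnalysisRunner(spark)
    .onData(df)                                # df: the table/partition being checked
    .addAnalyzer(Size())                       # row count
    .addAnalyzer(Completeness("revision_id"))  # fraction of non-null values (assumed column)
    .run()
)

# Collect the computed metrics as a DataFrame for downstream storage.
AnalyzerContext.successMetricsAsDataFrame(spark, result).show(truncate=False)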

Next steps

  • Decide if Deequ is the right choice
  • T382703 Move refinery-python under the data-engineering group, if it is
  • T384871 Improve dashboarding
  • T384962 Add alerting

Beautiful summary, thanks!