
[Data Quality] Implement basic data quality metrics for MW history
Open, Needs Triage · Public · 8 Estimated Story Points

Description

The following metrics and checks should be implemented:

  • Provide row counts, with alerts when the count changes by more than 5%
  • Check page_id is not null or 0, see T259823
  • Detect duplicates on (wiki_db, event_entity, event_type, event_timestamp)

The duplicate check should catch the issue we fixed by setting retries to 0 for the job, in case that setting is changed later.
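
A minimal sketch of the page_id and duplicate checks listed above, written in plain Python over in-memory rows rather than against the real job. The function names and sample rows are hypothetical, and applying the page_id check to every row (rather than only to page or revision events) is an assumption, not something this task specifies.

```python
from collections import Counter

# Key taken from the task description.
DEDUP_KEY = ("wiki_db", "event_entity", "event_type", "event_timestamp")


def invalid_page_ids(rows):
    """Return the rows whose page_id is missing (None) or 0."""
    return [r for r in rows if r.get("page_id") in (None, 0)]


def duplicate_keys(rows, key=DEDUP_KEY):
    """Map each key tuple that occurs more than once to its occurrence count."""
    counts = Counter(tuple(r.get(c) for c in key) for r in rows)
    return {k: n for k, n in counts.items() if n > 1}


# Hypothetical sample rows: the second row repeats the first,
# and the third has page_id 0.
sample_rows = [
    {"wiki_db": "enwiki", "event_entity": "revision", "event_type": "create",
     "event_timestamp": "2024-01-01 00:00:00", "page_id": 10},
    {"wiki_db": "enwiki", "event_entity": "revision", "event_type": "create",
     "event_timestamp": "2024-01-01 00:00:00", "page_id": 10},
    {"wiki_db": "dewiki", "event_entity": "page", "event_type": "move",
     "event_timestamp": "2024-01-02 00:00:00", "page_id": 0},
]

bad_rows = invalid_page_ids(sample_rows)
dupes = duplicate_keys(sample_rows)
```

A job that is retried and writes the same events twice would show up in `dupes` as keys with a count of 2 or more.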

Event Timeline

Ahoelzl added a subscriber: Antoine_Quhen.
lbowmaker set the point value for this task to 5. (Jan 17 2024, 12:49 PM)
Ahoelzl changed the point value for this task from 5 to 8.
Ahoelzl added a subscriber: gmodena.

@JAllemandou mentioned existing MW checks that should be migrated.

Change 1008934 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Mediawiki History Data Quality Metrics

https://gerrit.wikimedia.org/r/1008934

To detect changes in size, the implementation compares the previous and current snapshots by reading the previous snapshot into a dataframe.
One TODO is to use the AWS Deequ anomaly detection and file system repository capabilities to implement this check.
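
As a rough illustration of the snapshot comparison described above, the sketch below computes the relative change in row count between two snapshots and flags it against the 5% threshold from the task description. It is plain Python, not the merged Spark code; the function names are hypothetical, and treating the threshold as strict (more than 5%) and symmetric (growth or shrinkage) is an assumption.

```python
def row_count_change(previous_count, current_count):
    """Relative change from the previous snapshot to the current one."""
    if previous_count <= 0:
        raise ValueError("previous_count must be positive")
    return (current_count - previous_count) / previous_count


def exceeds_threshold(previous_count, current_count, threshold=0.05):
    """True when the row count grew or shrank by more than the threshold."""
    return abs(row_count_change(previous_count, current_count)) > threshold


# Hypothetical counts for two consecutive snapshots.
small_growth = exceeds_threshold(1000, 1040)   # 4% growth
large_growth = exceeds_threshold(1000, 1060)   # 6% growth
large_drop = exceeds_threshold(1000, 900)      # 10% drop
```

Comparing only against the immediately preceding snapshot keeps the check simple, but it cannot tell a one-off spike from a trend, which is presumably part of the motivation for the anomaly detection TODO.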

Change #1008934 merged by jenkins-bot:

[analytics/refinery/source@master] Mediawiki History Data Quality Metrics

https://gerrit.wikimedia.org/r/1008934

The originally defined set of columns for the uniqueness check results in almost 45% redundancy:
https://superset.wikimedia.org/superset/explore/p/dk7bE7BQjzV/

A new set is proposed for more meaningful tracking:

  • wiki_db
  • event_entity
  • event_type
  • timestamp
  • event_user_text_historical
  • user_text_historical
  • page_id
  • revision_id
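
To show why a wider key changes the result, here is a small sketch that measures redundancy as the fraction of rows repeating an already-seen key, i.e. 1 minus distinct keys over total rows. That definition is an assumption about how the figure above was computed, the helper and sample rows are hypothetical, and the sketch assumes `timestamp` in the proposed list refers to the `event_timestamp` column.

```python
ORIGINAL_KEY = ("wiki_db", "event_entity", "event_type", "event_timestamp")

# Assumption: "timestamp" in the proposed list is taken to mean event_timestamp.
PROPOSED_KEY = (
    "wiki_db", "event_entity", "event_type", "event_timestamp",
    "event_user_text_historical", "user_text_historical",
    "page_id", "revision_id",
)


def redundancy(rows, columns):
    """Fraction of rows whose key was already seen: 1 - distinct / total."""
    total = len(rows)
    if total == 0:
        return 0.0
    distinct = len({tuple(r.get(c) for c in columns) for r in rows})
    return 1 - distinct / total


# Two distinct revisions created on the same wiki in the same second.
sample_rows = [
    {"wiki_db": "enwiki", "event_entity": "revision", "event_type": "create",
     "event_timestamp": "2024-01-01 00:00:00",
     "event_user_text_historical": "Alice", "user_text_historical": None,
     "page_id": 10, "revision_id": 100},
    {"wiki_db": "enwiki", "event_entity": "revision", "event_type": "create",
     "event_timestamp": "2024-01-01 00:00:00",
     "event_user_text_historical": "Bob", "user_text_historical": None,
     "page_id": 11, "revision_id": 101},
]

original_redundancy = redundancy(sample_rows, ORIGINAL_KEY)
proposed_redundancy = redundancy(sample_rows, PROPOSED_KEY)
```

With the narrow key, unrelated events that share a wiki, entity, type and timestamp are counted as duplicates; the wider key only collapses rows that also agree on user, page and revision.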

Change #1049561 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Update column definition for uniqueness check.

https://gerrit.wikimedia.org/r/1049561

Change #1049563 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Update column definition for uniqueness check.

https://gerrit.wikimedia.org/r/1049563

Change #1049563 abandoned by Snwachukwu:

[analytics/refinery/source@master] Update column definition for uniqueness check.

Reason:

Wrong changes

https://gerrit.wikimedia.org/r/1049563

Change #1049561 abandoned by Snwachukwu:

[analytics/refinery/source@master] Update column definition for uniqueness check.

Reason:

Wrong files added to patch

https://gerrit.wikimedia.org/r/1049561

Change #1049580 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Update column definition for uniqueness check.

https://gerrit.wikimedia.org/r/1049580

Change #1049580 merged by jenkins-bot:

[analytics/refinery/source@master] Update column definition for uniqueness check.

https://gerrit.wikimedia.org/r/1049580

ebysans opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/744

Use refinery job jar v0.2.43 containing fix for MediawikiHistory duplicate checker

ebysans merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/744

Use refinery job jar v0.2.43 containing fix for MediawikiHistory duplicate checker