
[Data Quality] Implement basic data quality metrics for MW history
Open, Needs Triage · Public · 8 Estimated Story Points


The following metrics and checks should be implemented:

  • Provide row counts and alerts for 5% changes
  • Check page_id is not null or 0, see T259823
  • Detect dupes on (wiki_db, event_entity, event_type, event_timestamp)

The duplicate detection should catch the issue we fixed by setting retries to 0 for the job, in case that setting is ever changed.
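As a rough illustration of the two row-level checks above (invalid page_id and duplicate keys), here is a minimal pure-Python sketch. It is not the refinery implementation: the function names and sample rows are hypothetical, and it applies the page_id check to every row, whereas the real check may need to be limited to certain event entities.

```python
from collections import Counter

# Key columns named in the task description.
DUPLICATE_KEY = ("wiki_db", "event_entity", "event_type", "event_timestamp")


def invalid_page_id_rows(rows):
    """Return the rows whose page_id is missing, None, or 0."""
    return [row for row in rows if row.get("page_id") in (None, 0)]


def duplicate_keys(rows, key_columns=DUPLICATE_KEY):
    """Map each key tuple that occurs more than once to its row count."""
    counts = Counter(tuple(row.get(col) for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"wiki_db": "enwiki", "event_entity": "page", "event_type": "create",
     "event_timestamp": "2024-01-01 00:00:00", "page_id": 10},
    # Same key as the row above, as a retried job might produce.
    {"wiki_db": "enwiki", "event_entity": "page", "event_type": "create",
     "event_timestamp": "2024-01-01 00:00:00", "page_id": 10},
    {"wiki_db": "enwiki", "event_entity": "revision", "event_type": "create",
     "event_timestamp": "2024-01-01 00:00:05", "page_id": 0},
    {"wiki_db": "dewiki", "event_entity": "page", "event_type": "delete",
     "event_timestamp": "2024-01-01 00:00:09", "page_id": None},
]

print(len(invalid_page_id_rows(sample_rows)))
print(duplicate_keys(sample_rows))
```

In a Spark job the same idea would typically be a filter on page_id and a group-by on the key columns with a count greater than 1.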

Event Timeline

Ahoelzl added a subscriber: Antoine_Quhen.
lbowmaker set the point value for this task to 5. (Jan 17 2024, 12:49 PM)
Ahoelzl changed the point value for this task from 5 to 8.
Ahoelzl added a subscriber: gmodena.

@JAllemandou mentioned existing MW checks that should be migrated.

Change 1008934 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Mediawiki History Data Quality Metrics

In the implementation of the size-change detection, I compare the previous and current snapshots by reading the previous snapshot into a DataFrame.
One TODO is to use the AWS Deequ anomaly detection and file system repository capabilities to implement this check instead.
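The snapshot comparison described here boils down to a relative-change test on two row counts. A minimal sketch, assuming the 5% threshold from the task description applies to both increases and decreases and is exclusive; the function names are hypothetical and this is not the merged code:

```python
def relative_change(previous_count, current_count):
    """Signed change of the current row count relative to the previous one."""
    if previous_count == 0:
        raise ValueError("previous snapshot has no rows; relative change is undefined")
    return (current_count - previous_count) / previous_count


def row_count_alert(previous_count, current_count, threshold=0.05):
    """Return True when the row count moved by more than `threshold` in either direction."""
    return abs(relative_change(previous_count, current_count)) > threshold


# Hypothetical counts: a 6% increase triggers the alert, a 4% increase does not.
print(row_count_alert(1000, 1060))
print(row_count_alert(1000, 1040))
```

Reading the whole previous snapshot just to count it is the simple approach; persisting the count as a metric would avoid the second read, which is roughly what a metrics repository is for.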

Change #1008934 merged by jenkins-bot:

[analytics/refinery/source@master] Mediawiki History Data Quality Metrics

The originally defined set of columns for the uniqueness check results in almost 45% redundancy.

A new set is proposed for more meaningful tracking:

  • wiki_db
  • event_entity
  • event_type
  • timestamp
  • event_user_text_historical
  • user_text_historical
  • page_id
  • revision_id
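To make the effect of the column choice concrete, the sketch below measures redundancy for two key sets on the same rows. It assumes "redundancy" means the fraction of rows that repeat an already seen key, and that `timestamp` in the list above refers to the `event_timestamp` column; both are assumptions, and the sample rows are hypothetical.

```python
ORIGINAL_KEY = ("wiki_db", "event_entity", "event_type", "event_timestamp")
# Proposed set from the list above; "event_timestamp" stands in for "timestamp" (assumption).
PROPOSED_KEY = ORIGINAL_KEY + (
    "event_user_text_historical", "user_text_historical", "page_id", "revision_id",
)


def redundancy(rows, key_columns):
    """Fraction of rows whose key duplicates that of another row already counted."""
    if not rows:
        return 0.0
    distinct = {tuple(row.get(col) for col in key_columns) for row in rows}
    return 1 - len(distinct) / len(rows)


# Hypothetical rows: three distinct revisions created in the same second, plus one page event.
base = {"wiki_db": "enwiki", "event_type": "create",
        "event_timestamp": "2024-01-01 00:00:00"}
sample_rows = [
    {**base, "event_entity": "revision", "page_id": 1, "revision_id": 101},
    {**base, "event_entity": "revision", "page_id": 2, "revision_id": 102},
    {**base, "event_entity": "revision", "page_id": 3, "revision_id": 103},
    {**base, "event_entity": "page", "page_id": 4},
]

print(redundancy(sample_rows, ORIGINAL_KEY))
print(redundancy(sample_rows, PROPOSED_KEY))
```

With the narrower key, legitimate events that share a wiki, entity, type, and timestamp are counted as duplicates; adding page_id and revision_id separates them, so the remaining duplicates are more likely to be real ones.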

Change #1049561 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Update column definition for uniqueness check.

Change #1049563 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Update column definition for uniqueness check.

Change #1049563 abandoned by Snwachukwu:

[analytics/refinery/source@master] Update column definition for uniqueness check.


Wrong changes

Change #1049561 abandoned by Snwachukwu:

[analytics/refinery/source@master] Update column definition for uniqueness check.


Wrong files added to patch

Change #1049580 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Update column definition for uniqueness check.

Change #1049580 merged by jenkins-bot:

[analytics/refinery/source@master] Update column definition for uniqueness check.

ebysans opened

Use refinery job jar v0.2.43 containing fix for MediawikiHistory duplicate checker

ebysans merged

Use refinery job jar v0.2.43 containing fix for MediawikiHistory duplicate checker