Migrate the MWHistoryChecker to DeeQu framework:
Follow on from this task:
Title | Reference | Author | Source Branch | Dest Branch
---|---|---|---|---
Update MediawikiHistory check denormalize dag | repos/data-engineering/airflow-dags!688 | ebysans | mw | main
Status | Subtype | Assigned | Task
---|---|---|---
Open | | lbowmaker | T345912 [Data Quality] SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents
Open | | Snwachukwu | T361016 [Data Quality] Migrate MWHistoryChecker to DeeQu checks
Change #1024423 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):
[analytics/refinery/source@master] Upgrade MediawikiHistory Checker to use AWS Deequ. 1. Update User history checker 2. Update Page history checker 3. Update Denormalized history checker
Change #1024423 merged by Snwachukwu:
[analytics/refinery/source@master] Upgrade MediawikiHistory Checker to use AWS Deequ. 1. Update User history checker 2. Update Page history checker 3. Update Denormalized history checker
ebysans opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/688
Update MediawikiHistory check denormalize dag
amastilovic merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/688
Update MediawikiHistory check denormalize dag
Change #1047599 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):
[analytics/refinery/source@master] Fix MediawikiHistory Checker Null Exceptions
Change #1047599 merged by jenkins-bot:
[analytics/refinery/source@master] Fix MediawikiHistory Checker Null Exceptions
This fix resolves the null input exceptions we hit after deploying the migration. The growth values derived from MediaWiki history come out null when the denominator used to compute the growth is 0.
The original MediawikiChecker had a latent bug. The check compares the previous MediaWiki history snapshot with the new one and computes a growth ratio: the difference between the two snapshots divided by the value of the previous snapshot (the denominator). Sometimes the value of the previous snapshot is 0.
The old MediaWiki checker had no trouble comparing null values to growth thresholds. Deequ, however, doesn't cope with columns that are entirely null.
To handle this, @JAllemandou and I think it's best to replace the denominator with 1 when it is 0, i.e.:
COALESCE(NULLIF(value, 0), 1)
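
To illustrate the effect of that expression, here is a minimal sketch using Python's built-in `sqlite3` module rather than the actual Spark job. The table `metrics`, its columns `prev_value` and `new_value`, and the sample rows are all hypothetical and not taken from the checker code. SQLite, like Spark SQL in its default non-ANSI mode, returns NULL on division by zero, so it reproduces the null growth values described above.

```python
import sqlite3

# Hypothetical metric rows: (name, previous snapshot value, new snapshot value).
rows = [
    ("revisions", 100.0, 110.0),
    ("new_wiki_pages", 0.0, 5.0),  # previous snapshot is 0
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (name TEXT, prev_value REAL, new_value REAL)")
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)

# Naive growth ratio: dividing by a zero denominator yields NULL.
naive = dict(conn.execute(
    "SELECT name, (new_value - prev_value) / prev_value FROM metrics"
).fetchall())

# Guarded growth ratio: NULLIF turns 0 into NULL, COALESCE then replaces it with 1.
guarded = dict(conn.execute(
    "SELECT name, "
    "(new_value - prev_value) / COALESCE(NULLIF(prev_value, 0), 1) "
    "FROM metrics"
).fetchall())

print("naive:  ", naive)
print("guarded:", guarded)
```

With the guard in place, a metric whose previous value is 0 reports its absolute change as the growth instead of a null, so the column passed to the check is never entirely null for that reason.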