
[Data Quality] Migrate MWHistoryChecker to DeeQu checks
Open, Needs Triage | Public | 8 Estimated Story Points

Event Timeline

Change #1024423 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Upgrade MediawikiHistory Checker to use AWS Deequ. 1. Update User history checker 2. Update Page history checker 3. Update Denormalized history checker

https://gerrit.wikimedia.org/r/1024423

Change #1024423 merged by Snwachukwu:

[analytics/refinery/source@master] Upgrade MediawikiHistory Checker to use AWS Deequ. 1. Update User history checker 2. Update Page history checker 3. Update Denormalized history checker

https://gerrit.wikimedia.org/r/1024423

Change #1047599 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery/source@master] Fix MediawikiHistory Checker Null Exceptions

https://gerrit.wikimedia.org/r/1047599

Change #1047599 merged by jenkins-bot:

[analytics/refinery/source@master] Fix MediawikiHistory Checker Null Exceptions

https://gerrit.wikimedia.org/r/1047599

A fix has been merged to resolve the null-input exceptions we hit after deploying the migration. The null values in the MediaWiki history checks appear because the denominator used to derive the growth ratio is 0.

The original MediawikiChecker had a latent bug. The checker compares the previous MediaWiki history snapshot with the new one and computes a growth ratio: the difference between the two snapshot values divided by the value from the previous snapshot (the denominator). That previous value can sometimes be 0, in which case the ratio comes out null.
The old checker had no trouble comparing null values against the growth thresholds. Deequ, however, does not accept columns that are entirely null.

To handle this, @JAllemandou and I think it's best to replace the denominator with 1 whenever it is 0, i.e.
COALESCE(NULLIF(value, 0), 1)
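
A minimal sketch of the effect of this guard, using Python's built-in sqlite3 as a stand-in for Spark SQL. The table and column names (metrics, prev_value, new_value) are hypothetical and are not taken from the actual checker query; only the COALESCE(NULLIF(..., 0), 1) expression comes from the comment above. SQLite, like Spark SQL in its default non-ANSI mode, returns NULL for division by zero, so the unguarded query illustrates the null growth value.

```python
import sqlite3

# Hypothetical metrics table: one row per metric, with its value in the
# previous snapshot and in the new snapshot.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (name TEXT, prev_value REAL, new_value REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [("a", 100.0, 110.0), ("b", 0.0, 5.0)],
)

# Unguarded growth ratio: division by a zero denominator yields NULL.
naive = conn.execute(
    "SELECT name, (new_value - prev_value) / prev_value "
    "FROM metrics ORDER BY name"
).fetchall()

# Guarded growth ratio: NULLIF turns a 0 denominator into NULL,
# and COALESCE then replaces that NULL with 1.
guarded = conn.execute(
    "SELECT name, (new_value - prev_value) / COALESCE(NULLIF(prev_value, 0), 1) "
    "FROM metrics ORDER BY name"
).fetchall()

print(naive)
print(guarded)
```

One consequence worth keeping in mind: when the previous value is 0, the guarded "growth" is simply the absolute difference between the two snapshots, so it is compared against a ratio threshold even though it is not a true ratio.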