
Verify newly generated data before pushing it to production
Open, MediumPublic

Description

Once we generate article normalized scores for production use, we need to verify that the data is not corrupt.

A few automated checks should suffice for now:

  • For a given language pair, the total number of normalized scores should not differ from the previous run by more than 1% or 1,000 rows, whichever is lower. For example, if today en-uz.tsv has 5,690,030 results, then the next time we generate data this number should be between 5,689,030 and 5,691,030. These thresholds are arbitrary, but they make sense because we don't expect many new articles to be created quickly. We could also look at historical numbers and come up with better thresholds.
  • For a given language pair and date, retrieve the article counts for those language wikis on that date, and calculate the difference (say, Δ) in article counts. Make sure that the number of recommendations is within 1% of Δ.
  • Make sure that the recommendation scores are floating-point numbers between 0 and 1.
  • ...
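
The checks above could be sketched roughly as follows. This is only an illustrative sketch, not the actual article-recommender implementation: the function names, thresholds, and the interpretation of the Δ check are assumptions based on the description.

```python
def check_row_count(previous_count, current_count):
    """Check 1: the row count may differ from the previous run by at
    most 1% of the previous count or 1,000 rows, whichever is lower."""
    allowed = min(previous_count * 0.01, 1000)
    return abs(current_count - previous_count) <= allowed


def check_article_delta(recommendation_delta, article_delta):
    """Check 2 (one possible reading): the change in the number of
    recommendations should be within 1% of the change (Δ) in article
    counts for the language pair."""
    return abs(recommendation_delta - article_delta) <= abs(article_delta) * 0.01


def check_score_range(scores):
    """Check 3: every recommendation score must be a floating-point
    number in [0, 1]."""
    return all(isinstance(s, float) and 0.0 <= s <= 1.0 for s in scores)
```

For instance, with the numbers from the example above, `check_row_count(5_690_030, 5_690_500)` passes (difference of 470 rows, under the 1,000-row cap), while `check_row_count(5_690_030, 5_692_000)` fails.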

Details

Related Gerrit Patches:
research/article-recommender : master | Quality check data

Event Timeline

bmansurov triaged this task as Medium priority. Jan 14 2019, 9:33 PM
bmansurov created this task.
bmansurov updated the task description. (Show Details) May 31 2019, 11:30 AM
bmansurov updated the task description. (Show Details) May 31 2019, 11:49 AM
bmansurov claimed this task. Jun 4 2019, 3:56 AM
bmansurov moved this task from Backlog to In Progress on the Recommendation-API board.

Change 517102 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/article-recommender@master] WIP: Quality check data

https://gerrit.wikimedia.org/r/517102

leila edited projects, added Research-Backlog; removed Research. Jul 11 2019, 4:05 PM