
Verify newly generated data before pushing it to production
Open, Medium, Public

Description

Once we generate normalized article scores for production use, we need to verify that the data is not corrupt.

A few automatic checks should be enough for now (see the sketch after this list):

  • Given a language pair, the total number of normalized scores should not differ from the previous run by more than 1% or 1,000 rows, whichever is lower. For example, if today en-uz.tsv has 5,690,030 results, the next time we generate data this number should be between 5,689,030 and 5,691,030. These thresholds are arbitrary, but they are reasonable because we don't expect many new articles to be created that quickly. We could also look at historical numbers and come up with better thresholds.
  • Given a language pair and a date, retrieve the article counts for those language wikis on that date and calculate the difference (say, Δ) between them. Make sure that the number of recommendations is within 1% of Δ.
  • Make sure that the recommendation scores are floating point numbers between 0 and 1.
  • ...
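
A rough, hypothetical sketch of what these checks could look like, assuming the generated files are TSVs with the score in the third column and that the previous run's row counts are stored somewhere. The file name en-uz.tsv comes from the example above; the function names, thresholds as defaults, and column index are illustrative only, not the actual pipeline code.

```
import csv


def count_within_threshold(previous_count, current_count, pct=0.01, absolute=1000):
    """Pass if the new row count differs from the previous one by at most
    1% of the previous count or 1,000 rows, whichever is lower."""
    allowed = min(previous_count * pct, absolute)
    return abs(current_count - previous_count) <= allowed


def recommendations_match_article_gap(recommendation_count, source_articles,
                                      target_articles, pct=0.01):
    """Pass if the number of recommendations is within 1% of the difference
    (delta) in article counts between the source and target wikis."""
    delta = source_articles - target_articles
    return abs(recommendation_count - delta) <= delta * pct


def scores_are_valid(tsv_path, score_column=2):
    """Pass if every recommendation score parses as a float in [0, 1].
    The score column index is an assumption about the TSV layout."""
    with open(tsv_path, newline='') as f:
        for row in csv.reader(f, delimiter='\t'):
            try:
                score = float(row[score_column])
            except (ValueError, IndexError):
                return False
            if not 0.0 <= score <= 1.0:
                return False
    return True


if __name__ == '__main__':
    # The previous count would come from a stored summary of the last run.
    previous_count = 5_690_030
    with open('en-uz.tsv', newline='') as f:
        current_count = sum(1 for _ in f)
    assert count_within_threshold(previous_count, current_count)
    assert scores_are_valid('en-uz.tsv')
```

In practice these checks would run as part of the data generation job, and a failed assertion would block the push to production rather than just raise an error.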

Event Timeline

bmansurov triaged this task as Medium priority. Jan 14 2019, 9:33 PM
bmansurov created this task.

Change 517102 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/article-recommender@master] WIP: Quality check data

https://gerrit.wikimedia.org/r/517102

Change 517102 abandoned by Bmansurov:
[research/article-recommender@master] Quality check data

Reason:

https://gerrit.wikimedia.org/r/517102