Once we generate article normalized scores for production use, we need to verify that the data is not corrupt.
A few automatic checks should do it for now:
- For a given language pair, the total number of normalized scores should not differ from the previous run by more than 1% or 1,000 rows, whichever is lower. For example, if today en-uz.tsv has 5,690,030 results, the next time we generate data this number should be between 5,689,030 and 5,691,030. These thresholds are arbitrary but reasonable, because we don't expect many new articles to be created quickly. We could also look at historical numbers and come up with better thresholds.
- For a given language pair and date, retrieve the article counts for the two language wikis on that date and calculate the difference (say, Δ) between them. Make sure that the number of recommendations is within 1% of Δ.
- Make sure that the recommendation scores are floating-point numbers between 0 and 1, inclusive.
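The checks above could be sketched roughly as follows. This is a minimal illustration, not the production implementation: the function names, the score column index, and the hard-coded thresholds are assumptions, and in practice the row counts and Δ would come from the generated TSV files and the wikis' article-count APIs.

```python
def check_count_stability(new_count, old_count, pct=0.01, abs_cap=1000):
    """Check 1: the new row count must be within min(1% of the old
    count, 1000 rows) of the previous run's count."""
    tolerance = min(int(old_count * pct), abs_cap)
    return abs(new_count - old_count) <= tolerance

def check_against_article_delta(num_recommendations, delta, pct=0.01):
    """Check 2: the number of recommendations must be within 1% of
    the difference (delta) in article counts between the two wikis."""
    return abs(num_recommendations - delta) <= pct * delta

def check_scores_valid(rows, score_col=1):
    """Check 3: every score parses as a float in [0, 1]. `score_col`
    is the assumed position of the score in each TSV row."""
    for row in rows:
        try:
            score = float(row[score_col])
        except (ValueError, IndexError):
            return False
        if not 0.0 <= score <= 1.0:
            return False
    return True

# Example: today's en-uz.tsv has 5,690,030 rows, the previous run had
# 5,690,500. 1% of 5,690,500 is ~56,905, so the cap of 1,000 applies,
# and a difference of 470 passes.
print(check_count_stability(5_690_030, 5_690_500))
```

A real pipeline would likely also log which check failed and for which language pair, so a corrupt file can be traced quickly.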