With every model that gets committed to a model repo (e.g. #rsaas-editquality, #rsaas-articlequality, etc.), we should store the scores it generates for a set of revisions. Using this dataset, we can:
* Detect improvements/regressions in predictions over time
* Get an early warning when something's wrong with the model/environment
* Demonstrate improvement on known poor predictions
Once we have {T160224}, we'll also be able to generate scores historically.
Gist of a plan:
1. Figure out a workflow with the `revscoring score` utility that will store a set of scores in some sane way. (See the first sketch after this list.)
2. Write a `revscoring reflect <score-files>` utility that takes a set of constraints on how much the most recent scores may vary from the previous scores before raising an error. (That error can then be used as a deployment check, and maybe a Travis check too. See the second sketch after this list.)
3. Load historic scores into a public database for auditing and analysis (probably **way** out of scope, but good to think about)
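For step 1, here's a minimal sketch of what "some sane way" might look like: one JSON-lines file per model, where each line records the model name, model version, rev_id, and score document. The file layout, field names, and the `store_scores` helper are all hypothetical, not the actual output format of `revscoring score`; the real workflow would presumably just wrap whatever JSON that utility already emits.

```python
import json


def store_scores(path, model_name, model_version, scores_by_rev):
    """Append one JSON document per revision to a JSON-lines score file.

    `scores_by_rev` maps rev_id -> the score document generated for that
    revision (e.g. {"prediction": ..., "probability": {...}}).
    """
    with open(path, "a") as f:
        for rev_id, score in scores_by_rev.items():
            f.write(json.dumps({"model": model_name,
                                "version": model_version,
                                "rev_id": rev_id,
                                "score": score}) + "\n")


# Example: store the scores produced for two revisions by a hypothetical
# "enwiki.damaging" model at version 0.4.0.
store_scores(
    "enwiki.damaging.scores.jsonl", "enwiki.damaging", "0.4.0",
    {1234: {"prediction": False, "probability": {"true": 0.03, "false": 0.97}},
     5678: {"prediction": True, "probability": {"true": 0.81, "false": 0.19}}})
```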
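For step 2, here's a rough sketch of what `reflect` could check against files in that format: compare the latest scores to the previous ones and exit non-zero if more predictions flipped, or probabilities drifted further, than the constraints allow. The specific constraints (`max_flipped`, `max_mean_drift`), their thresholds, and the CLI shape are invented for illustration; a real implementation would presumably read its constraints from configuration.

```python
import json
import sys


def load_scores(path):
    """Read a JSON-lines score file (as sketched above) into {rev_id: score_doc}."""
    with open(path) as f:
        return {doc["rev_id"]: doc["score"]
                for doc in (json.loads(line) for line in f)}


def reflect(old_scores, new_scores, max_flipped=0.01, max_mean_drift=0.05):
    """Return a list of constraint violations between two score sets.

    Illustrative constraints:
    * max_flipped: max fraction of revisions whose prediction changed
    * max_mean_drift: max mean absolute change in the "true" probability
    """
    shared = set(old_scores) & set(new_scores)
    if not shared:
        return ["no overlapping revisions to compare"]
    flipped = sum(1 for r in shared
                  if old_scores[r]["prediction"] != new_scores[r]["prediction"])
    drift = sum(abs(old_scores[r]["probability"]["true"] -
                    new_scores[r]["probability"]["true"])
                for r in shared)

    violations = []
    if flipped / len(shared) > max_flipped:
        violations.append("{0}/{1} predictions flipped".format(flipped, len(shared)))
    if drift / len(shared) > max_mean_drift:
        violations.append("mean probability drift {0:.3f}".format(drift / len(shared)))
    return violations


if __name__ == "__main__":
    # Hypothetical invocation: python reflect.py <previous-scores> <latest-scores>
    problems = reflect(load_scores(sys.argv[1]), load_scores(sys.argv[2]))
    for problem in problems:
        print("CONSTRAINT VIOLATED:", problem)
    sys.exit(1 if problems else 0)  # non-zero exit fails the deploy/CI check
```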