With every model that gets committed to a model repo (e.g. #rsaas-editquality, #rsaas-articlequality, etc.), we should store the scores it generates for a set of revisions. Using this dataset, we can:
* Detect improvements/regressions in predictions over time
* Get an early warning when something's wrong with the model/environment
* Demonstrate improvement on known poor predictions
Once we have {T160224}, we'll also be able to generate scores historically.
Gist of a plan:
1. Figure out a workflow with the `revscoring score` utility that will store a set of scores in some sane way. (See the first sketch after this list.)
2. Write a `revscoring reflect <score-files>` utility that takes a set of constraints on how much the most recent scores may vary from the previous scores before raising an error. (That error can then be used as a deployment check, and maybe a Travis check too. See the second sketch after this list.)
3. Load historic scores into a public database for auditing and analysis (probably **way** out of scope, but good to think about)
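For step 1, here's a minimal sketch of what "some sane way" might look like: one JSON-lines file per model, where each line records the model name, model version, rev_id, and score document. The file layout, field names, and the `store_scores` helper are all hypothetical, not the actual output format of `revscoring score`; the real workflow would presumably just wrap whatever JSON that utility already emits.

```python
import json


def store_scores(path, model_name, model_version, scores_by_rev):
    """Append one JSON document per revision to a JSON-lines score file.

    `scores_by_rev` maps rev_id -> the score document generated for that
    revision (e.g. {"prediction": ..., "probability": {...}}).
    """
    with open(path, "a") as f:
        for rev_id, score in scores_by_rev.items():
            f.write(json.dumps({"model": model_name,
                                "version": model_version,
                                "rev_id": rev_id,
                                "score": score}) + "\n")


# Example: store the scores produced for two revisions by a hypothetical
# "enwiki.damaging" model at version 0.4.0.
store_scores(
    "enwiki.damaging.scores.jsonl", "enwiki.damaging", "0.4.0",
    {1234: {"prediction": False, "probability": {"true": 0.03, "false": 0.97}},
     5678: {"prediction": True, "probability": {"true": 0.81, "false": 0.19}}})
```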
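For step 2, here's a rough sketch of what `reflect` could check against files in that format: compare the latest scores to the previous ones and exit non-zero if more predictions flipped, or probabilities drifted further, than the constraints allow. The specific constraints (`max_flipped`, `max_mean_drift`), their thresholds, and the CLI shape are invented for illustration; a real implementation would presumably read its constraints from configuration.

```python
import json
import sys


def load_scores(path):
    """Read a JSON-lines score file (as sketched above) into {rev_id: score_doc}."""
    with open(path) as f:
        return {doc["rev_id"]: doc["score"]
                for doc in (json.loads(line) for line in f)}


def reflect(old_scores, new_scores, max_flipped=0.01, max_mean_drift=0.05):
    """Return a list of constraint violations between two score sets.

    Illustrative constraints:
    * max_flipped: max fraction of revisions whose prediction changed
    * max_mean_drift: max mean absolute change in the "true" probability
    """
    shared = set(old_scores) & set(new_scores)
    if not shared:
        return ["no overlapping revisions to compare"]
    flipped = sum(1 for r in shared
                  if old_scores[r]["prediction"] != new_scores[r]["prediction"])
    drift = sum(abs(old_scores[r]["probability"]["true"] -
                    new_scores[r]["probability"]["true"])
                for r in shared)

    violations = []
    if flipped / len(shared) > max_flipped:
        violations.append("{0}/{1} predictions flipped".format(flipped, len(shared)))
    if drift / len(shared) > max_mean_drift:
        violations.append("mean probability drift {0:.3f}".format(drift / len(shared)))
    return violations


if __name__ == "__main__":
    # Hypothetical invocation: python reflect.py <previous-scores> <latest-scores>
    problems = reflect(load_scores(sys.argv[1]), load_scores(sys.argv[2]))
    for problem in problems:
        print("CONSTRAINT VIOLATED:", problem)
    sys.exit(1 if problems else 0)  # non-zero exit fails the deploy/CI check
```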