For every model that gets committed to a model repo (e.g. editquality-modeling, articlequality-modeling, etc.), we should store the scores it generates for a fixed set of revisions. Using this dataset, we can:
- Detect improvements/regressions in predictions over time
- Get an early warning when something's wrong with the model or environment
- Demonstrate improvement on known poor predictions
Once we have T160224 ("Store docker images in a repo that replicate the train/test/deploy environment for models"), we'll also be able to generate scores historically.
Gist of a plan:
- Figure out a workflow around the revscoring score utility that stores a set of scores in some sane way (a sketch of one possible file format follows this list).
- Compare scores across model changes to get a sense of which kinds of score changes are acceptable.
- Write a revscoring reflect <score-files> utility that takes a set of constraints on how far the most recent scores may vary from the previous scores before raising an error. (The error can then be used as a deployment check, and maybe a Travis check too; see the second sketch below.)
- Load historic scores into a public database for auditing and analysis (probably way out of scope, but good to think about)
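To make the first step concrete, here is a minimal sketch of one "sane way" to store a score set: newline-delimited JSON with a header record carrying model metadata, followed by one record per revision. The file layout and the store_scores helper are illustrative assumptions, not anything revscoring currently provides.

```python
import json

def store_scores(path, model_name, model_version, scores):
    """Write one score file per model commit as newline-delimited JSON.

    scores: iterable of (rev_id, score_doc) pairs, where score_doc is
    the JSON-able score document the model produces.
    """
    with open(path, "w") as f:
        # Header record: which model and version produced this score set.
        f.write(json.dumps({"model": model_name,
                            "version": model_version}) + "\n")
        # One record per revision.
        for rev_id, score_doc in scores:
            f.write(json.dumps({"rev_id": rev_id,
                                "score": score_doc}) + "\n")

# Hypothetical usage, with a score document shaped like ORES output:
# store_scores("scores/enwiki.damaging.ndjson", "enwiki.damaging", "0.3.0",
#              [(12345, {"prediction": False,
#                        "probability": {"false": 0.97, "true": 0.03}})])
```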
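And a rough sketch of the kind of constraint that revscoring reflect could enforce: compare the newest score file against the previous one and raise an error when too many predictions flip. The constraint (a maximum flip rate) and all function names here are hypothetical; real constraints would probably also need to cover probability drift, not just class flips.

```python
import json

def load_scores(path):
    """Read a score file written by store_scores() into {rev_id: score}."""
    with open(path) as f:
        header = json.loads(next(f))
        scores = {rec["rev_id"]: rec["score"]
                  for rec in (json.loads(line) for line in f)}
    return header, scores

def check_flip_rate(old_path, new_path, max_flip_rate=0.05):
    """Raise if too many shared revisions changed their predicted class."""
    _, old = load_scores(old_path)
    _, new = load_scores(new_path)
    shared = old.keys() & new.keys()
    flips = sum(1 for rev_id in shared
                if old[rev_id]["prediction"] != new[rev_id]["prediction"])
    flip_rate = flips / len(shared) if shared else 0.0
    if flip_rate > max_flip_rate:
        raise RuntimeError(
            "Prediction flip rate {:.1%} exceeds limit {:.1%}".format(
                flip_rate, max_flip_rate))
    return flip_rate

# A deployment (or Travis) check would just call check_flip_rate() and
# let the raised error fail the build.
```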