The Makefile needs to cut a sample and train on that.
I'm a fan of a stratified sampling strategy. We should balance the # of OK and !OK observations for training and then use the new revscoring "population rates" parameter to make sure that the test statistics reflect the real-world rates of each class.
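To make the "population rates" idea concrete, here is a minimal sketch of how a statistic measured on a balanced test set can be re-weighted to reflect real-world class rates. This is not revscoring's implementation; the function name `reweighted_precision` and the example rates are hypothetical and chosen only for illustration.

```python
from collections import Counter


def reweighted_precision(pairs, population_rates, target):
    """Precision for `target`, with each observation weighted so that the
    mix of actual labels matches `population_rates` rather than the sample.

    `pairs` is a list of (actual_label, predicted_label) tuples.
    NOTE: illustrative only, not the revscoring API.
    """
    sample_counts = Counter(actual for actual, _ in pairs)
    total = sum(sample_counts.values())
    # Weight per actual label: real-world rate / rate within the sample.
    weights = {label: population_rates[label] / (count / total)
               for label, count in sample_counts.items()}
    true_pos = sum(weights[a] for a, p in pairs if p == target and a == target)
    predicted = sum(weights[a] for a, p in pairs if p == target)
    return true_pos / predicted if predicted else float("nan")


# Hypothetical balanced test set: 100 spam and 100 OK observations.
pairs = ([("spam", "spam")] * 90 + [("spam", "OK")] * 10 +
         [("OK", "spam")] * 10 + [("OK", "OK")] * 90)
balanced = reweighted_precision(pairs, {"OK": 0.5, "spam": 0.5}, "spam")
realistic = reweighted_precision(pairs, {"OK": 0.97, "spam": 0.03}, "spam")
print(balanced, realistic)
```

With made-up rates like these, the same predictions look far less precise once OK observations count for as much as they do in the real population, which is the reason for not reporting raw balanced-set statistics.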
Sounds like territory where we would want a dedicated data set balancing utility, which tallies the label counts, creates shuffled and balanced sets, then records some information about the general population rates. Where should we store that info?
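A rough sketch of what such a utility could look like, assuming each observation is a dict with a `draft_quality` field as in `initial_sample.json`. The name `balance_ok` and its parameters are hypothetical, not an existing tool.

```python
import random
from collections import Counter


def balance_ok(observations, label_field="draft_quality", ok_label="OK", seed=0):
    """Tally labels, then keep every non-OK observation plus an equally
    sized random sample of OK observations, shuffled together.

    Returns (balanced_observations, label_counts, population_rates).
    NOTE: hypothetical helper for illustration only.
    """
    counts = Counter(ob[label_field] for ob in observations)
    total = sum(counts.values())
    rates = {label: n / total for label, n in counts.items()}

    ok = [ob for ob in observations if ob[label_field] == ok_label]
    not_ok = [ob for ob in observations if ob[label_field] != ok_label]

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled_ok = rng.sample(ok, min(len(ok), len(not_ok)))
    balanced = sampled_ok + not_ok
    rng.shuffle(balanced)
    return balanced, dict(counts), rates


# Tiny made-up data set: 90 OK, 5 spam, 3 vandalism, 2 attack.
observations = ([{"draft_quality": "OK"}] * 90 +
                [{"draft_quality": "spam"}] * 5 +
                [{"draft_quality": "vandalism"}] * 3 +
                [{"draft_quality": "attack"}] * 2)
balanced, counts, rates = balance_ok(observations)
print(len(balanced), counts, rates)
```

The counts and rates come back alongside the balanced set so that whatever calls the utility can decide where to write them down.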
Hmm... So far we have been getting away with storing that data in the Makefile itself. In this case, I'd do a

    cat initial_sample.json | \
      json2tsv draft_quality | \
      sort | uniq -c > \
      initial_sample.population_counts.tsv

    (grep '"draft_quality": "OK"' initial_sample.json | shuf -n <number>; \
     grep '"draft_quality": "OK"' -v initial_sample.json) > balanced_sample.json

We could check in population_counts so that the #s would be documented.
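If the counts file is checked in, turning it into rates is a few lines. This sketch assumes the file holds plain `uniq -c` output, i.e. a count followed by a bare label on each line; the function name `read_population_rates` is hypothetical, and the sample lines are made up.

```python
def read_population_rates(lines):
    """Parse `uniq -c` style lines ("<count> <label>") into label -> rate.

    NOTE: assumes one bare label per line; adjust if json2tsv quotes values.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        count, label = line.split(None, 1)
        counts[label] = int(count)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Made-up counts in the shape that `sort | uniq -c` produces.
example_lines = ["    900 OK\n", "     60 spam\n",
                 "     30 vandalism\n", "     10 attack\n"]
example_rates = read_population_rates(example_lines)
print(example_rates)
```

In practice the lines would come from something like `open("initial_sample.population_counts.tsv")`.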
So, the latest version of the draftquality model implements a balanced training set (see https://github.com/wiki-ai/draftquality/blob/master/Makefile#L26). This trims the dataset down to 50k observations.
I'm getting slightly lower overall fitness levels (which could be expected). Maybe we should try again with 100k observations. That would give us 3x as many "OK" observations as !OK observations.
roc_auc (micro=0.979, macro=0.948):
    vandalism    spam    OK    attack
    -----------  ------  ----  --------
    0.918        0.965   0.98  0.927
roc_auc (micro=0.979, macro=0.962):
    attack    vandalism    spam    OK
    --------  -----------  ------  -----
    0.954     0.942        0.975   0.979
Better fitness here! But still not quite as good as the larger set. It seems like we get a benefit from the massive number of observations in the original training set (907k observations). Maybe we should try bumping the observations up to 200k to see what happens. Why not try?
roc_auc (micro=0.979, macro=0.97):
    vandalism    attack    spam    OK
    -----------  --------  ------  -----
    0.954        0.968     0.979   0.979
A little bit of a jump in fitness for attack and vandalism. Spam and OK (most common classes) seem to be the same. I think this is a victory and we should continue from here.
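As a sanity check on reading these numbers: the macro figure is the unweighted mean of the four per-class values, which is why gains on the rare classes (attack, vandalism) move it while micro stays at 0.979. A quick sketch using only the scores posted above, in the order they were posted:

```python
from statistics import mean

# (reported macro roc_auc, per-class roc_auc) for the three results above.
reported = [
    (0.948, {"vandalism": 0.918, "spam": 0.965, "OK": 0.98, "attack": 0.927}),
    (0.962, {"attack": 0.954, "vandalism": 0.942, "spam": 0.975, "OK": 0.979}),
    (0.97, {"vandalism": 0.954, "attack": 0.968, "spam": 0.979, "OK": 0.979}),
]

# Unweighted mean over classes, to compare against the reported macro value.
macro_means = [mean(per_class.values()) for _, per_class in reported]

for (macro, _), computed in zip(reported, macro_means):
    print(f"reported macro={macro}  mean of per-class={computed:.4f}")
```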