
[Discuss] draftquality on a sample, humongous everything, or something else?
Closed, Resolved, Public

Description

The Makefile needs to cut a sample of the labeled data and train on that, rather than on the humongous full dataset.

Event Timeline

Halfak renamed this task from "draftquality should be trained on a sample, rather than humongous everything" to "[Discuss] draftquality on a sample, humongous everything, or something else?". Jun 29 2017, 3:04 PM
Halfak added a project: draftquality-modeling.
Halfak updated the task description.

@Halfak, make notes about what options there are.

I'm a fan of a stratified sampling strategy. We should balance the # of OK and !OK observations for training and then use the new revscoring "population rates" parameter to make sure that the test statistics reflect the real-world rates of each class.

Sounds like territory where we'd want a dedicated dataset-balancing utility: one that tallies the label counts, creates shuffled and balanced sets, and records some information about the general population rates. Where should we store that info?

Hmm... So far we have been getting away with storing that data in the Makefile itself. In this case, I'd do a

# Tally how many observations carry each draft_quality label
cat initial_sample.json | \
  json2tsv draft_quality | \
  sort | uniq -c > \
  initial_sample.population_counts.tsv

and then

(grep '"draft_quality": "OK"' initial_sample.json | shuf -n <number>; \
 grep '"draft_quality": "OK"' -v initial_sample.json) > balanced_sample.json

We could check in the population_counts file so that the numbers would be documented.
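And if we ever need rates rather than raw counts, something like this would do it (just a sketch: the output filename is made up, and it assumes the count-then-label format that uniq -c emits):

# Convert the checked-in label counts into population rates.
total=$(awk '{sum += $1} END {print sum}' initial_sample.population_counts.tsv)
awk -v total="$total" '{printf "%s\t%.6f\n", $2, $1 / total}' \
  initial_sample.population_counts.tsv > initial_sample.population_rates.tsv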

So, the latest version of the draftquality model implements a balanced training set. See https://github.com/wiki-ai/draftquality/blob/master/Makefile#L26 (this trims the dataset down to 50k observations).
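Roughly, the rule looks like this. This is a sketch based on the pipeline above, not the literal rule from the linked Makefile; the ~24k "OK" sample size and the file paths are illustrative:

# Sample "OK" down to roughly the size of the !OK classes (~26.1k),
# then shuffle so the classes are interleaved (~50k total).
datasets/balanced_sample.json: datasets/initial_sample.json
	(grep '"draft_quality": "OK"' $< | shuf -n 24000; \
	 grep -v '"draft_quality": "OK"' $<) | shuf > $@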

I'm getting slightly lower overall fitness levels (which could be expected). Maybe we should try again with 100k observations. That would give us roughly 3x as many "OK" observations as !OK observations.

roc_auc (micro=0.979, macro=0.948):
  vandalism    spam    OK    attack
-----------  ------  ----  --------
      0.918   0.965  0.98     0.927

OK, I'm trying again with ~100k observations. I just increased the "OK" sample to 75k and left the !OK set at 26.1k, so it's a little over 100k. Meh.
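Concretely, that's just a change to the shuf count in the pipeline above (same assumed filenames):

# Sample 75k "OK" observations; all ~26.1k !OK observations pass through.
(grep '"draft_quality": "OK"' initial_sample.json | shuf -n 75000; \
 grep -v '"draft_quality": "OK"' initial_sample.json) > balanced_sample.json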

roc_auc (micro=0.979, macro=0.962):
  attack    vandalism    spam     OK
--------  -----------  ------  -----
   0.954        0.942   0.975  0.979

Better fitness here! But still not quite as good as with the full set. It seems like we get a benefit from the massive number of observations in the original training set (907k). Maybe we should try bumping the sample up to 200k to see what happens. Why not try?

roc_auc (micro=0.979, macro=0.97):
  vandalism    attack    spam     OK
-----------  --------  ------  -----
      0.954     0.968   0.979  0.979

A little bit of a jump in fitness for attack and vandalism. Spam and OK (the most common classes) stay about the same. I think this is a victory and we should continue from here.