Complete beta version of pcfg_scorer and approximate the overhead, i.e. the size of:

  1. Pickled PCFGScorer object
  2. CountFiles used to adequately train PCFGScorer objects.
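
One way to approximate these two sizes is sketched below. This is a minimal sketch, not the project's code: a plain dict of rule counts stands in for a trained PCFGScorer, and the helper names and the demo counts-file contents are hypothetical.

```python
import os
import pickle
import tempfile


def pickled_size_bytes(obj):
    """Return the number of bytes the object occupies once pickled."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def file_size_bytes(path):
    """Return the on-disk size of a file, e.g. a counts file."""
    return os.path.getsize(path)


def write_demo_counts_file(lines):
    """Write demo lines to a temporary file and return its path.

    The line layout is made up for illustration; it is NOT the real
    counts-file format.
    """
    data = ("\n".join(lines) + "\n").encode("utf-8")
    with tempfile.NamedTemporaryFile("wb", suffix=".counts", delete=False) as fh:
        fh.write(data)
        return fh.name


if __name__ == "__main__":
    # Hypothetical stand-in for the state of a trained scorer.
    toy_scorer_state = {("S", ("NP", "VP")): 3, ("NP", ("DT", "NN")): 2}
    print("pickled bytes:", pickled_size_bytes(toy_scorer_state))

    path = write_demo_counts_file(["3 S NP VP", "2 NP DT NN"])
    print("counts file bytes:", file_size_bytes(path))
    os.remove(path)
```

Measuring the real pickled object the same way (via `len(pickle.dumps(...))`) avoids writing it to disk just to check its size.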

PCFG object beta complete

The object has both a parser and a scorer.

Proposed future strategy:

Use the Penn Treebank or some other large public treebank to train a generic PCFG *parser*. In particular, need to get a counts file like

Use our trained parser to parse WP revisions (regular and vandalous) to get two more *counts files*, and use those to train two PCFG *scorers*, p_vandal and p_regular. Looks like the tokenizer will be straightforward (thanks Aaron:
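
The step from parsed text to counts to a trained grammar might look roughly like the sketch below. It assumes parse trees are given as nested tuples `(label, child, ...)` with plain strings as leaves, and it uses plain maximum-likelihood estimates; the tree representation and the function names are illustrative assumptions, not the format or API used by pcfg_scorer.

```python
from collections import defaultdict


def count_productions(tree, counts):
    """Add every production used in `tree` to `counts`.

    `tree` is (label, child, child, ...); a child is either a string
    (a terminal) or another tree of the same shape.
    """
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            count_productions(child, counts)


def rule_probabilities(counts):
    """Maximum-likelihood estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {(lhs, rhs): n / lhs_totals[lhs] for (lhs, rhs), n in counts.items()}


if __name__ == "__main__":
    # Two toy parses standing in for parsed revisions.
    parses = [
        ("S", ("NP", "dogs"), ("VP", "bark")),
        ("S", ("NP", "cats"), ("VP", "sleep")),
    ]
    counts = defaultdict(int)
    for parse in parses:
        count_productions(parse, counts)
    for rule, prob in sorted(rule_probabilities(counts).items()):
        print(rule, prob)
```

Running the counting step once over regular revisions and once over vandalous ones would give the two separate count tables from which p_regular and p_vandal are estimated.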

Add features, for example:

\min_{s \in revision} \log p_{vandal}(s) - \min_{s \in revision} \log p_{regular}(s)
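
The feature above can be sketched as follows. This is a minimal sketch under stated assumptions: the two scorers are passed in as plain callables that map a sentence to a strictly positive probability (e.g. after smoothing), and a revision is a non-empty list of sentences. The function names are hypothetical.

```python
import math


def min_log_prob(sentences, prob):
    """Return the lowest log-probability of any sentence under `prob`.

    Assumes `sentences` is non-empty and prob(s) > 0 for every sentence;
    otherwise min() or math.log() raises ValueError.
    """
    return min(math.log(prob(s)) for s in sentences)


def vandal_regular_feature(sentences, p_vandal, p_regular):
    """min_s log p_vandal(s) - min_s log p_regular(s) over one revision."""
    return min_log_prob(sentences, p_vandal) - min_log_prob(sentences, p_regular)


if __name__ == "__main__":
    # Toy lookup tables standing in for the two trained scorers.
    toy_vandal = {"a": 0.5, "b": 0.1}
    toy_regular = {"a": 0.25, "b": 0.2}
    feature = vandal_regular_feature(
        ["a", "b"], toy_vandal.__getitem__, toy_regular.__getitem__
    )
    print(feature)
```

Note that the two minima are taken independently, so they may be attained by different sentences in the same revision.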

