Complete a beta version of pcfg_scorer and estimate the approximate size of:
- Pickled PCFGScorer object
- CountFiles used to adequately train PCFGScorer objects.
The beta version of the PCFG object is complete:
https://github.com/aetilley/pcfg
The object has both a parser and a scorer.
Proposed future strategy:
Use the Penn Treebank or another large public treebank to train a generic PCFG *parser*. In particular, we need a counts file like
https://github.com/usami/pcfg/blob/master/counts_file.sample
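As a rough illustration of what producing a counts file involves, here is a minimal sketch that tallies rule counts from parse trees. It assumes trees are already in Chomsky normal form and are represented as nested tuples; the names count_rules and format_counts, the tuple representation, and the "count TYPE symbols" line layout are all hypothetical and are not confirmed to match the linked sample file.

```python
from collections import Counter


def count_rules(tree, counts=None):
    """Accumulate nonterminal and rule counts from one parse tree.

    Assumed (hypothetical) tree shape, in Chomsky normal form:
      (label, word)         for a lexical rule, where word is a str
      (label, left, right)  for a binary rule, where left/right are trees
    """
    if counts is None:
        counts = Counter()
    label = tree[0]
    counts[("NONTERMINAL", label)] += 1
    if len(tree) == 2 and isinstance(tree[1], str):
        # Lexical rule: label -> word
        counts[("UNARYRULE", label, tree[1])] += 1
    elif len(tree) == 3:
        # Binary rule: label -> left_label right_label
        left, right = tree[1], tree[2]
        counts[("BINARYRULE", label, left[0], right[0])] += 1
        count_rules(left, counts)
        count_rules(right, counts)
    else:
        raise ValueError(f"tree is not in the assumed CNF shape: {tree!r}")
    return counts


def format_counts(counts):
    """Render counts as text lines; this layout is an assumption."""
    return [" ".join([str(n), *key]) for key, n in sorted(counts.items())]


# Toy example: one tiny tree.
toy_tree = ("S", ("NP", "dogs"), ("VP", "bark"))
toy_counts = count_rules(toy_tree)
for line in format_counts(toy_counts):
    print(line)
```

To build counts for a whole treebank, the same Counter can be passed to count_rules for every tree in turn, so the totals accumulate across trees.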
Use our trained parser to parse WP revisions (regular and vandalous) to get *two more counts files*, and use those to train two PCFG *scorers*, p_vandal and p_regular. The tokenizer looks like it will be straightforward (thanks, Aaron):
https://gist.github.com/halfak/1620beae124716504cba
Add features such as, say,
min_{s \in revision}(log(p_{vandal}(s))) - min_{s \in revision}(log(p_{regular}(s)))
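The feature above could be computed along these lines. This is only a sketch: it assumes each scorer is a callable that maps a sentence to a probability, which is not necessarily the interface of the PCFG object in the repository. The function names, the probability floor used to avoid log(0), and the 0.0 returned for an empty revision are illustrative assumptions.

```python
import math

# Assumed floor so that a zero probability does not break math.log.
PROB_FLOOR = 1e-300


def min_log_prob(sentences, score):
    """Lowest log-probability that `score` assigns to any sentence."""
    return min(math.log(max(score(s), PROB_FLOOR)) for s in sentences)


def vandal_regular_feature(sentences, p_vandal, p_regular):
    """min_s log p_vandal(s) - min_s log p_regular(s) over one revision.

    `p_vandal` and `p_regular` are hypothetical callables returning a
    probability for a sentence.
    """
    if not sentences:
        return 0.0  # assumption: an empty revision gets a neutral value
    return min_log_prob(sentences, p_vandal) - min_log_prob(sentences, p_regular)


# Toy scorers backed by lookup tables, for illustration only.
toy_vandal = {"a": 0.5, "b": 0.1}.get
toy_regular = {"a": 0.25, "b": 0.2}.get

feature = vandal_regular_feature(["a", "b"], toy_vandal, toy_regular)
print(feature)
```

Note that the two minima may be attained at different sentences of the revision, exactly as in the formula.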