Complete beta version of pcfg_scorer and approximate overhead
Closed, Resolved (Public)

Description

Complete a beta version of pcfg_scorer and estimate the approximate size of:

  1. Pickled PCFGScorer object
  2. CountFiles used to adequately train PCFGScorer objects.
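
One rough way to approximate both sizes is sketched below. The helper names are illustrative, and `toy_scorer` is only a hypothetical stand-in for a trained PCFGScorer (a plain table of rule probabilities), so real numbers would have to come from the actual object and counts files.

```python
import os
import pickle


def pickled_size_bytes(obj):
    """Return the number of bytes pickle needs to serialize obj."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def file_size_bytes(path):
    """Return the on-disk size of a file, e.g. a counts file."""
    return os.path.getsize(path)


# Hypothetical stand-in for a trained scorer (assumption, not the real class).
toy_scorer = {
    ("S", ("NP", "VP")): 0.9,
    ("NP", ("DT", "NN")): 0.4,
}

print("pickled scorer size (bytes):", pickled_size_bytes(toy_scorer))
```

Measuring the serialized bytes directly avoids writing a temporary file, while `os.path.getsize` covers counts files that already exist on disk.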

Event Timeline

aetilley claimed this task.
aetilley raised the priority of this task from to Needs Triage.
aetilley updated the task description.
aetilley added subscribers: aetilley, Halfak, Ladsgroup.

PCFG object beta complete

https://github.com/aetilley/pcfg

Object has both a parser and a scorer.

Proposed future strategy:

Use the Penn Treebank or another large public treebank to train a generic PCFG *parser*. In particular, we need a counts file like

https://github.com/usami/pcfg/blob/master/counts_file.sample
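
As a sketch of how such a counts file could turn into rule probabilities: the code below assumes each line has the form `<count> NONTERMINAL X`, `<count> UNARYRULE X word`, or `<count> BINARYRULE X Y Z`. That layout is an assumption about the sample file, not something confirmed here, and `rule_probabilities` is a hypothetical helper. It computes the maximum-likelihood estimate count(rule) / count(left-hand side).

```python
from collections import defaultdict


def rule_probabilities(lines):
    """Estimate q(X -> rhs) = count(X -> rhs) / count(X) from counts lines."""
    nonterminal_counts = defaultdict(int)
    rule_counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        count, kind, args = int(fields[0]), fields[1], tuple(fields[2:])
        if kind == "NONTERMINAL":
            nonterminal_counts[args[0]] += count
        elif kind in ("UNARYRULE", "BINARYRULE"):
            rule_counts[args] += count
    probabilities = {}
    for rule, count in rule_counts.items():
        lhs_total = nonterminal_counts.get(rule[0], 0)
        if lhs_total > 0:
            probabilities[rule] = count / lhs_total
    return probabilities


# Made-up toy counts in the assumed format.
sample_lines = [
    "4 NONTERMINAL S",
    "4 BINARYRULE S NP VP",
    "4 NONTERMINAL NP",
    "3 UNARYRULE NP dogs",
    "1 UNARYRULE NP cats",
]
probs = rule_probabilities(sample_lines)
print(probs[("NP", "dogs")])
```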

Use our trained parser to parse WP revisions (both regular and vandalous) to obtain *two more counts files*, which will train two PCFG *scorers*, p_vandal and p_regular. The tokenizer looks straightforward (thanks, Aaron):
https://gist.github.com/halfak/1620beae124716504cba
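
A minimal sketch of the counting step, under stated assumptions: parse trees are represented as nested tuples in binarized form, `(label, word)` for a preterminal and `(label, left, right)` otherwise, and the output lines use a `<count> KIND args...` layout. Neither the tree representation nor the helper names come from the pcfg repository; they are illustrative only. Running this once over parses of regular revisions and once over parses of vandalous ones would give the two counts files.

```python
from collections import Counter


def count_rules(tree, counts=None):
    """Accumulate nonterminal and rule counts from one binarized parse tree."""
    if counts is None:
        counts = Counter()
    label = tree[0]
    counts[("NONTERMINAL", label)] += 1
    if len(tree) == 2 and isinstance(tree[1], str):
        # Preterminal node: label -> word
        counts[("UNARYRULE", label, tree[1])] += 1
    elif len(tree) == 3:
        left, right = tree[1], tree[2]
        counts[("BINARYRULE", label, left[0], right[0])] += 1
        count_rules(left, counts)
        count_rules(right, counts)
    else:
        raise ValueError("tree is not in the assumed binarized form")
    return counts


def format_counts(counts):
    """Render counts as sorted lines of the form '<count> KIND args...'."""
    return sorted(f"{n} {' '.join(key)}" for key, n in counts.items())


# Toy parse of a two-word sentence (made-up example).
toy_tree = ("S", ("NP", "dogs"), ("VP", "bark"))
toy_counts = count_rules(toy_tree)
print("\n".join(format_counts(toy_counts)))
```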

Add features such as:

min_{s \in revision}(log(p_{vandal}(s))) - min_{s \in revision}(log(p_{regular}(s)))
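
The feature above could be computed along these lines. This is a sketch with assumptions: each scorer is modeled as a plain callable that maps a sentence to a probability, `floor` is an invented guard against log(0), and an empty revision is mapped to 0.0 by convention. None of these choices come from the pcfg code.

```python
import math


def min_log_prob(sentences, prob, floor=1e-300):
    """Return the minimum over sentences of log(prob(s)), flooring at `floor`."""
    return min(math.log(max(prob(s), floor)) for s in sentences)


def vandal_feature(sentences, p_vandal, p_regular):
    """min_s log p_vandal(s) - min_s log p_regular(s) for one revision."""
    if not sentences:
        return 0.0  # assumed convention for revisions with no sentences
    return min_log_prob(sentences, p_vandal) - min_log_prob(sentences, p_regular)


# Toy scorers backed by lookup tables (made-up probabilities).
vandal_table = {"a": 0.5, "b": 0.25}
regular_table = {"a": 0.125, "b": 0.5}
feature = vandal_feature(["a", "b"], vandal_table.get, regular_table.get)
print(feature)
```

Note that the two minima may be attained at different sentences, so the feature compares the least likely sentence under each model rather than scoring a single sentence twice.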

Halfak set Security to None.