
Complete beta version of pcfg_scorer and approximate overhead
Closed, Resolved · Public

Description

Complete the beta version of pcfg_scorer and approximate the size of:

  1. Pickled PCFGScorer object
  2. CountFiles used to adequately train PCFGScorer objects.

Event Timeline

aetilley created this task. Dec 11 2015, 7:06 PM
aetilley updated the task description.
aetilley raised the priority of this task to Needs Triage.
aetilley claimed this task.
aetilley added subscribers: aetilley, Halfak, Ladsgroup.
Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 11 2015, 7:06 PM

PCFG object beta complete

https://github.com/aetilley/pcfg

Object has both a parser and a scorer.

Proposed future strategy:

Use the Penn Treebank or some other large public treebank to train a generic PCFG *parser*. In particular, we need to get a counts file like

https://github.com/usami/pcfg/blob/master/counts_file.sample
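A minimal sketch of how such a counts file could be produced from parsed trees. It assumes trees are already in Chomsky normal form and given as nested Python lists, and it assumes a line format of `<count> NONTERMINAL X`, `<count> UNARYRULE X word` and `<count> BINARYRULE X Y Z`. Both the tree representation and the line format are assumptions for illustration and should be checked against the linked sample; `count_rules` and `format_counts` are hypothetical names, not part of either repository.

```python
from collections import Counter


def count_rules(tree, counts=None):
    """Accumulate rule counts from one CNF tree given as nested lists.

    A preterminal node looks like ["NP", "dogs"]; an internal node looks
    like ["S", left_subtree, right_subtree]. (Assumed representation.)
    """
    if counts is None:
        counts = Counter()
    label = tree[0]
    counts[("NONTERMINAL", label)] += 1
    if len(tree) == 2 and isinstance(tree[1], str):
        # Preterminal: X -> word
        counts[("UNARYRULE", label, tree[1])] += 1
    elif len(tree) == 3:
        # Binary rule: X -> Y Z, then recurse into both children
        left, right = tree[1], tree[2]
        counts[("BINARYRULE", label, left[0], right[0])] += 1
        count_rules(left, counts)
        count_rules(right, counts)
    else:
        raise ValueError(f"tree is not in Chomsky normal form: {tree!r}")
    return counts


def format_counts(counts):
    """Render counts as '<count> <TYPE> <symbols...>' lines (assumed format)."""
    return [f"{n} {' '.join(key)}" for key, n in sorted(counts.items())]


tree = ["S", ["NP", "dogs"], ["VP", "bark"]]
counts = count_rules(tree)
lines = format_counts(counts)
```

To build counts for a whole treebank, the same `Counter` can be passed to `count_rules` for every tree before calling `format_counts` once at the end.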

Use our trained parser to parse WP revisions (regular and vandalous) in order to get *two more* counts files to train two PCFG *scorers*, p_vandal and p_regular. It looks like the tokenizer will be straightforward (thanks, Aaron:
https://gist.github.com/halfak/1620beae124716504cba)

Add features for, say,

min_{s \in revision}(log(p_{vandal}(s))) - min_{s \in revision}(log(p_{regular}(s)))
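A rough sketch of how the expression above could be computed. It assumes each scorer is available as a plain callable that maps a sentence to a probability; this is a stand-in, not the actual interface of the PCFG object. The probability floor `PROB_FLOOR` is also an assumption, added only so that a sentence scored as 0 does not make `math.log` fail.

```python
import math

# Hypothetical floor so that log(0) is never taken; the value is arbitrary.
PROB_FLOOR = 1e-300


def min_log_prob(sentences, prob):
    """Return the smallest log-probability over the sentences of a revision."""
    return min(math.log(max(prob(s), PROB_FLOOR)) for s in sentences)


def vandalism_feature(sentences, p_vandal, p_regular):
    """min_s log p_vandal(s) - min_s log p_regular(s) for one revision.

    `sentences` must be non-empty; min() raises ValueError otherwise.
    """
    return min_log_prob(sentences, p_vandal) - min_log_prob(sentences, p_regular)


# Toy scorers standing in for trained PCFG scorers (made-up probabilities).
toy_vandal = {"a": 0.5, "b": 0.25}
toy_regular = {"a": 0.125, "b": 0.5}

feature = vandalism_feature(["a", "b"], toy_vandal.get, toy_regular.get)
```

Note that the two minima may be reached at different sentences of the same revision, which is what the expression as written implies.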

Halfak closed this task as Resolved. Jan 21 2016, 3:42 PM
Halfak set Security to None.