Determine how to build WP phrase-structure tree-bank.
Closed, Resolved, Public

Description

Tree-banks (parses) of Wikipedia sentences, both regular and vandalous, are needed to train probabilistic context-free grammars (PCFGs), one per class. These PCFGs will score the sentences of new edits against each Wiki-grammar (regular and vandalous).

The general strategy is to train a PCFG on a general-purpose tree-bank corpus (say, the Penn Treebank, e.g. its WSJ section) and use it to parse Wikipedia sentences.
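As a rough illustration of the training and scoring steps, here is a minimal sketch that estimates rule probabilities by relative frequency from bracketed trees and then scores a tree. The toy trees and the helper names (`parse_tree`, `estimate_pcfg`, `tree_logprob`) are made up for this example; a real run would read trees from an actual tree-bank and would need a parser (e.g. CKY) to produce trees for new sentences, which is not shown.

```python
import math
from collections import Counter

# Toy stand-in for a real tree-bank (hypothetical data, Penn-style brackets).
TOY_TREEBANK = [
    "(S (NP (NNP Alice)) (VP (VBZ sleeps)))",
    "(S (NP (NNP Bob)) (VP (VBZ sleeps)))",
    "(S (NP (NNP Alice)) (VP (VBZ reads) (NP (NNS books))))",
]

def parse_tree(s):
    """Turn a bracketed string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a terminal (word)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()

def productions(tree):
    """Yield every rule (lhs, rhs) used in the tree, top-down."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for c in children:
        if isinstance(c, tuple):
            yield from productions(c)

def estimate_pcfg(trees):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    counts = Counter(p for t in trees for p in productions(t))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {p: n / lhs_totals[p[0]] for p, n in counts.items()}

def tree_logprob(tree, pcfg):
    """Log-probability of a tree; -inf if it uses a rule never seen in training."""
    total = 0.0
    for p in productions(tree):
        prob = pcfg.get(p, 0.0)
        if prob == 0.0:
            return float("-inf")
        total += math.log(prob)
    return total

pcfg = estimate_pcfg([parse_tree(s) for s in TOY_TREEBANK])
seen = parse_tree("(S (NP (NNP Bob)) (VP (VBZ sleeps)))")
unseen = parse_tree("(S (NP (NNP Carol)) (VP (VBZ sleeps)))")
print(tree_logprob(seen, pcfg), tree_logprob(unseen, pcfg))
```

The second tree scores `-inf` because the word "Carol" never occurs in the toy training data, which is exactly the unknown-word problem listed below.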

The two main issues will be:

  1. Determining how to deal with markup/wikitext.
  2. Determining how to deal with unknown words / proper noun phrases.
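For the unknown-word issue, one common approach (not necessarily the one this task will settle on) is to replace out-of-vocabulary tokens with coarse "signature" classes before parsing, so the grammar has some probability mass for them. The sketch below assumes tokenized input; the class names (`UNK-CAP`, `UNK-NUM`, ...) and the `min_count` threshold are arbitrary choices for illustration. Multi-word proper noun phrases would need more than this, e.g. collapsing them into a single token first.

```python
import re
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep only tokens seen at least min_count times in the training data."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}

def unknown_signature(token):
    """Map an out-of-vocabulary token to a coarse class (illustrative rules only)."""
    if re.fullmatch(r"[\d.,]+", token) and any(ch.isdigit() for ch in token):
        return "UNK-NUM"
    if token[:1].isupper():
        return "UNK-CAP"   # likely a proper noun
    if token.endswith("ing"):
        return "UNK-ing"
    if token.endswith("ly"):
        return "UNK-ly"
    return "UNK"

def normalize(sentence, vocab):
    """Replace every token outside the vocabulary with its signature class."""
    return [tok if tok in vocab else unknown_signature(tok) for tok in sentence]

training = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
vocab = build_vocab(training, min_count=2)
print(normalize(["the", "Phabricator", "sleeps", "quickly", "1,234"], vocab))
```

The same mapping has to be applied to the training tree-bank's rare words, otherwise the grammar never sees the signature tokens it is later asked to parse.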

Here is a gist for exploring which kinds of markup remain to be dealt with (and which kinds disappear without warning):

https://gist.github.com/anonymous/3f64be27604c1aa00fc2
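
Independently of the gist (whose contents are not reproduced here), the markup question can be sketched with a few regular expressions: strip the constructs that are easy to handle, then count what is left over. Everything below is an assumption for illustration: the rule list is far from complete, `strip_markup` and `leftover_markup` are hypothetical names, and a proper wikitext parser would likely be a better fit than regexes.

```python
import re
from collections import Counter

# (pattern, replacement) pairs, applied in order. Note that refs and
# templates are deleted silently, i.e. they "disappear without warning".
RULES = [
    (re.compile(r"<ref[^>]*/>"), ""),                             # <ref name="x" />
    (re.compile(r"<ref[^>]*>.*?</ref>", re.DOTALL), ""),          # <ref>...</ref>
    (re.compile(r"\{\{[^{}]*\}\}"), ""),                          # non-nested {{templates}}
    (re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]"), r"\1"),  # [[target|label]] -> label
    (re.compile(r"\[https?://\S+\s+([^\]]*)\]"), r"\1"),          # [http://url label] -> label
    (re.compile(r"'{2,}"), ""),                                   # ''italic'' / '''bold'''
]

# Markup-like residue that the rules above did not handle.
RESIDUE = re.compile(r"\[\[|\]\]|\{\{|\}\}|\{\||\|\}|<[^>]+>|'{2,}")

def strip_markup(text):
    """Apply each rule once, then collapse runs of whitespace."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return re.sub(r"\s+", " ", text).strip()

def leftover_markup(text):
    """Count the markup-like fragments still present in the text."""
    return Counter(RESIDUE.findall(text))

sample = ("'''Paris''' is the capital of [[France|the French Republic]]."
          "<ref>Some source</ref> {{citation needed}} "
          "See [[Seine]] and [http://example.org the site].")
print(strip_markup(sample))
print(leftover_markup(strip_markup('{| class="wikitable"\n| cell\n|}')))
```

Running `leftover_markup` over a sample of stripped Wikipedia sentences gives a quick tally of which constructs (tables, nested templates, HTML tags) still need a rule.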