Treebanks (parsed sentences) of Wikipedia text, both regular and vandalous, are needed in order to train PCFGs that score the sentences of new edits against each Wikipedia "grammar" (regular and vandalous).
The general strategy is to train a PCFG on a general-purpose treebank corpus (say, the WSJ portion of the Penn Treebank) and use it to parse Wikipedia sentences.
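To make the training step concrete, here is a minimal sketch of relative-frequency PCFG estimation from bracketed trees, using only the Python standard library. The three-tree `TOY_TREEBANK` stands in for a real corpus, and the names `parse_tree`, `productions`, `train_pcfg`, and `tree_logprob` are hypothetical, chosen for illustration. The sketch only estimates rule probabilities and scores an already-parsed tree; producing the parse of a new sentence would additionally need a chart parser (e.g. CKY/Viterbi), which is not shown.

```python
import math
import re
from collections import Counter

# Toy stand-in for a real treebank (assumption: Penn-style bracketing).
TOY_TREEBANK = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (DT the) (NN cat)) (VP (VBD slept)))",
    "(S (NP (NN dogs)) (VP (VBD barked)))",
]

def parse_tree(s):
    """Parse a bracketed tree into nested tuples (label, child, ...).
    A leaf child (a word) is a plain string."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, *children)

    return read()

def productions(tree):
    """Yield (lhs, rhs) for every internal node of the tree.
    Note: words and nonterminals share one namespace in this sketch."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for c in children:
        if not isinstance(c, str):
            yield from productions(c)

def train_pcfg(trees):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter(p for t in trees for p in productions(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

def tree_logprob(tree, pcfg):
    """Sum of log rule probabilities; -inf if the tree uses an unseen rule."""
    total = 0.0
    for rule in productions(tree):
        p = pcfg.get(rule, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

pcfg = train_pcfg([parse_tree(s) for s in TOY_TREEBANK])
```

With one such grammar estimated from regular parses and another from vandalous parses, a new sentence's parse could be scored under both with `tree_logprob` and the two scores compared. The `-inf` for unseen rules is a deliberate simplification; a real system would need smoothing.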
The two main issues will be:
- Determining how to deal with markup/wikitext.
- Determining how to deal with unknown words / proper noun phrases.
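As a rough illustration of both issues, the sketch below strips a few common wikitext constructs with regular expressions and maps out-of-vocabulary tokens to placeholder symbols before parsing. Everything here is an assumption made for illustration: the function names, the `<UNK>` and `<UNK-CAP>` symbols, and the capitalization heuristic for likely proper nouns. The regexes handle only simple, non-nested markup; a dedicated wikitext parser would be more reliable.

```python
import re
from collections import Counter

def strip_wikitext(text):
    """Crude regex-based removal of some common wikitext constructs."""
    text = re.sub(r"<ref[^>]*/>", "", text)                        # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)    # <ref>...</ref>
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                     # non-nested templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"\[https?://\S+\s+([^\]]*)\]", r"\1", text)     # [url label] -> label
    text = re.sub(r"'{2,}", "", text)                              # bold/italic quote runs
    return re.sub(r"\s+", " ", text).strip()

def build_vocab(sentences, min_count=2):
    """Words seen at least min_count times in the tokenized training sentences."""
    counts = Counter(w for s in sentences for w in s)
    return {w for w, n in counts.items() if n >= min_count}

def replace_unknown(tokens, vocab):
    """Map out-of-vocabulary tokens to placeholder symbols (hypothetical names).
    A capitalized unknown token that is not sentence-initial is treated as a
    likely proper noun."""
    out = []
    for i, tok in enumerate(tokens):
        if tok in vocab:
            out.append(tok)
        elif tok[:1].isupper() and i > 0:
            out.append("<UNK-CAP>")
        else:
            out.append("<UNK>")
    return out

raw = "'''Noam Chomsky''' is a [[linguistics|linguist]]{{citation needed}}.<ref>Some source</ref>"
clean = strip_wikitext(raw)  # "Noam Chomsky is a linguist."
```

For the placeholders to be useful, the same replacement would have to be applied to rare words in the training treebank, so that the PCFG has lexical rules for `<UNK>` and `<UNK-CAP>`.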
Here is a gist for exploring which kinds of markup remain to be dealt with, and which kinds disappear without warning.