Determine how to build WP phrase-structure tree-bank.
Closed, Resolved · Public

Description

Treebanks (collections of parsed sentences) of Wikipedia sentences, both regular and vandalous, are needed to train PCFGs. Those PCFGs would then score the sentences of new edits against Wiki-grammar (again, both regular and vandalous).

The general strategy is to train a PCFG on a general treebank corpus (say, the WSJ portion of the Penn Treebank) and use it to parse Wikipedia sentences.
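
As a rough illustration of the first half of that strategy, here is a minimal sketch of estimating PCFG rule probabilities from bracketed trees by relative frequency. The toy treebank and all function names (`parse_tree`, `productions`, `estimate_pcfg`, `tree_log_prob`) are made up for this example and do not come from the task; a real run would read Penn-style trees from a corpus and would still need a parser (for example CKY) on top to parse new Wikipedia sentences.

```python
import math
from collections import Counter

def parse_tree(s):
    """Parse a bracketed tree such as '(S (NP (NNP Alice)) (VP (VBD sat)))'
    into nested tuples (label, child, ...); leaves are plain strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # consume ")"
            return (label, *children)
        word = tokens[pos]
        pos += 1
        return word

    return read()

def productions(tree):
    """Yield one (lhs, rhs) rule for every internal node of the tree."""
    if isinstance(tree, str):
        return
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        yield from productions(child)

def estimate_pcfg(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(rule for t in trees for rule in productions(t))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

def tree_log_prob(tree, pcfg):
    """Sum of log rule probabilities; -inf if the tree uses an unseen rule."""
    total = 0.0
    for rule in productions(tree):
        p = pcfg.get(rule, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

# Hypothetical three-sentence treebank, for illustration only.
TOY_TREEBANK = [
    "(S (NP (DT the) (NN cat)) (VP (VBD sat)))",
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (NNP Alice)) (VP (VBD sat)))",
]
trees = [parse_tree(s) for s in TOY_TREEBANK]
pcfg = estimate_pcfg(trees)
```

With two such grammars, one estimated from regular sentences and one from vandalous ones, comparing `tree_log_prob` under each would be one possible way to score a sentence, though the task does not specify the scoring scheme.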

The two main issues will be:

  1. Determining how to deal with markup/wikitext.
  2. Determining how to deal with unknown words / proper noun phrases.

Here is a gist that lets one explore which kinds of markup remain to be dealt with (and which kinds disappear without warning):

https://gist.github.com/anonymous/3f64be27604c1aa00fc2
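
To make the two issues listed above concrete, here is a minimal sketch of one possible preprocessing pass. It is not taken from the gist: the function names `strip_wikitext` and `replace_rare_words`, the regular expressions, and the `<UNK>` token are assumptions for illustration. The regexes only cover a few simple constructs (non-nested templates, `<ref>` tags, internal links, bold/italic quotes); real wikitext would need a proper parser.

```python
import re
from collections import Counter

def strip_wikitext(text):
    """Very rough wikitext cleanup covering only a few simple constructs."""
    text = re.sub(r"<ref[^>]*/>", "", text)                      # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)  # <ref>...</ref>
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                   # non-nested templates
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)                            # bold/italic markers
    return re.sub(r"\s+", " ", text).strip()

def replace_rare_words(sentences, min_count=2, unk="<UNK>"):
    """Map tokens seen fewer than min_count times to a single unknown token,
    so a grammar trained on the result has a rule for unseen words."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok if counts[tok] >= min_count else unk for tok in sent]
            for sent in sentences]

raw = ("'''Paris'''{{citation needed}} is the capital of [[France]]."
       "<ref>cite</ref> It lies on the [[Seine|Seine river]].")
clean = strip_wikitext(raw)

tokenized = [["the", "cat", "sat"], ["the", "dog", "sat"]]
with_unk = replace_rare_words(tokenized)
```

Collapsing rare words into one token is only one common option for unknown words; proper noun phrases in particular may be better served by mapping them to a tag-like placeholder, which this sketch does not attempt.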

Event Timeline

aetilley created this task. Jan 1 2016, 6:43 PM
aetilley updated the task description.
aetilley raised the priority of this task from to Needs Triage.
aetilley claimed this task.
aetilley added subscribers: aetilley, Halfak.
Restricted Application added subscribers: StudiesWorld, Aklapper. Jan 1 2016, 6:43 PM
aetilley moved this task from Review to Backlog on the Scoring-platform-team (Current) board.
ToAruShiroiNeko set Security to None.
Halfak closed this task as Resolved. Jan 21 2016, 3:44 PM