Maniphest T122728

Determine how to build WP phrase-structure tree-bank.
Closed, Resolved (Public)

Authored By

	aetilley
	Jan 1 2016, 6:43 PM

Referenced Files

None

Description

Tree-banks (collections of parses) of Wikipedia sentences, both regular and vandalous, are needed to train PCFGs. Those PCFGs will score the sentences of new edits against each Wiki-grammar (regular and vandalous).

The general strategy is to train a PCFG on a general-purpose treebank corpus (say, the Penn Treebank or its WSJ section) and use it to parse Wikipedia sentences.
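To make the "train a PCFG on a treebank" step concrete, here is a minimal sketch of relative-frequency estimation in plain Python. The two-sentence treebank, the nested-tuple tree format, and the function names are all made up for illustration. A real run would read an actual treebank and would more likely use an existing toolkit than hand-rolled code.

```python
from collections import Counter

# Trees are nested tuples: (label, child, child, ...); a leaf child is a plain string.
# This two-sentence "treebank" is a hypothetical stand-in for a real corpus.
TOY_TREEBANK = [
    ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats"))),
    ("S", ("NP", "cats"), ("VP", ("V", "sleep"))),
]

def productions(tree):
    """Yield (lhs, rhs) for every internal node of a tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(p for tree in treebank for p in productions(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

def tree_probability(tree, pcfg):
    """Probability of one parse: product of its rule probabilities (0.0 if a rule is unseen)."""
    prob = 1.0
    for rule in productions(tree):
        prob *= pcfg.get(rule, 0.0)
    return prob

pcfg = estimate_pcfg(TOY_TREEBANK)
```

Scoring a sentence of a new edit would then amount to comparing its best parse probability under the "regular" PCFG with the one under the "vandalous" PCFG. Finding that best parse requires a parser (for example CKY), which this sketch leaves out.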

The two main issues will be:

Determining how to deal with markup/wikitext.

Determining how to deal with unknown words / proper noun phrases.
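As a rough illustration of both issues, the sketch below strips a few simple wikitext constructs with regular expressions and maps rare tokens to a placeholder. Both are assumptions for discussion, not what this task settled on. The regexes ignore nested templates, tables, and refs. The `<UNK>` replacement is just one common way to handle unknown words, and the function names and the `min_count` threshold are hypothetical.

```python
import re
from collections import Counter

def strip_wikitext(text):
    """Very rough markup removal; real wikitext needs a proper parser."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)  # drop non-nested {{templates}}
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)  # bold/italic quote runs
    return re.sub(r"\s+", " ", text).strip()

def replace_rare_words(sentences, min_count=2, unk="<UNK>"):
    """Map every token seen fewer than min_count times to one placeholder token."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok if counts[tok] >= min_count else unk for tok in sent]
            for sent in sentences]

raw = "'''Paris''' is the capital of [[France|the French Republic]].{{citation needed}}"
clean = strip_wikitext(raw)

tokenized = [["the", "dog", "barks"], ["the", "cat", "barks"]]
with_unk = replace_rare_words(tokenized)
```

Collapsing rare words into a single token loses the distinction between, say, an unseen proper noun and a misspelling, so a finer scheme (several placeholder classes based on capitalization or suffix) may be worth considering.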

Here is a gist that lets one explore which kinds of markup remain to be dealt with (and which kinds disappear without warning).

https://gist.github.com/anonymous/3f64be27604c1aa00fc2

Event Timeline

aetilley created this task. Jan 1 2016, 6:43 PM

aetilley claimed this task.

aetilley raised the priority of this task from to Needs Triage.

aetilley updated the task description.

aetilley added subscribers: aetilley, Halfak.

Restricted Application added subscribers: StudiesWorld, Aklapper. Jan 1 2016, 6:43 PM

Dereckson added projects: revscoring, Machine-Learning-Team (Active Tasks). Jan 1 2016, 6:47 PM

Dereckson subscribed.

aetilley moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board. Jan 8 2016, 5:47 PM

aetilley moved this task from Review to Backlog on the Machine-Learning-Team (Active Tasks) board.

ToAruShiroiNeko removed a project: revscoring. Jan 10 2016, 8:51 PM

ToAruShiroiNeko set Security to None.

Redirecting into Project

https://phabricator.wikimedia.org/T123759

Halfak moved this task from Backlog to Completed on the Machine-Learning-Team (Active Tasks) board. Jan 15 2016, 6:16 PM

Halfak closed this task as Resolved. Jan 21 2016, 3:44 PM