Sentence bank for Featured Articles
Closed, Resolved, Public


Develop a method for gathering Featured Article sentences for training a PCFG model.

Naive thoughts on strategies:

  • Parse featured articles

Event Timeline

Halfak renamed this task from Generate sentence banks for English Language FA sentences to Sentence bank for English Language FA sentences. Oct 13 2016, 1:43 PM
Halfak updated the task description.
Halfak renamed this task from Sentence bank for English Language FA sentences to Sentence bank for Featured Articles. Oct 13 2016, 1:47 PM
Halfak triaged this task as Medium priority. Oct 13 2016, 2:55 PM
Halfak moved this task from Unsorted to Research & analysis on the Machine-Learning-Team board.

I wrote a query to get the most recent version of FA articles. See

Here are my notes from IRC today.

[14:23:25] <halfak> Looks like we'll get about 1m sentences from all the featured articles in English Wikipedia
[14:23:49] <halfak> Somewhere between 1m and 2m
[14:24:11] <halfak> It's pretty cool to watch the # of sentences per article fly by
[14:24:20] <halfak> We get about 50 in the smallest featured articles
[14:24:30] <halfak> And 700 in the biggest
[14:38:01] <halfak> OK.  I need to do some cleanup, but it looks like I get roughly 1.4 million sentences.
[14:45:32] <halfak> I just made some adjustments to the sentence segmenter.  We should see the number of sentences drop slightly. 
[15:09:38] <halfak> Down to 1.3 million sentences. 
[15:09:47] <halfak> There's a lot of crap in here that doesn't look like a sentence. 
[15:10:04] <halfak> Looks like spacy need to parse them into *something*
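
The notes above mention that many of the segmented strings do not look like sentences. One way to clean them up might be a cheap heuristic filter applied after segmentation. The following is only a sketch under stated assumptions: the function names, the thresholds, and the markup patterns are all hypothetical and are not taken from this task or from the actual segmenter. It uses only the Python standard library.

```python
import re

# Illustrative thresholds (assumptions, not values from the task).
MIN_TOKENS = 4          # drop fragments such as section headings
MAX_TOKENS = 100        # drop run-on table or list residue
MIN_ALPHA_RATIO = 0.6   # share of characters that are letters or spaces

# Ends with ., ! or ?, optionally followed by closing quotes or brackets.
_TERMINAL = re.compile(r"""[.!?]["')\]]*$""")
# Leftover wikitext: links, templates, HTML tags, table/list/heading prefixes.
_MARKUP = re.compile(r"\[\[|\]\]|\{\{|\}\}|<[^>]+>|^[|!*#=:;]")


def looks_like_sentence(text):
    """Return True if text passes a few rough sentence-shape checks."""
    text = text.strip()
    if not text:
        return False
    tokens = text.split()
    if not MIN_TOKENS <= len(tokens) <= MAX_TOKENS:
        return False
    # Expect an uppercase letter, a digit, or an opening quote/paren first.
    if not (text[0].isupper() or text[0].isdigit() or text[0] in "\"'("):
        return False
    if not _TERMINAL.search(text):
        return False
    if _MARKUP.search(text):
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= MIN_ALPHA_RATIO


def filter_sentences(candidates):
    """Keep only the candidates that look like sentences, stripped."""
    return [s.strip() for s in candidates if looks_like_sentence(s)]
```

For example, `filter_sentences` would keep "The battle ended in a decisive victory for the allied forces." and drop a table row such as "| align=center | 1994 || 12 || 3" or a heading such as "See also". A filter like this trades recall for precision, which seems acceptable when there are over a million candidates to draw from.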