
Sentence bank for Featured Articles
Closed, Resolved · Public

Description

Develop a method for gathering Featured Article sentences for training a PCFG (probabilistic context-free grammar) model.

Naive thoughts on strategies:

  • Parse featured articles

Event Timeline

Halfak created this task. Oct 13 2016, 1:42 PM
Restricted Application added a subscriber: Aklapper. Oct 13 2016, 1:42 PM
Halfak renamed this task from Generate sentence banks for English Language FA sentences to Sentence bank for English Language FA sentences. Oct 13 2016, 1:43 PM
Halfak updated the task description.
Halfak renamed this task from Sentence bank for English Language FA sentences to Sentence bank for Featured Articles. Oct 13 2016, 1:47 PM
Halfak triaged this task as Normal priority. Oct 13 2016, 2:55 PM
Halfak moved this task from Untriaged to Research & analysis on the Scoring-platform-team board.
Halfak added a comment. Nov 3 2016, 8:53 PM

I wrote a query to get the most recent version of each featured article. See https://quarry.wmflabs.org/query/13818

Here are my notes from IRC today.

[14:23:25] <halfak> Looks like we'll get about 1m sentences from all the featured articles in English Wikipedia
[14:23:49] <halfak> Somewhere between 1m and 2m
[14:24:11] <halfak> It's pretty cool to watch the # of sentences per article fly by
[14:24:20] <halfak> We get about 50 in the smallest featured articles
[14:24:30] <halfak> And 700 in the biggest
[14:38:01] <halfak> OK.  I need to do some cleanup, but it looks like I get roughly 1.4 million sentences.
[14:45:32] <halfak> I just made some adjustments to the sentence segmenter.  We should see the number of sentences drop slightly. 
[15:09:38] <halfak> Down to 1.3 million sentences. 
[15:09:47] <halfak> There's a lot of crap in here that doesn't look like a sentence. 
[15:10:04] <halfak> Looks like spaCy needs to parse them into *something*
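
The cleanup step mentioned in the notes above (dropping segments that don't look like sentences) could be sketched roughly as follows. This is only an illustration under stated assumptions: the segmenter actually used for this task isn't shown here, the regex boundary rule is a naive stand-in for a real segmenter, and the names `split_sentences`, `looks_like_sentence`, and `sentence_bank` are hypothetical.

```python
import re

# Assumption: a naive boundary rule that splits on whitespace following
# '.', '!' or '?'. A real segmenter would handle abbreviations and more.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Naively split plain text into candidate sentences."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]


def looks_like_sentence(candidate, min_words=3):
    """Heuristic filter (hypothetical): enough words, starts with an
    uppercase letter, and ends with terminal punctuation."""
    if len(candidate.split()) < min_words:
        return False
    if not candidate[0].isupper():
        return False
    return candidate[-1] in ".!?"


def sentence_bank(articles):
    """Build a list of (title, sentence) pairs from a dict that maps
    article titles to plain text, keeping only plausible sentences."""
    bank = []
    for title, text in articles.items():
        for candidate in split_sentences(text):
            if looks_like_sentence(candidate):
                bank.append((title, candidate))
    return bank


if __name__ == "__main__":
    sample = {"Example": "The cat sat on the mat. It was warm! | colspan=2 | 1999. See also"}
    for title, sentence in sentence_bank(sample):
        print(title, "\t", sentence)
```

A filter like this trades recall for precision: table residue and heading fragments are dropped, but so are legitimate sentences that start with a digit or a quotation mark, so the thresholds would need tuning against real article text.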
Halfak claimed this task. Jan 19 2017, 3:37 PM
Halfak moved this task from Active to Done on the Scoring-platform-team (Current) board.
Halfak closed this task as Resolved. Feb 7 2017, 8:31 PM