Develop a method for gathering Featured Article sentences for training a PCFG model.
Naive thoughts on strategies:
- Parse featured articles and segment them into sentences (a rough sketch follows below)
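As a starting point, here is a minimal sketch of the parse-and-segment step. It assumes mwparserfromhell for stripping markup and NLTK's punkt tokenizer for segmentation; the actual segmenter used in the pipeline (see the IRC notes below) may well differ.

```python
# Minimal sketch: strip wiki markup and segment plain text into sentences.
# Assumes mwparserfromhell and NLTK with the punkt model downloaded;
# the production segmenter may be different.
import mwparserfromhell
import nltk


def extract_sentences(wikitext):
    """Return a list of plain-text sentences from raw wikitext."""
    plain_text = mwparserfromhell.parse(wikitext).strip_code()
    sentences = []
    for paragraph in plain_text.split("\n"):
        paragraph = paragraph.strip()
        if paragraph:
            sentences.extend(nltk.sent_tokenize(paragraph))
    return sentences
```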
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Halfak | T148038 [Epic] Build draft quality model (spam, vandalism, attack, or OK)
Open | | None | T144636 [Epic] Implement PCFG features for editquality and draftquality
Resolved | | Halfak | T151819 Analyze differentiation of FA, Spam, Vandalism, and Attack models/sentences.
Resolved | | Halfak | T148037 Generate PCFG sentence models
Resolved | | Halfak | T148033 Sentence bank for Featured Articles
Resolved | | Halfak | T148867 Implement sentences datasources
I wrote a query to get the most recent version of FA articles. See https://quarry.wmflabs.org/query/13818
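The Quarry query itself runs against the labs database replicas. For illustration only, here is a rough sketch of pulling the same information through the public MediaWiki action API instead; the category name and parameters are my assumptions, not the actual query.

```python
# Rough sketch: fetch the latest revision of each Featured Article via the
# public MediaWiki action API rather than the replica databases.
# Category name and parameters are assumptions, not the Quarry query itself.
import requests

API = "https://en.wikipedia.org/w/api.php"


def featured_article_titles():
    """Yield the titles of mainspace pages in Category:Featured articles."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:Featured articles",
        "cmnamespace": 0,
        "cmlimit": 500,
        "format": "json",
    }
    while True:
        response = requests.get(API, params=params).json()
        for member in response["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in response:
            break
        params.update(response["continue"])


def latest_wikitext(title):
    """Return the wikitext of the most recent revision of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return next(iter(pages.values()))["revisions"][0]["*"]
```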
Here are my notes from IRC today.
[14:23:25] <halfak> Looks like we'll get about 1m sentences from all the featured articles in English Wikipedia
[14:23:49] <halfak> Somewhere between 1m and 2m
[14:24:11] <halfak> It's pretty cool to watch the # of sentence per article fly by
[14:24:20] <halfak> We get about 50 in the smallest featured articles
[14:24:30] <halfak> And 700 in the biggest
[14:38:01] <halfak> OK. I need to do some cleanup, but it looks like I get roughy 1.4 million sentences.
[14:45:32] <halfak> I just made some adjustments to the sentence segmenter. We should see the number of sentences drop slightly.
[15:09:38] <halfak> Down to 1.3 million sentences.
[15:09:47] <halfak> There's a lot of crap in here that doesn't look like a sentence.
[15:10:04] <halfak> Looks like spacy need to parse them into *something*
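Since many of the extracted strings apparently don't look like sentences, some kind of pre-filter seems necessary before handing them to spacy. Below is a rough sketch of one such heuristic; the thresholds and rules are my assumptions, not the filter actually used in the pipeline.

```python
# Rough heuristic filter for strings that don't read like prose sentences,
# applied before parsing. Thresholds and rules here are assumptions only.
import re


def looks_like_sentence(text, min_words=4, max_words=120):
    """Return True if the string plausibly reads as an English sentence."""
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    if not text[0].isupper():  # sentences usually start with a capital letter
        return False
    if not re.search(r"[.!?][\"')\]]*$", text):  # and end with terminal punctuation
        return False
    letters = sum(ch.isalpha() for ch in text)
    return letters / len(text) > 0.6  # mostly alphabetic, not table/markup debris
```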