
Convert to nettrom style WP 1.0 label extraction process
Closed, Resolved · Public

Description

I'm struggling to replicate @Nettrom's article quality labeling in wikilabels. When I perform an extraction, then train and test a model, I get ~54% accuracy. When I use Nettrom's labeled data (and extract my feature set), I get 61% accuracy. What's the difference here?

Please review the enwiki extractor and the Makefile command.

Event Timeline

Halfak created this task. Mar 17 2016, 9:35 PM
Restricted Application added a subscriber: Aklapper. Mar 17 2016, 9:35 PM

Hey @Nettrom, would you take a look at this? It might help if we meet and talk about how the extractor is intended to work and compare your process side-by-side.

FWIW, we're also getting ~54% accuracy in frwiki with this extraction strategy, so I expect that our accuracy will go up there once we clean up our labels extractor.

I couldn't find anything in the label extractor that's cause for concern. Having thought about it, I suspect there are significant differences in our overarching methodology. A hangout is probably the best approach to walk through it. I'll email @Halfak so we can get moving on that.

@Halfak : I added a comment to https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_article_quality/Work_log/2016-03-29, not sure if I should've done that or commented here instead? Let me know if I should copy it over here.

Apart from that, very interesting findings, hopefully the kinks are ironed out now so the models can be trained, curious to see the results!

Halfak added a comment. Apr 8 2016, 7:29 PM

Some notes on the last run: https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_article_quality/Work_log/2016-04-08 Still haven't actually experimented with train/test yet, but it looks like we are getting close.

https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_article_quality/Work_log/2016-04-08

Looks like we get Accuracy: 0.575

So, a bit better, but not great. I propose we switch to @Nettrom's method entirely and give that a try.

Hmmm… these are interesting results. I'm wondering if my approach results in a less noisy dataset, but also not sure how we would go about figuring that out.

One of the limitations of my data gathering process is that it starts from the current state of assessment ratings, which means that you'll only ever see x number of FAs, for however many FAs there currently are (4,704 at the moment).

I recently came across a research paper[0] that used "synthetic minority over-sampling", a technique described by Chawla et al. in 2002[1] where the minority class is over-sampled by generating synthetic samples from the k nearest neighbors of the minority class (instead of just duplicating samples). It wasn't difficult to find a Python implementation[2] related to scikit-learn that you might want to give a try. I'd be curious to know what happens if you double the size of the dataset.

References:
0: http://arxiv.org/abs/1603.01987
1: http://www.jair.org/media/953/live-953-2037-jair.pdf
2: http://comments.gmane.org/gmane.comp.python.scikit-learn/5278
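
To make the over-sampling idea above concrete, here is a minimal pure-Python sketch. It is not the implementation linked in [2], and it simplifies the procedure from Chawla et al.: it picks the base sample at random instead of iterating over every minority sample a fixed number of times. The function name `smote`, its parameters, and the toy feature vectors are all made up for illustration.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic new points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance; enough for ranking neighbors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbors of the base sample among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Random position on the line segment from base to neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical feature vectors for a small minority class (e.g. FA articles).
fa_features = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
# Doubling the class: generate as many synthetic points as real ones.
new_points = smote(fa_features, n_synthetic=len(fa_features), k=2)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region the class already occupies, which is the difference from simply duplicating rows. Synthetic samples should only go into the training split, never the test split, or the reported accuracy will be inflated.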

Halfak renamed this task from Fix WP 1.0 label extraction process for English Wikipedia to Convert to nettrom style WP 1.0 label extraction process. May 16 2016, 3:54 PM
Halfak triaged this task as Low priority. Jul 5 2016, 2:32 PM
Aklapper removed Halfak as the assignee of this task. Jun 19 2020, 4:29 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

Restricted Application added a project: artificial-intelligence. Jun 19 2020, 4:29 PM
Halfak closed this task as Resolved. Jun 23 2020, 3:55 PM
Halfak claimed this task.

This is done.