I'm struggling to replicate @Nettrom's article quality labeling in wikilabels. When I perform an extraction, then train and test a model, I get ~54% accuracy. When I use Nettrom's labeled data (and extract my feature set myself), I get 61% accuracy. What's the difference here?
- T130259 [Epic] Article quality models (wp10) (status: Invalid, assignee: None)
- T130312 Convert to nettrom style WP 1.0 label extraction process (status: Resolved, assignee: Halfak)
Hey @Nettrom, would you take a look at this? It might help if we meet and talk about how the extractor is intended to work and compare your process side-by-side.
FWIW, we're also getting ~54% accuracy in frwiki with this extraction strategy, so I expect that our accuracy will go up there once we clean up our labels extractor.
See notes here https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_article_quality/Work_log/2016-03-25
and here: https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_article_quality/Work_log/2016-03-29
I'm still waiting for the last extraction to finish and then I'll try training models again.
@Halfak : I added a comment to https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_article_quality/Work_log/2016-03-29, not sure if I should've done that or commented here instead? Let me know if I should copy it over here.
Apart from that, very interesting findings! Hopefully the kinks are ironed out now so the models can be trained; curious to see the results!
Some notes on the last run: https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_article_quality/Work_log/2016-04-08 Still haven't actually experimented with train/test yet, but it looks like we are getting close.
Looks like we get Accuracy: 0.575
So, a bit better, but not great. I propose we switch to @Nettrom's method entirely and give that a try.
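For anyone trying to reproduce these accuracy numbers, the measurement boils down to holding out a test set and scoring predictions against it. Here is a minimal, self-contained numpy sketch; the synthetic data and the nearest-centroid classifier are stand-ins (the actual pipeline uses revscoring feature extraction and different models, which are not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in labeled dataset: 6 quality classes (Stub..FA), 20 features,
# with each class's mean feature vector shifted by its class index.
X = rng.normal(size=(600, 20)) + np.repeat(np.arange(6), 100)[:, None]
y = np.repeat(np.arange(6), 100)

# Shuffle, then hold out 20% as a test set
order = rng.permutation(len(X))
X, y = X[order], y[order]
split = int(0.8 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Nearest-centroid classifier: predict the class whose mean training
# feature vector is closest to the test example
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in range(6)])
dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)

accuracy = (pred == y_test).mean()
print(f"Accuracy: {accuracy:.3f}")
```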
Hmmm… these are interesting results. I'm wondering if my approach results in a less noisy dataset, but also not sure how we would go about figuring that out.
One of the limitations of my data gathering process is that it starts from the current state of assessment ratings, which means you'll only ever see as many FAs as currently exist (4,704 at the moment).
I recently came across a research paper that used "synthetic minority over-sampling" (SMOTE), a technique described by Chawla et al. in 2002 where the minority class is over-sampled by generating synthetic samples from the k nearest neighbors of minority-class points (instead of just duplicating samples). It wasn't difficult to find a Python implementation related to scikit-learn that you might want to give a try; I'd be curious to know what happens if you double the size of the dataset.
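The core of the technique is simple enough to sketch directly. Below is a minimal pure-numpy illustration of the interpolation step from Chawla et al. (2002), not the scikit-learn-compatible implementation mentioned above; the function name and parameters are my own for illustration:

```python
import numpy as np

def smote(X, n_synthetic, k=5, rng=None):
    """Generate synthetic minority-class samples a la Chawla et al. (2002):
    pick a random minority point, pick one of its k nearest minority-class
    neighbors, and interpolate at a random fraction along the segment
    between them."""
    rng = rng or np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    # Pairwise distances within the minority class
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    # Indices of the k nearest neighbors for each point
    nn = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))            # random minority point
        j = nn[i, rng.integers(k)]          # random one of its k neighbors
        gap = rng.random()                  # fraction along the segment
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

# Toy minority class: 10 points in 2-D, doubled via SMOTE
minority = np.random.default_rng(42).normal(size=(10, 2))
new_points = smote(minority, n_synthetic=10)
print(new_points.shape)  # (10, 2)
```

In practice you'd run this only on the minority classes (e.g. FA) and append the synthetic rows to the training set before fitting the model; the test set should stay untouched so the accuracy numbers remain comparable.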
This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!
For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)