
Mini blogpost for Article Quality Score dataset
Closed, Resolved · Public

Description

Jotting down a blurb for a mini blogpost announcing the dataset. Gave Comms a heads-up.

Event Timeline

@MelodyKramer talked to me a little about it. I think we should keep in touch.

https://docs.google.com/document/d/144w27s1VktfCPuovdU1dUl4as_dLfJyHvknMv1Zt48E/edit?ts=57e99446#

For those without access, here's a summary:


Headline/summary

Wikipedia Quality Trends Dataset

Wikimedia Research here with another FANTASTIC dataset. Looking to explore how Wikipedia articles have improved over time? Frustrated with building propensity models to deal with sporadic quality re-assessments captured in talk page templates? Boy do we have a dataset for you!

Authors

Aaron Halfaker, Principal Research Scientist
Amir Sarabadani, Volunteer
Image(s)
https://commons.wikimedia.org/wiki/File:Enwiki.biology.monthly_wp10.svg

Body

Today, we’re announcing the release of a dataset that captures trends in Wikipedia article quality. In the past, explorations of article quality trends in Wikipedia have been complex and difficult to pursue because articles are re-assessed only sporadically. And because the assessment process is manual, quality ratings tend to lag behind an article's real quality.

In order to pave the way for studies of the Wikipedian processes that lead to quality, we’ve generated and published a dataset containing article quality scores for every article-month since January 2001. Each row contains an article quality prediction from a text-only machine classifier (based on [1], with slight improvements) hosted by ORES [2]. We've managed to build high-quality prediction models for the English, French, and Russian Wikipedias, so our team has generated a dataset for each of those wikis. The data is current as of August 2016. We plan to expand to new wikis and to run updates periodically.
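You can also pull a fresh prediction for any single revision straight from ORES. Here's a minimal sketch: the `wp10` model name and the v3 scores endpoint are assumptions based on ORES's public API (check the API docs for your wiki and ORES version), and the revision ID is just an example.

```python
# Minimal sketch: fetch an article quality prediction from ORES.
# Assumptions: the "wp10" model name and the v3 scores endpoint;
# adjust for the wiki and API version you are targeting.
import requests

def fetch_quality(rev_id, wiki="enwiki"):
    url = f"https://ores.wikimedia.org/v3/scores/{wiki}/{rev_id}/wp10"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    score = response.json()[wiki]["scores"][str(rev_id)]["wp10"]["score"]
    return score["prediction"], score["probability"]

prediction, probabilities = fetch_quality(123456789)  # example rev_id
print(prediction)     # e.g. "Start"

# The dataset's weighted_sum column (described below) is the ordinal
# expectation over the indexed classes ("Stub" = 0, "Start" = 1, ...).
classes = ["Stub", "Start", "C", "B", "GA", "FA"]
weighted_sum = sum(i * probabilities[c] for i, c in enumerate(classes))
print(weighted_sum)
```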

Here’s the citation for the data itself.

Halfaker, Aaron (2016): Monthly Wikipedia article quality predictions. figshare. 
https://dx.doi.org/10.6084/m9.figshare.3859800
Retrieved: 00:56, Oct 12, 2016 (GMT)

The files are compressed tab-separated values with the following columns:

  • page_id -- The page identifier
  • page_title -- The title of the article (UTF-8 encoded)
  • rev_id -- The most recent revision ID at the time of assessment
  • timestamp -- The timestamp when the assessment was taken (YYYYMMDDHHMMSS)
  • prediction -- The predicted quality class ("Stub", "Start", "C", "B", "GA", "FA", ...)
  • weighted_sum -- The sum of prediction weights assuming indexed class ordering ("Stub" = 0, "Start" = 1, ...)
  1. Warncke-Wang, M., Cosley, D., & Riedl, J. (2013, August). Tell me more: An actionable quality model for Wikipedia. In Proceedings of the 9th International Symposium on Open Collaboration (p. 8). ACM.
  2. https://ores.wikimedia.org/
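If you'd rather poke at the files than read about them, here's a minimal loading sketch using pandas. The file name and bz2 compression are stand-ins; substitute whichever per-wiki file you download from the figshare deposit.

```python
# Minimal sketch: load a per-wiki quality file and chart a monthly trend.
# "enwiki.tsv.bz2" and the bz2 compression are assumptions; substitute
# whichever file you download from figshare.
import pandas as pd

df = pd.read_csv("enwiki.tsv.bz2", sep="\t", compression="bz2")

# Timestamps are YYYYMMDDHHMMSS; parse them into real datetimes.
df["timestamp"] = pd.to_datetime(df["timestamp"].astype(str),
                                 format="%Y%m%d%H%M%S")

# weighted_sum orders the classes ("Stub" = 0, "Start" = 1, ...), so its
# monthly mean gives a simple wiki-wide quality trend.
monthly = df.groupby(df["timestamp"].dt.to_period("M"))["weighted_sum"].mean()
print(monthly.tail())
```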

I imagine reading the summary in Billy Mays' voice. I'm not sure that's such a great idea, but it was fun for me.

I'm thinking the 2015 CSCW paper is a better citation for this dataset (and for ORES in general going forward). While both it and the "Tell Me More" paper take the same approach, the more recent paper does a much better job of identifying the appropriate features, and its training data gathering is much better. Overall, it's simply more similar to the model ORES uses.

Warncke-Wang, M., Ayukaev, V. R., Hecht, B., & Terveen, L. (2015). The success and failure of quality improvement projects in peer production communities. In Proceedings of the 18th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW). http://www-users.cs.umn.edu/~morten/publications/cscw2015-improvementprojects.pdf