
Mini blogpost for Article Quality Score dataset
Closed, Resolved · Public


Jotting down a blurb for a mini blogpost announcing the dataset. Gave Comms the heads up.

Event Timeline

@MelodyKramer talked to me a little about it. I think we should keep in touch.

For those without access, here's a summary:


Wikipedia Quality Trends Dataset

Wikimedia Research here with another FANTASTIC dataset. Looking to explore how Wikipedia articles have improved over time? Frustrated with building propensity models to deal with sporadic quality re-assessments captured in talk page templates? Boy do we have a dataset for you!


Aaron Halfaker, Principal Research Scientist
Amir Sarabadani, Volunteer


Today, we’re announcing the release of a dataset that captures trends in Wikipedia article quality. In the past, explorations into article quality trends in Wikipedia have been complex and difficult to pursue because of the unpredictability of when articles are re-assessed. Since the assessment process is manual, article quality assessments tend to lag behind the real quality of an article.

In order to pave the way for studies of the Wikipedian processes that lead to quality, we’ve generated and published a dataset containing article quality scores for article-months since January 2001. Each row has an article quality prediction from a text-only machine classifier (from [1], with slight improvements) hosted by ORES[2]. We've managed to build high-quality prediction models for the English, French, and Russian Wikipedias, so our team has generated datasets for each of those wikis. The data is current as of August 2016. We plan to expand to new wikis and to run updates periodically.

Here’s the citation for the data itself.

Halfaker, Aaron (2016): Monthly Wikipedia article quality predictions. figshare.
Retrieved: 00:56, Oct 12, 2016 (GMT)

The files are compressed tab-separated values with the following columns:

  • page_id -- The page identifier
  • page_title -- The title of the article (UTF-8 encoded)
  • rev_id -- The most recent revision ID at the time of assessment
  • timestamp -- The timestamp when the assessment was taken (YYYYMMDDHHMMSS)
  • prediction -- The predicted quality class ("Stub", "Start", "C", "B", "GA", "FA", ...)
  • weighted_sum -- The sum of prediction weights assuming indexed class ordering ("Stub" = 0, "Start" = 1, ...)

1. Warncke-Wang, M., Cosley, D., & Riedl, J. (2013, August). Tell me more: an actionable quality model for Wikipedia. In Proceedings of the 9th International Symposium on Open Collaboration (p. 8). ACM.
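To illustrate the column scheme above, here's a minimal sketch of parsing the dataset with Python's standard library. The inline sample rows and values are purely illustrative (not taken from the actual files), and the real files on figshare are compressed, so you'd read them through `gzip.open` rather than from a string.

```python
import csv
import io

# Ordered quality classes for English Wikipedia; the list index matches
# the weighted_sum encoding ("Stub" = 0, "Start" = 1, ...).
CLASSES = ["Stub", "Start", "C", "B", "GA", "FA"]

def load_scores(tsv_text):
    """Parse dataset rows from tab-separated text into typed dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        row["page_id"] = int(row["page_id"])
        row["rev_id"] = int(row["rev_id"])
        # weighted_sum is the probability-weighted sum over class indices,
        # giving a real-valued quality estimate on the 0-5 scale.
        row["weighted_sum"] = float(row["weighted_sum"])
        # Map the predicted class label to its ordinal for trend analysis.
        row["quality_ordinal"] = CLASSES.index(row["prediction"])
        rows.append(row)
    return rows

# Toy sample with two article-months (values are made up for illustration).
header = "\t".join(["page_id", "page_title", "rev_id", "timestamp",
                    "prediction", "weighted_sum"])
sample = (header + "\n"
          + "12\tAnarchism\t739386896\t20160801000000\tB\t3.12\n"
          + "25\tAutism\t736651408\t20160801000000\tGA\t4.01\n")

rows = load_scores(sample)
```

Because `weighted_sum` is continuous while `prediction` is categorical, the weighted sum is usually the more convenient column for plotting quality trajectories over time.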

I imagine reading the summary in Billy Mays' voice. I'm not sure that's such a great idea, but it was fun for me.

I'm thinking the 2015 CSCW paper is a better citation for this dataset (and ORES in general moving forward). While the approach is the same in both that and the "Tell Me More" paper, the more recent paper does a much better job of figuring out the appropriate features, and its training data was gathered more carefully. So overall it's just more similar to the model ORES uses.

Warncke-Wang, M., Ayukaev, V. R., Hecht, B., & Terveen, L. (2015). The success and failure of quality improvement projects in peer production communities. In Proceedings of the 18th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW).