
Improve cleaning of article quality assessment datasets
Closed, ResolvedPublic


Four proposed improvements to cleaning article quality assessment datasets, filtering out:

  1. Talk pages without a corresponding main namespace page
  2. Talk pages where the corresponding main namespace page is a redirect
  3. Talk pages that have "/Archive" in their name path
  4. Talk pages where the corresponding main namespace page is a disambiguation page
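The four filters above could be sketched as a single predicate over talk-page records. The field names here (`main_page_id`, `main_is_redirect`, `main_is_disambig`) are illustrative placeholders, not the actual dataset schema:

```python
def keep_talk_page(record):
    """Return True if the talk page should stay in the dataset.

    Field names are hypothetical; in practice they would come from
    a join against the main namespace page table.
    """
    # 1. Drop talk pages without a corresponding main namespace page.
    if record.get("main_page_id") is None:
        return False
    # 2. Drop talk pages whose main namespace page is a redirect.
    if record.get("main_is_redirect"):
        return False
    # 3. Drop archive subpages, e.g. "Talk:Foo/Archive 1".
    if "/Archive" in record.get("title", ""):
        return False
    # 4. Drop talk pages whose main page is a disambiguation page.
    if record.get("main_is_disambig"):
        return False
    return True

records = [
    {"title": "Talk:Foo", "main_page_id": 1,
     "main_is_redirect": False, "main_is_disambig": False},
    {"title": "Talk:Foo/Archive 1", "main_page_id": 1,
     "main_is_redirect": False, "main_is_disambig": False},
    {"title": "Talk:Bar", "main_page_id": None,
     "main_is_redirect": False, "main_is_disambig": False},
]
clean = [r for r in records if keep_talk_page(r)]
```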

Also, consider adding a cutoff date to scripts that gather article assessment data to enable only gathering assessments more recent than a given date. An analysis of the dates of assessments in the current datasets is forthcoming.
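A cutoff-date filter on assessment timestamps could look like the following sketch. The cutoff value is illustrative (pending the timestamp analysis mentioned above), and MediaWiki-style ISO 8601 timestamps are assumed:

```python
from datetime import datetime

# Illustrative cutoff; the actual value depends on the forthcoming
# analysis of assessment dates in the current datasets.
CUTOFF = datetime(2007, 1, 1)

def recent_enough(timestamp_str, cutoff=CUTOFF):
    """Keep only assessments made at or after the cutoff date.

    Assumes MediaWiki-style timestamps like '2011-05-04T12:00:00Z'.
    """
    ts = datetime.strptime(timestamp_str, "%Y-%m-%dT%H:%M:%SZ")
    return ts >= cutoff
```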

Event Timeline


I've gathered revision timestamps for all the revisions in the published dataset, and also checked for redirects. Here are some summaries:

  1. Number of redirects: 11 (at least; this is from a case-insensitive grep for "#redirect" in the revision data)
  2. Number of deleted revisions: 73
  3. Number of unavailable revisions: 0
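The redirect count above comes from a case-insensitive grep, which could be reproduced roughly like this. Note that a stricter check would require `#REDIRECT` at the very start of the wikitext, as MediaWiki does; this sketch keeps the grep's anywhere-in-text semantics:

```python
import re

# Case-insensitive search for "#redirect" anywhere in saved revision
# wikitext, mirroring the grep described above.
REDIRECT_RE = re.compile(r"#redirect", re.IGNORECASE)

def is_redirect(wikitext):
    return bool(REDIRECT_RE.search(wikitext or ""))
```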

I next made a graph of assessments per year; it looks like 2007 is the year when these quality assessments came into common use.

Next I plan to look at my own code as well as the ORES code for data gathering and see where improvements can be made.

@Keer25: Nettrom is the main dev for the wp10 model, and he pointed me here when I raised the issue of the redirect predicted as FA.

@Nettrom we're working a lot with wp10 data for the next few weeks, for Keerthana's GSoC project. Let us know if there are ways we can be helpful, such as reporting other anomalies or scores that don't make sense.

@Nettrom @Halfak The bad scoring for redirects is the biggest problem for the automated suggestions feature we've been building (which was just deployed to production).

Here's another case, with this redirect predicted as B class:

Even if the prediction didn't change, providing that feature in the output (like 'features.redirect': true) would make it easier to work around/with.
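As a client-side workaround until such a feature exists in the output, a consumer could check the wikitext itself before trusting a prediction. A minimal sketch, with a hypothetical score dict shaped like `{'prediction': 'B', ...}`:

```python
def usable_prediction(score, wikitext):
    """Ignore quality predictions for redirects.

    'score' is a hypothetical dict like {'prediction': 'B', ...};
    returns the predicted class, or None for redirects.
    """
    if wikitext.lstrip().lower().startswith("#redirect"):
        return None  # a redirect has no meaningful quality class
    return score.get("prediction")
```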

Another thing I chatted about with Aaron recently is a feature I think might be useful for the model to include, and which is certainly one of the things that manual assessments of article quality often highlight: the number of sections that lack references. Clumps of text without references are an important smell, even if the overall number of refs in the article is high.
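A rough sketch of how such a feature might be extracted: split the wikitext on level-2 headings and count the chunks that contain no `<ref>` tags. This ignores templates like {{sfn}} and nested heading levels, so it is an approximation, not a proposed implementation:

```python
import re

def unreferenced_sections(wikitext):
    """Count sections that contain no <ref> tags.

    Approximation: splits on level-2 headings ("== Foo ==") and
    looks for '<ref' in each resulting chunk.
    """
    sections = re.split(r"^==[^=].*?==\s*$", wikitext, flags=re.MULTILINE)
    return sum(1 for s in sections if s.strip() and "<ref" not in s)

text = (
    "Lead paragraph.<ref>cite</ref>\n"
    "== History ==\n"
    "No citations here.\n"
    "== Reception ==\n"
    "Well cited.<ref>cite</ref>\n"
)
```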

I'm a bit pressed for time at the moment, so to prevent this from stalling I'd like to propose that a first priority is that I try to create a dataset that doesn't have any redirects in it. Given the low number of redirects we have in the dataset, I expect this problem to be minimal if I simply sample a few hundred extra articles in the classes where that is possible. I'll also make sure the dataset doesn't contain any disambiguation pages.

That should at least allow us to train a classifier that doesn't think redirects have a reasonable amount of quality.

If you can get me clean labeled data, I can get the model updated. No problem.

@Nettrom Let us know if you have the time to point us to cleaned data, thanks!

@awight : I was working on this yesterday, but didn't get the dataset ready overnight. The process I have goes as follows:

  1. Sample articles from each assessment class.
  2. Identify the revision where the assessment was applied.
  3. Grab the revisions.

The first stage is fairly fast and has completed. Stage 2 started, but the Toolforge database servers seem to be rather busy at the moment, so the process ran much slower than expected and the script died after 12 hours without even completing the Featured Articles. I'll split the sample up so it can be run in parallel and run it again; I know that lower-quality articles tend to take much less time.
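Splitting the sample for parallel runs could be as simple as chunking the page list and launching one job per chunk. A minimal sketch (chunk count is arbitrary here):

```python
def chunk(items, n_chunks):
    """Split a list into n_chunks roughly equal contiguous parts,
    e.g. so stage 2 can run as one job per chunk."""
    k, m = divmod(len(items), n_chunks)
    return [items[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n_chunks)]

pages = list(range(10))
parts = chunk(pages, 3)
```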

New dataset has now been uploaded to figshare. If this direct link does not work, use this DOI link and download the "2017_english_wikipedia_quality_dataset.tar.bz2" file.

While I have not gotten around to implementing all our data cleaning strategies and creating beautifully even-numbered datasets of articles, the following changes were used to improve the most recent dataset:

  1. Category:All article disambiguation pages has been used to identify disambiguation pages before sampling articles from a given assessment class. This removed a few more pages than just searching for "disambiguation" in page titles would have.
  2. Articles associated with talk pages in categories starting with "List-Class" have been used to remove lists. This removed several hundred list-type pages before sampling articles with a given rating.
  3. The downloaded revision data has been searched to identify redirects, which have then been removed.

The format of the TSVs has changed slightly. Instead of just listing page ID, revision ID, and assessment rating, they now list the associated talk page ID and talk page revision ID as well. This should allow for greater transparency and the ability to check whether our data gathering strategies work, should that be of interest.
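Reading the revised TSV layout could look like the sketch below. The column names and order are illustrative guesses based on the description above; check the actual file headers before relying on them:

```python
import csv
import io

# Hypothetical column order for the revised TSV layout; the real
# files may name or order these fields differently.
FIELDS = ["page_id", "rev_id", "talk_page_id", "talk_rev_id", "rating"]

tsv = "1001\t5000\t2001\t6000\tFA\n1002\t5001\t2002\t6001\tStub\n"
rows = [dict(zip(FIELDS, row))
        for row in csv.reader(io.StringIO(tsv), delimiter="\t")]
```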