
Draw sample of Wikidata revisions for ORES training data
Closed, Resolved · Public

Description

Problem:
The ORES tool for revisions helps Wikidata editors maintain Wikidata's data. The tool could benefit from more training data to stay up to date and make better predictions. The training data that we get through the labeling tool for the ORES quality score is not sufficient.

Solution:
We have the opportunity to get 2,000 new revisions categorized. To make this happen we need to draw a sample of 10,000 revisions as described in the Sampling section below.

Sampling:

  • basic population: all revisions of the last 12 months (timespan of 1 year)
  • sample: n=10.000 randomly selected revisions of the basic population
  • data set: revision_oldid, item_qid, class_qid, class_en-label, user_isbot (i.e. in bots group)
  • file format: csv (or other sharable text based format)

Example:

revision_oldid, item_qid, class_qid, class_en-label, user_isbot
1540889995, "Q545724", "Q5", "human", false

Acceptance criteria:

Open questions:

  • What is the ORES classification of revisions based on? Is the oldid or the diff or something else? What does this mean for the data set we need?
  • What would be the optimal way to use these 2,000 human-categorized revisions? Can we focus on hard and useful stuff without jeopardizing the balance of ORES? Could we for example treat bot edits and massive classes like "academic papers" and "astronomical objects" differently? (e.g. a segmented sample for external evaluation and self-evaluation of the "easy" stuff)

Event Timeline

Manuel updated the task description.

What would be the optimal way to use these 2,000 human-categorized revisions? Can we focus on hard and useful stuff without jeopardizing the balance of ORES? Could we for example treat bot edits and massive classes like "academic papers" and "astronomical objects" differently? (e.g. a segmented sample for external evaluation and self-evaluation of the "easy" stuff)

Yes, we can just skip edits by bots, as well as instances of scholarly articles and astronomical objects, because they are not what we are looking for when we want ORES to evaluate human edits. This would be different if our goal were to improve the model for Item quality (though even then there would be ways to optimize things).

That being said, after the revisions have been scored by humans, we can still significantly increase the amount of training data we have by using self-training: training ORES with the data, letting it score more revisions, adding the revisions where it is very confident to the training data, and repeating that process as often as needed.
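
For illustration, a rough sketch of that loop with a placeholder classifier (not the actual ORES model or feature code):

```python
# Rough sketch of the self-training loop described above; the classifier,
# features, and confidence threshold are placeholders, not ORES internals.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, max_rounds=5):
    model = GradientBoostingClassifier()
    for _ in range(max_rounds):
        if len(X_unlabeled) == 0:
            break
        model.fit(X_labeled, y_labeled)
        probs = model.predict_proba(X_unlabeled)
        confident = probs.max(axis=1) >= threshold   # very confident predictions
        if not confident.any():
            break
        # Add the confidently scored revisions to the training data ...
        X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
        y_labeled = np.concatenate(
            [y_labeled, model.classes_[probs[confident].argmax(axis=1)]])
        # ... and keep the rest unlabeled for the next round.
        X_unlabeled = X_unlabeled[~confident]
    return model
```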

At least that is what I learned from @Ladsgroup 🙇😊

Yup, but just make sure you add bot edits to the original training data (while labeling them as good automatically). Otherwise a model that hasn't seen those edits would think they are vandalism too. As an analogy: if you exclude all basketball pictures from an image classification model's training data, the final model might mistake a basketball for an orange, because it has never seen a basketball.
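
As a tiny sketch of that rule (column names are assumptions taken from the sample format above):

```python
import pandas as pd

sample = pd.read_csv("sample.csv")
# "true"/"false" strings from the CSV -> booleans
is_bot = sample["user_isbot"].astype(str).str.lower() == "true"

# Bot edits skip manual review and are added back pre-labeled as good,
# so the model still sees them during training.
sample[is_bot].assign(label="good").to_csv("auto_labeled_bot_edits.csv", index=False)

# Only non-bot edits are sent out for human labeling.
sample[~is_bot].to_csv("to_label.csv", index=False)
```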

I wrote a one-off Python script for this, which I ran on stat1007: it uses a SQL query to get 10k random revisions from the last year (restricted to non-redirect items). The "classifying" is a little hacky here, but probably good enough (it's also impossible to get fully right).
Script:
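
A rough sketch of such a script (the replica host, the exact joins, and the bot-flag check here are assumptions, and the hacky class lookup is omitted):

```python
#!/usr/bin/env python3
# Rough sketch: ~10k random revisions from the last 12 months on
# non-redirect items. Host/database names are placeholders; the
# P31-based class lookup is left out.
import csv
from datetime import datetime, timedelta

import pymysql

cutoff = (datetime.utcnow() - timedelta(days=365)).strftime("%Y%m%d%H%M%S")

QUERY = """
SELECT rev_id,
       page_title AS item_qid,
       EXISTS (SELECT 1 FROM user_groups
               WHERE ug_user = actor_user AND ug_group = 'bot') AS user_isbot
FROM revision
JOIN page  ON rev_page  = page_id
JOIN actor ON rev_actor = actor_id
WHERE page_namespace = 0        -- item namespace only
  AND page_is_redirect = 0      -- skip redirects
  AND rev_timestamp >= %s       -- last 12 months
ORDER BY RAND()                 -- simple, good enough at this scale
LIMIT 10000
"""

conn = pymysql.connect(read_default_file="~/.my.cnf",
                       host="replica.example.invalid",  # placeholder replica host
                       database="wikidatawiki",
                       charset="utf8mb4")
try:
    with conn.cursor() as cur, open("sample.csv", "w", newline="") as out:
        cur.execute(QUERY, (cutoff,))
        writer = csv.writer(out)
        writer.writerow(["revision_oldid", "item_qid", "user_isbot"])
        for rev_id, item_qid, user_isbot in cur:
            writer.writerow([rev_id, item_qid, bool(user_isbot)])
finally:
    conn.close()
```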


Results:

The script looks alright to me – I remember reading something about how ORDER BY RAND() isn’t an ideal way to shuffle a collection (especially depending on the sorting algorithm), but it’s probably good enough here.

The script looks alright to me – I remember reading something about how ORDER BY RAND() isn’t an ideal way to shuffle a collection (especially depending on the sorting algorithm), but it’s probably good enough here.

Sadly, AFAIK, MariaDB has no real sampling options (TABLESAMPLE is not implemented), so this is the only thing that came to mind.

An easy-to-implement alternative that I can think of, and that would work with this many rows, would be to just pick random revision IDs in the relevant range (and obviously discard everything that doesn't fit our criteria) until we have collected 10k valid revisions.
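
For illustration, that alternative could look roughly like this (the validity check is a placeholder for the item-namespace / non-redirect / timestamp criteria):

```python
import random

def sample_revisions(min_rev_id, max_rev_id, is_valid, n=10_000):
    """Rejection sampling: draw random revision IDs until n valid ones are found."""
    seen, sample = set(), []
    while len(sample) < n:
        rev_id = random.randint(min_rev_id, max_rev_id)
        if rev_id in seen:
            continue
        seen.add(rev_id)
        if is_valid(rev_id):   # e.g. exists, item namespace, not a redirect
            sample.append(rev_id)
    return sample
```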

Thank you!

@Lucas_Werkmeister_WMDE: Is it fair to assume that the randomization is in this case random enough for our purposes?

Based on the conversation, I would suggest the following:

  1. Let's do the manual evaluation only for edits that were made without a bot flag. This means that before training ORES we would have to add the bot edits back to the dataset (and give them a blanket positive evaluation).
  2. Let's do the manual evaluation for big classes that are mainly edited by tools only at a reduced rate. This might reduce the quality of their ORES evaluations later on, but I hope it will not be an issue. We could also combine this with an approach similar to the one described in 1.

This would give us the following breakdown (a sketch of how to draw it from the 10k sample follows below):

0 edits for isbot = true

30 edits each for

  • scholarly articles
  • stars
  • Wikimedia category
  • Wikimedia disambiguation page

1,880 random edits from other classes
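
A rough sketch of drawing that segmented sample from the 10k CSV (the English class labels and column names here are assumptions):

```python
import pandas as pd

BIG_CLASSES = ["scholarly article", "star",
               "Wikimedia category", "Wikimedia disambiguation page"]

sample = pd.read_csv("sample.csv")
is_bot = sample["user_isbot"].astype(str).str.lower() == "true"
sample = sample[~is_bot]                          # 0 edits for isbot = true

parts = [sample[sample["class_en-label"] == label].sample(30, random_state=42)
         for label in BIG_CLASSES]                # 30 edits each for the big classes

rest = sample[~sample["class_en-label"].isin(BIG_CLASSES)]
parts.append(rest.sample(1880, random_state=42))  # 1,880 edits from other classes

pd.concat(parts).to_csv("segmented_sample.csv", index=False)  # 2,000 in total
```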

Result:

What do you think?

@Lucas_Werkmeister_WMDE: Is it fair to assume that the randomization is in this case random enough for our purposes?

I think so, yes.