
Draw sample of Wikidata revisions for ORES training data
Closed, Resolved · Public

Description

Problem:
The ORES tool for revisions helps Wikidata editors maintain Wikidata's data. The tool could benefit from more training data to stay up to date and make better predictions. The training data that we get through the labeling tool for the ORES quality score is not sufficient.

Solution:
We have the opportunity to get 2,000 new revisions categorized. To make this happen we need to draw a sample of 10,000 revisions as described in the Sampling section below.

Sampling:

  • basic population: all revisions of the last 12 months (timespan of 1 year)
  • sample: n=10.000 randomly selected revisions of the basic population
  • data set: revision_oldid, item_qid, class_qid, class_en-label, user_isbot (i.e. in bots group)
  • file format: csv (or other sharable text based format)

Example:

revision_oldid, item_qid, class_qid, class_en-label, user_isbot
1540889995, "Q545724", "Q5", "human", false

Acceptance criteria:

Open questions:

  • What is the ORES classification of revisions based on? Is the oldid or the diff or something else? What does this mean for the data set we need?
  • What would be the optimal way to use these 2,000 human-categorized revisions? Can we focus on hard and useful stuff without jeopardizing the balance of ORES? Could we for example treat bot edits and massive classes like "academic papers" and "astronomical objects" differently? (e.g. a segmented sample for external evaluation and self-evaluation of the "easy" stuff)

Event Timeline

Manuel updated the task description.

What would be the optimal way to use these 2,000 human-categorized revisions? Can we focus on hard and useful stuff without jeopardizing the balance of ORES? Could we for example treat bot edits and massive classes like "academic papers" and "astronomical objects" differently? (e.g. a segmented sample for external evaluation and self-evaluation of the "easy" stuff)

Yes, we can just skip edits by bots, as well as instances of scholarly articles and astronomical objects, because they are not what we are looking for when we want ORES to evaluate human edits. This would be different if our goal were to improve the model for Item quality (though even then there would be ways to optimize things).

That being said, after the revisions have been scored by humans, we can still significantly increase the amount of training data we have by using self-training: training ORES with the data, letting it score more revisions, adding the revisions where it is very confident to the training data, and repeating that process as often as needed.
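
For illustration, a rough sketch of that loop with a placeholder classifier (not the actual ORES model or feature code):

```python
# Rough sketch of the self-training loop described above; the classifier,
# features, and confidence threshold are placeholders, not ORES internals.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, max_rounds=5):
    model = GradientBoostingClassifier()
    for _ in range(max_rounds):
        if len(X_unlabeled) == 0:
            break
        model.fit(X_labeled, y_labeled)
        probs = model.predict_proba(X_unlabeled)
        confident = probs.max(axis=1) >= threshold   # very confident predictions
        if not confident.any():
            break
        # Add the confidently scored revisions to the training data ...
        X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
        y_labeled = np.concatenate(
            [y_labeled, model.classes_[probs[confident].argmax(axis=1)]])
        # ... and keep the rest unlabeled for the next round.
        X_unlabeled = X_unlabeled[~confident]
    return model
```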

At least that is what I learned from @Ladsgroup 🙇😊

Yup, but just make sure you add bot edits to the original training data (while labeling them as good automatically). Otherwise a model that hasn't seen those edits would think they are vandalism too. As an analogy: if you exclude all basketball pictures from an image classification model's training data, the final model might mistake a basketball for an orange, because it has never seen a basketball.
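
As a tiny sketch of that rule (column names are assumptions taken from the sample format above):

```python
import pandas as pd

sample = pd.read_csv("sample.csv")
# "true"/"false" strings from the CSV -> booleans
is_bot = sample["user_isbot"].astype(str).str.lower() == "true"

# Bot edits skip manual review and are added back pre-labeled as good,
# so the model still sees them during training.
sample[is_bot].assign(label="good").to_csv("auto_labeled_bot_edits.csv", index=False)

# Only non-bot edits are sent out for human labeling.
sample[~is_bot].to_csv("to_label.csv", index=False)
```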

I wrote a one-off Python script for this, which I ran on stat1007: it uses a SQL query to get 10k random revisions from the last year (restricted to non-redirect items). The "classifying" is a little hacky here, but probably good enough (it's also impossible to get fully right).
Script:
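
A rough sketch of such a script (the replica host, the exact joins, and the bot-flag check here are assumptions, and the hacky class lookup is omitted):

```python
#!/usr/bin/env python3
# Rough sketch: ~10k random revisions from the last 12 months on
# non-redirect items. Host/database names are placeholders; the
# P31-based class lookup is left out.
import csv
from datetime import datetime, timedelta

import pymysql

cutoff = (datetime.utcnow() - timedelta(days=365)).strftime("%Y%m%d%H%M%S")

QUERY = """
SELECT rev_id,
       page_title AS item_qid,
       EXISTS (SELECT 1 FROM user_groups
               WHERE ug_user = actor_user AND ug_group = 'bot') AS user_isbot
FROM revision
JOIN page  ON rev_page  = page_id
JOIN actor ON rev_actor = actor_id
WHERE page_namespace = 0        -- item namespace only
  AND page_is_redirect = 0      -- skip redirects
  AND rev_timestamp >= %s       -- last 12 months
ORDER BY RAND()                 -- simple, good enough at this scale
LIMIT 10000
"""

conn = pymysql.connect(read_default_file="~/.my.cnf",
                       host="replica.example.invalid",  # placeholder replica host
                       database="wikidatawiki",
                       charset="utf8mb4")
try:
    with conn.cursor() as cur, open("sample.csv", "w", newline="") as out:
        cur.execute(QUERY, (cutoff,))
        writer = csv.writer(out)
        writer.writerow(["revision_oldid", "item_qid", "user_isbot"])
        for rev_id, item_qid, user_isbot in cur:
            writer.writerow([rev_id, item_qid, bool(user_isbot)])
finally:
    conn.close()
```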


Results:

The script looks alright to me – I remember reading something about how ORDER BY RAND() isn’t an ideal way to shuffle a collection (especially depending on the sorting algorithm), but it’s probably good enough here.

The script looks alright to me – I remember reading something about how ORDER BY RAND() isn’t an ideal way to shuffle a collection (especially depending on the sorting algorithm), but it’s probably good enough here.

Sadly, AFAIK, MariaDB has no real sampling options (TABLESAMPLE is not implemented), so this is the only thing that came to mind.

An easy-to-implement alternative that I can think of, and that would work with this many rows, would be to just pick random revision IDs in the relevant range (and obviously discard everything that doesn't fit our criteria) until we have collected 10k valid revisions.
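
For illustration, that alternative could look roughly like this (the validity check is a placeholder for the item-namespace / non-redirect / timestamp criteria):

```python
import random

def sample_revisions(min_rev_id, max_rev_id, is_valid, n=10_000):
    """Rejection sampling: draw random revision IDs until n valid ones are found."""
    seen, sample = set(), []
    while len(sample) < n:
        rev_id = random.randint(min_rev_id, max_rev_id)
        if rev_id in seen:
            continue
        seen.add(rev_id)
        if is_valid(rev_id):   # e.g. exists, item namespace, not a redirect
            sample.append(rev_id)
    return sample
```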

Thank you!

@Lucas_Werkmeister_WMDE: Is it fair to assume that the randomization is in this case random enough for our purposes?

Based on the conversation, I would suggest the following:

  1. Let's do the manual evaluation only for edits that were made without a bot flag. This means that before training ORES we would have to add the bot edits back to the dataset (and give them a blanket positive evaluation).
  2. Let's do the manual evaluation for big classes that are mainly edited by tools only at a reduced rate. This might reduce the quality of their ORES evaluations later on, but I hope it will not be an issue. We could also combine this with an approach similar to the one described in 1.

This would give us the following breakdown (a sketch of how to draw it from the 10k sample follows below):

0 edits for isbot = true

30 edits each for

  • scholarly articles
  • stars
  • Wikimedia category
  • Wikimedia disambiguation page

1,880 random edits from other classes
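
A rough sketch of drawing that segmented sample from the 10k CSV (the English class labels and column names here are assumptions):

```python
import pandas as pd

BIG_CLASSES = ["scholarly article", "star",
               "Wikimedia category", "Wikimedia disambiguation page"]

sample = pd.read_csv("sample.csv")
is_bot = sample["user_isbot"].astype(str).str.lower() == "true"
sample = sample[~is_bot]                          # 0 edits for isbot = true

parts = [sample[sample["class_en-label"] == label].sample(30, random_state=42)
         for label in BIG_CLASSES]                # 30 edits each for the big classes

rest = sample[~sample["class_en-label"].isin(BIG_CLASSES)]
parts.append(rest.sample(1880, random_state=42))  # 1,880 edits from other classes

pd.concat(parts).to_csv("segmented_sample.csv", index=False)  # 2,000 in total
```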

Result:

What do you think?

@Lucas_Werkmeister_WMDE: Is it fair to assume that the randomization is in this case random enough for our purposes?

I think so, yes.