Problem:
The ORES revision-scoring tool helps Wikidata editors maintain Wikidata's data. It would benefit from more training data to stay up to date and make better predictions, but the training data we get through the labeling tool for the ORES quality score is not sufficient.
Solution:
We have the opportunity to get 2,000 new revisions categorized by humans. To make this happen, we need to draw a sample of 10,000 revisions as described in the Sampling section below.
Sampling:
- base population: all Wikidata revisions from the last 12 months
- sample: n=10,000 revisions drawn uniformly at random from the base population (see the sketch after the example below)
- data set columns: revision_oldid, item_qid, class_qid, class_en-label, user_isbot (i.e. whether the user is in the bots group)
- file format: CSV (or another shareable text-based format)
Example:
revision_oldid, item_qid, class_qid, class_en-label, user_isbot
1540889995, "Q545724", "Q5", "human", false
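To make the sampling step concrete, here is a minimal Python sketch. It assumes the base population (with the columns above already resolved, e.g. via the Wikidata database replicas) has been exported to a hypothetical file named population.csv; it then draws the uniform sample and writes the CSV in the format shown above. File names and the fixed seed are illustrative assumptions, not a decided implementation.

```
import csv
import random

# Sketch only: population.csv is assumed to contain the full base
# population (all revisions of the last 12 months) with the columns below.
COLUMNS = ["revision_oldid", "item_qid", "class_qid", "class_en-label", "user_isbot"]
SAMPLE_SIZE = 10_000

with open("population.csv", newline="", encoding="utf-8") as f:
    population = list(csv.DictReader(f, skipinitialspace=True))

random.seed(42)  # fixed seed so the draw can be reproduced and audited
sample = random.sample(population, SAMPLE_SIZE)  # uniform, without replacement

with open("revision_sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(sample)
```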
Acceptance criteria:
- File and description shared with @Lydia_Pintscher
- Suggestion for segmentation (see the open questions below) shared with @Lydia_Pintscher
Open questions:
- What is the ORES classification of revisions based on? Is it the oldid, the diff, or something else? What does this mean for the data set we need?
- What would be the optimal way to use these 2,000 human-categorized revisions? Can we focus on hard and useful cases without jeopardizing ORES's balance? Could we, for example, treat bot edits and massive classes like "academic papers" and "astronomical objects" differently (e.g. draw a segmented sample for external evaluation and let ORES self-evaluate the "easy" cases)? A rough sketch of such a segmentation follows below.
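As a starting point for the segmentation question, here is a hedged Python sketch of one possible rule: route bot edits and edits on very large classes into an "easy" bucket for self-evaluation, and spend the human-categorized batch on the rest. The bucket criteria, the class list, and the batch size handling are assumptions for discussion, not a decided policy.

```
import csv
import random

# Hypothetical segmentation rule; MASSIVE_CLASSES is a placeholder list.
MASSIVE_CLASSES = {"Q13442814"}  # e.g. "scholarly article"; extend as needed
HUMAN_BATCH_SIZE = 2_000

with open("revision_sample.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, skipinitialspace=True))

easy, hard = [], []
for row in rows:
    if row["user_isbot"].lower() == "true" or row["class_qid"] in MASSIVE_CLASSES:
        easy.append(row)   # candidate for ORES self-evaluation
    else:
        hard.append(row)   # candidate for human categorization

# Spend the human-categorized batch on the hard segment only.
random.seed(42)
human_batch = random.sample(hard, min(HUMAN_BATCH_SIZE, len(hard)))
print(f"easy: {len(easy)}, hard: {len(hard)}, for human labeling: {len(human_batch)}")
```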