
Wikidata items touched by humans per class
Open, Medium · Public

Description

@DarTar, in a recent Tweet:

Two @wikidata statistics I’d love to see:

  1. the % of items *by class* that have been “touched” by humans (whether manually edited or via a tool) as opposed to items only touched by bots.
  2. the median number of unique human contributors per item by class.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 14 2018, 10:27 PM

@Lydia_Pintscher @WMDE-leszek So, what is your take on this: go or no go?

Sorry, I've managed to give you the answer via some other channel, my bad. Go is the answer!

GoranSMilovanovic triaged this task as High priority. · May 20 2020, 7:13 PM

Update Tue 26 May 2020 10:40:44 PM UTC:

  1. the % of items *by class* that have been “touched” by humans (whether manually edited or via a tool) as opposed to items only touched by bots.

This is solved for:

  • % of items per class ever touched by humans;
  • ratio of human vs. bot edits per class;
  • ratio of human vs. bot edits per item.

Working on

  2. the median number of unique human contributors per item by class.

now.

Update Wed 27 May 2020 10:06:05 PM UTC

  1. the % of items *by class* that have been “touched” by humans (whether manually edited or via a tool) as opposed to items only touched by bots.

Datasets produced in HDFS with PySpark.

Aklapper removed GoranSMilovanovic as the assignee of this task. · Jun 19 2020, 4:17 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

GoranSMilovanovic added a comment. · Edited · Jul 1 2020, 12:12 AM

Here we go:

  • the dataset:
    • wd_class: Wikidata class
    • label: label (en)
    • num_items: number of items in class
    • human_edited_items: number of items in class ever edited by a human editor
    • percent_items_touched: % of items in class ever edited by a human editor
    • median_unique_editors: median of the number of unique human editors per item, per class
    • human_edits: number of human edits made in this class
    • bot_edits: number of bot edits made in this class
    • total_edits: total number of edits made in this class
    • human_to_bot_ratio: number of human edits / number of bot edits
    • human_percent: % of edits ever made by human editors in this class
    • bot_percent: % of edits ever made by bots in this class

  • the distribution of the median number of human editors per item, per class: [chart not included]

  • the distribution of human to bot edit ratio: [chart not included]
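
For readers who want to see how the dataset columns listed above relate to each other, here is a minimal sketch in plain Python. It is not the PySpark code used for this task. The input records and their field names (wd_class, item, user, is_bot) are hypothetical, and two assumptions are made: the median is taken over all items in a class, including items with no human editor, and num_items is counted from the revisions rather than from the JSON dump.

```python
from collections import defaultdict
from statistics import median

# Hypothetical input: one record per revision, already joined to the item's class.
revisions = [
    {"wd_class": "Q5", "item": "Q42", "user": "Alice", "is_bot": False},
    {"wd_class": "Q5", "item": "Q42", "user": "Bob", "is_bot": False},
    {"wd_class": "Q5", "item": "Q42", "user": "ImportBot", "is_bot": True},
    {"wd_class": "Q5", "item": "Q80", "user": "ImportBot", "is_bot": True},
]


def summarize_classes(revisions):
    """Compute the per-class columns from a list of revision records."""
    # class -> item -> set of unique human editors
    editors = defaultdict(lambda: defaultdict(set))
    human_edits = defaultdict(int)
    bot_edits = defaultdict(int)

    for rev in revisions:
        wd_class = rev["wd_class"]
        # Looking the item up registers it even if only bots ever edited it.
        item_editors = editors[wd_class][rev["item"]]
        if rev["is_bot"]:
            bot_edits[wd_class] += 1
        else:
            human_edits[wd_class] += 1
            item_editors.add(rev["user"])

    summary = {}
    for wd_class, items in editors.items():
        editor_counts = [len(users) for users in items.values()]
        num_items = len(editor_counts)
        human_edited_items = sum(1 for n in editor_counts if n > 0)
        h, b = human_edits[wd_class], bot_edits[wd_class]
        total = h + b
        summary[wd_class] = {
            "num_items": num_items,
            "human_edited_items": human_edited_items,
            "percent_items_touched": 100 * human_edited_items / num_items,
            # Assumption: median over all items, bot-only items count as 0.
            "median_unique_editors": median(editor_counts),
            "human_edits": h,
            "bot_edits": b,
            "total_edits": total,
            # Undefined when a class has no bot edits at all.
            "human_to_bot_ratio": h / b if b else None,
            "human_percent": 100 * h / total,
            "bot_percent": 100 * b / total,
        }
    return summary


print(summarize_classes(revisions)["Q5"])
```

In the toy data, one of the two Q5 items has human editors, so percent_items_touched is 50 even though half of all edits in the class are human edits. The two measures answer different questions.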

The dashboard (prototype) is going online soon.

Note. Everything is based on the 2020-05 snapshot of the wmf.mediawiki_history table, and the 2020-06-15 hdfs version of the Wikidata JSON dump (see: wmf.wikidata_entity table).

GoranSMilovanovic lowered the priority of this task from High to Medium. · Jul 3 2020, 2:46 PM

And the dashboard is live: http://wmdeanalytics.wmflabs.org/WD_HumanEdits/
We still have to decide how often to update the dataset.

@Lydia_Pintscher Please review and let's see if there is anything else that needs to be added or changed here. Thank you!

Lydia_Pintscher closed this task as Resolved. · Fri, Jul 24, 8:46 AM

\o/
Thank you! This looks good.
I have a few questions about the results since they are quite different from what I at least had expected (or I'm interpreting the charts wrong). I'm intrigued. I'll set up a call.

GoranSMilovanovic reopened this task as Open. · Edited · Mon, Jul 27, 12:19 PM

@Lydia_Pintscher @GoranSMilovanovic

  • Reopening the ticket:
  • check the distribution of human vs. bot users with respect to the batch-import tools possibly used by the community

Revision examples:

Focus on:

  • Mix'n'match - check what can be categorized as a human edit,
  • The Wikidata Game - that's a human edit too,
  • but we are pretty sure that these two are tagged as human edits;
  • we are unsure whether QuickStatements edits are tagged as bot or human; they should be tagged as bot.

@Lydia_Pintscher

In the wmf.mediawiki_history table:

  • the revision_tags field was not of much help to resolve our dilemma;
  • I've rather checked the event_user_text field (i.e. the user names) and compared against the respective human/bot tags:
    • QuickStatements - revisions are indeed tagged as bot revisions (QuickStatementsBot).

Now, what I could not check in the table but learned from documentation:

  • Mix'n'match - from https://meta.wikimedia.org/wiki/Mix%27n%27match/Manual: "Whenever you make a connection between a catalogue entry and a Wikidata item, the system will automatically update Wikidata. This will show up as an edit in your contributions." - so Mix'n'match edits are human revisions;
  • Wikidata Game - from https://wikidata-game.toolforge.org/: "The games will edit Wikidata for you, under your user name, via OAuth." - so game edits count as human revisions too.
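
To make the reasoning above concrete, here is a minimal sketch of the classification rule it implies: a revision counts as human or bot according to the account it is saved under, not the tool that produced it. The record layout is hypothetical. user_text mirrors event_user_text, bot_markers is a stand-in for whatever human/bot tag the table attaches to an account, and ExampleUser is an invented user name.

```python
def classify_revision(rev):
    """Return 'bot' or 'human' based on the saving account, not the tool used."""
    # Assumption: a non-empty list of bot markers means a bot account.
    return "bot" if rev["bot_markers"] else "human"


# Hypothetical example records, one per tool discussed in this task.
examples = [
    {"tool": "QuickStatements", "user_text": "QuickStatementsBot", "bot_markers": ["name"]},
    {"tool": "Mix'n'match", "user_text": "ExampleUser", "bot_markers": []},
    {"tool": "The Wikidata Game", "user_text": "ExampleUser", "bot_markers": []},
]

labels = {ex["tool"]: classify_revision(ex) for ex in examples}
print(labels)
```

Under this rule, tool-assisted edits saved under a personal account via OAuth stay in the human counts, while batch edits saved under a dedicated bot account such as QuickStatementsBot go to the bot counts.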

That would imply that what we see on my dashboard - http://wmdeanalytics.wmflabs.org/WD_HumanEdits/ - is a realistic picture indeed.

However, what was confusing on the dashboard was the description below the Proportion of items ever touched by Human Editors chart, which now states:

The Distribution of the proportion of items ever touched by Human Editors. This is the distribution of the proportion of items in a given class that were ever edited by a Human editor, per Wikidata class. The value ranges between 0 and 1. It is obtained in the following way: we count the number of items in a Wikidata class that were ever edited by a Human editor, and then divide that number by the total number of items in that class.