
Wikidata items touched by humans per class
Closed, Resolved (Public)

Description

@DarTar, in a recent Tweet:

Two @wikidata statistics I’d love to see:

  1. the % of items *by class* that have been “touched” by humans (whether manually edited or via a tool) as opposed to items only touched by bots.
  2. the median number of unique human contributors per item by class.

Event Timeline

Sorry, I've managed to give you the answer via some other channel, my bad. Go is the answer!

Update Tue 26 May 2020 10:40:44 PM UTC:

  1. the % of items *by class* that have been “touched” by humans (whether manually edited or via a tool) as opposed to items only touched by bots.

This is solved for:

  • % of items per class ever touched by humans;
  • ratio of human vs. bot edits per class;
  • ratio of human vs. bot edits per item.

Working on

  2. the median number of unique human contributors per item by class.

now.

Update Wed 27 May 2020 10:06:05 PM UTC

  1. the % of items *by class* that have been “touched” by humans (whether manually edited or via a tool) as opposed to items only touched by bots.

Datasets produced in HDFS, using PySpark.

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

Here we go:

  • the dataset:
    • wd_class: Wikidata class
    • label: label (en)
    • num_items: number of items in class
    • human_edited_items: number of items in class ever edited by a human editor
    • percent_items_touched: % of items in class ever edited by a human editor
    • median_unique_editors: median of the number of unique human editors per item, per class
    • human_edits: number of human edits made in this class
    • bot_edits: number of bot edits made in this class
    • total_edits: total number of edits made in this class
    • human_to_bot_ratio: number of human edits / number of bot edits
    • human_percent: % of edits ever made by human editors in this class
    • bot_percent: % of edits ever made by bots in this class

  • the distribution of the median number of human editors per item, per class: [chart]

  • the distribution of human to bot edit ratio: [chart]

The dashboard (prototype) is going online soon.

Note: everything is based on the 2020-05 snapshot of the wmf.mediawiki_history table and the 2020-06-15 HDFS version of the Wikidata JSON dump (see the wmf.wikidata_entity table).
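
To make the column definitions above concrete, here is a minimal sketch in plain Python that derives the same per-class fields from a toy list of revisions. It is not the PySpark job that produced the dataset: the input tuples, the `class_stats` helper, and the `is_bot` flag are hypothetical, and it assumes that items with no human editor count as 0 when taking the median, which the thread does not specify.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy input, one tuple per revision:
# (wd_class, item_id, user_name, is_bot). These names are illustrative only.
REVISIONS = [
    ("Q5", "Q42", "Alice", False),
    ("Q5", "Q42", "Bob", False),
    ("Q5", "Q42", "SomeBot", True),
    ("Q5", "Q80", "SomeBot", True),
    ("Q13442814", "Q100", "SomeBot", True),
    ("Q13442814", "Q101", "Alice", False),
]


def class_stats(revisions):
    """Aggregate per-class edit statistics from revision tuples."""
    items = defaultdict(set)            # class -> items seen in that class
    human_editors = defaultdict(set)    # (class, item) -> human user names
    human_edits = defaultdict(int)      # class -> number of human edits
    bot_edits = defaultdict(int)        # class -> number of bot edits

    for wd_class, item, user, is_bot in revisions:
        items[wd_class].add(item)
        if is_bot:
            bot_edits[wd_class] += 1
        else:
            human_edits[wd_class] += 1
            human_editors[(wd_class, item)].add(user)

    stats = {}
    for wd_class, class_items in items.items():
        # Assumption: items never edited by a human contribute 0 to the median.
        per_item = [len(human_editors.get((wd_class, i), ())) for i in class_items]
        human = human_edits[wd_class]
        bot = bot_edits[wd_class]
        total = human + bot
        touched = sum(1 for n in per_item if n > 0)
        stats[wd_class] = {
            "num_items": len(class_items),
            "human_edited_items": touched,
            "percent_items_touched": 100.0 * touched / len(class_items),
            "median_unique_editors": median(per_item),
            "human_edits": human,
            "bot_edits": bot,
            "total_edits": total,
            # Undefined when a class has no bot edits at all.
            "human_to_bot_ratio": human / bot if bot else None,
            "human_percent": 100.0 * human / total,
            "bot_percent": 100.0 * bot / total,
        }
    return stats


STATS = class_stats(REVISIONS)
```

In the real data the list of items per class would come from the entity dump rather than from the revisions themselves, so a class can contain items that never appear in this kind of input.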

GoranSMilovanovic lowered the priority of this task from High to Medium. Jul 3 2020, 2:46 PM

And the dashboard is live: http://wmdeanalytics.wmflabs.org/WD_HumanEdits/
We still have to see how often to update the dataset.

@Lydia_Pintscher Please review and let's see if there is anything else that needs to be added or changed here. Thank you!

\o/
Thank you! This looks good.
I have a few questions about the results since they are quite different from what I at least had expected (or I'm interpreting the charts wrong). I'm intrigued. I'll set up a call.

@Lydia_Pintscher @GoranSMilovanovic

  • Opening the ticket:
  • check the distribution of human vs. bot users with respect to the batch-import tools possibly used by the community

Revision examples:

Focus on:

  • Mix'n'match - check what can be categorized as a human edit,
  • The Wikidata Game - that's a human edit too,
  • but we are pretty sure that these two are tagged as human edits;
  • we are unsure whether QuickStatements edits are tagged as bot or human; they should be tagged as bot.

@Lydia_Pintscher

In the wmf.mediawiki_history table:

  • the revision_tags field was not of much help to resolve our dilemma;
  • I've rather checked the event_user_text field (i.e. the user names) and compared against the respective human/bot tags:
    • QuickStatements - revisions are indeed tagged as bot revisions (QuickStatementsBot).
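
The check described above (comparing user names against the human/bot tag) can be sketched roughly as follows. This is only an illustration in plain Python: the rows are invented, `tally_by_user` is a hypothetical helper, and the boolean `is_bot` stands in for whatever human/bot tag the table actually provides.

```python
from collections import defaultdict

# Hypothetical rows: (event_user_text, is_bot). The values are made up.
ROWS = [
    ("QuickStatementsBot", True),
    ("QuickStatementsBot", True),
    ("Alice", False),
    ("Alice", False),
    ("Bob", False),
]


def tally_by_user(rows):
    """Count, per user name, how many revisions are flagged bot vs. human."""
    tally = defaultdict(lambda: {"bot": 0, "human": 0})
    for user, is_bot in rows:
        tally[user]["bot" if is_bot else "human"] += 1
    return dict(tally)


TALLY = tally_by_user(ROWS)
```

A user name whose revisions all land in the "bot" column, like QuickStatementsBot in this toy input, is one whose edits are counted as bot edits.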

Now, what I could not check in the table but learned from documentation:

  • Mix'n'match - from https://meta.wikimedia.org/wiki/Mix%27n%27match/Manual: "Whenever you make a connection between a catalogue entry and a Wikidata item, the system will automatically update Wikidata. This will show up as an edit in your contributions." - so Mix'n'match edits are human revisions;
  • Wikidata Game - from https://wikidata-game.toolforge.org/: "The games will edit Wikidata for you, under your user name, via OAuth." - so game edits are considered human revisions too.

That would imply that what we see on my dashboard - http://wmdeanalytics.wmflabs.org/WD_HumanEdits/ - is a realistic picture indeed.

However, what was confusing on the dashboard was the description below the Proportion of items ever touched by Human Editors chart, which now states:

The Distribution of the proportion of items ever touched by Human Editors. This is the distribution of the proportion of items in a given class that were ever edited by a Human editor, per Wikidata class. The value ranges between 0 and 1. It is obtained in the following way: we count the number of items in a Wikidata class that were ever edited by a Human editor, and then divide that number by the total number of items in that class.
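
The computation in that description can be sketched as follows: the proportion for one class, and a simple binning of the per-class proportions into a histogram. The helper names, the bin count, and the sample values are hypothetical and are not taken from the dashboard.

```python
from collections import Counter


def proportion_touched(human_edited_items, num_items):
    """Share of items in a class ever edited by a human, between 0 and 1."""
    if num_items <= 0:
        raise ValueError("a class needs at least one item")
    return human_edited_items / num_items


def proportion_histogram(proportions, bins=10):
    """Count how many classes fall into each equal-width bin of [0, 1]."""
    counts = Counter()
    for p in proportions:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"proportion out of range: {p}")
        # A proportion of exactly 1.0 goes into the last bin.
        counts[min(int(p * bins), bins - 1)] += 1
    return [counts[i] for i in range(bins)]


# Made-up per-class proportions, for illustration only.
SAMPLE = [1.0, 0.98, 0.95, 0.5, 0.02]
HIST = proportion_histogram(SAMPLE)
```

Each position in the returned list is the number of classes whose proportion falls in that tenth of the 0 to 1 range, which is the kind of shape the chart displays.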

> @Lydia_Pintscher
>
> In the wmf.mediawiki_history table:
>
>   • the revision_tags field was not of much help to resolve our dilemma;
>   • I've rather checked the event_user_text field (i.e. the user names) and compared against the respective human/bot tags:
>     • QuickStatements - revisions are indeed tagged as bot revisions (QuickStatementsBot).

Right. But not all QuickStatements edits are done through this bot. It can also run in a mode where the edits are made under the editor's account. One example would be this edit: https://www.wikidata.org/w/index.php?title=Q1675403&diff=prev&oldid=1256890335 And I _think_ then it's not counted as a bot edit.

> Now, what I could not check in the table but learned from documentation:
>
>   • Mix'n'match - from https://meta.wikimedia.org/wiki/Mix%27n%27match/Manual: "Whenever you make a connection between a catalogue entry and a Wikidata item, the system will automatically update Wikidata. This will show up as an edit in your contributions." - so Mix'n'match edits are human revisions;
>   • Wikidata Game - from https://wikidata-game.toolforge.org/: "The games will edit Wikidata for you, under your user name, via OAuth." - so game edits are considered human revisions too.

Yeah and I think those are fine and shouldn't have a major influence one way or another.

> That would imply that what we see on my dashboard - http://wmdeanalytics.wmflabs.org/WD_HumanEdits/ - is a realistic picture indeed.
>
> However, what was confusing on the dashboard was the description below the Proportion of items ever touched by Human Editors chart, which now states:
>
> The Distribution of the proportion of items ever touched by Human Editors. This is the distribution of the proportion of items in a given class that were ever edited by a Human editor, per Wikidata class. The value ranges between 0 and 1. It is obtained in the following way: we count the number of items in a Wikidata class that were ever edited by a Human editor, and then divide that number by the total number of items in that class.

Cool! So looking at the chart then I take from it: For most classes almost all Items have had at least one human edit. For a small number of classes most Items did not ever get touched by a human. Correct?

@Lydia_Pintscher

> Cool! So looking at the chart then I take from it: For most classes almost all Items have had at least one human edit. For a small number of classes most Items did not ever get touched by a human. Correct?

Correct.

@Lydia_Pintscher What is the status of this ticket - do we need anything else here?

No, I think we can close this \o/