
Generate inputs for 1st sensemaking session about ORES quality score distributions across the Wikidata classes
Open, Needs Triage, Public

Description

To understand the distribution of quality in Wikidata, we need to improve upon the existing ORES quality datasets by joining in even more information than re-use statistics and the number of items per quality class. In particular, we need to understand the distribution of the ORES quality scores across the content in Wikidata. To establish such a distribution, we will be joining in data on set relations and mereological relations from the Wikidata JSON dump to the ORES quality prediction scores that we already produce and use in our analytics.

The goal is to provide a set of actionable insights that could be shared with the community on which classes are critical in terms of item quality and where improvements are necessary. We also hope to be able to derive a more strategic insight into the possible future evolution of item quality in Wikidata, given its current state, which we want to establish in this ticket.

Based on the Wikidata ORES Quality Report in Wikidata Analytics:

  • update the ORES quality prediction scores,
  • update the Wikidata ORES Quality Report in Wikidata Analytics,
  • get all P31, P279, and P361 classes of the items for which we have ORES prediction scores,
  • establish the quality score distributions per class.
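
A minimal sketch of the class-extraction step, assuming each record is one entity in the standard Wikidata JSON dump layout (`claims` → property → statements → `mainsnak` → `datavalue` → `value` → `id`). The function name `extract_classes` and the sample entity are hypothetical and only illustrate the idea; the actual pipeline may work differently.

```python
# Properties whose targets are treated as "classes" of an item.
CLASS_PROPERTIES = ("P31", "P279", "P361")


def extract_classes(entity):
    """Return the set of QIDs an entity points to via P31, P279, or P361."""
    classes = set()
    claims = entity.get("claims", {})
    for prop in CLASS_PROPERTIES:
        for statement in claims.get(prop, []):
            mainsnak = statement.get("mainsnak", {})
            # "novalue" and "somevalue" snaks carry no datavalue; skip them.
            if mainsnak.get("snaktype") != "value":
                continue
            value = mainsnak.get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and "id" in value:
                classes.add(value["id"])
    return classes


# Hypothetical, heavily trimmed entity (not a real dump record).
sample_entity = {
    "id": "Q1001",
    "claims": {
        "P31": [{"mainsnak": {"snaktype": "value",
                              "datavalue": {"value": {"id": "Q5"}}}}],
        "P279": [{"mainsnak": {"snaktype": "value",
                               "datavalue": {"value": {"id": "Q2002"}}}}],
        "P361": [{"mainsnak": {"snaktype": "somevalue"}}],
    },
}

sample_classes = extract_classes(sample_entity)
```

Returning a set means an item that reaches the same class through more than one property is listed for that class only once.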

Acceptance criteria:

We have the first inputs for our next sensemaking session (where we will decide on the next steps):

  • first exploratory data analysis
  • including a shareable dataset
  • ideas about possible next steps (towards a better understanding of the current distribution of the ORES quality scores across Wikidata’s classes)

Next iteration:

  • P31, P279, and P361 ?
  • ORES per class in Human vs Bots Statistics
  • add other predictors for ORES

Event Timeline

Manuel renamed this task from Establish the ORES quality score distributions across the Wikidata classes to Generate inputs for joint sensemaking session about ORES quality score distributions across the Wikidata classes. Jun 29 2021, 7:46 AM
Manuel updated the task description. (Show Details)

@Ladsgroup Hey, the latest Wikidata ORES quality snapshot in https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wikidata/WD_QualitySnapshots/ is

  • wikidata_quality_snapshot_202009.tsv.gz, produced on 2020-10-13 00:00.

We would need a fresh update of ORES scores for Wikidata for this task. Is it doable? Thanks.

It is in my directory in stat machine, I don't know why it's not moved over. I'll look into it.

@Ladsgroup

It is in my directory in stat machine, I don't know why it's not moved over. I'll look into it.

Thank you!

@Manuel

  • Fresh ORES quality scores (2021/06) are obtained from Amir;
  • An update of the fundamental Wikidata ORES Quality dataset (items x quality score + reuse + latest revision) is produced;
  • An items x P31|P279|P361 classes dataset is under production;
  • Next steps: (1) join items x scores x classes (possibly long data representation/a nasty join; either Apache Spark or {data.table} in RAM operation from the Analytics Clients) → (2) classes x scores (wide data representation) → (3) pick a clustering procedure → (4) cluster (again Apache Spark or in RAM R procedure).

Update 20210630

  • join items x scores x classes: done
    • all items with missing ORES predictions were filtered out;
    • all duplicated set-theoretic/mereological relations were collapsed into one (e.g. if an item refers to a class via both P31 and P279, or via both P31 and P361, then we count that item's contribution to the overall class quality once, not twice);
    • we are talking about 80,236,080 items assigned to 472,035 classes under analysis.
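
The join described above can be sketched as follows. This is an illustration in plain Python on toy data, not the Spark or {data.table} code actually used; the names `scores`, `relations`, and `join_items_scores_classes` are hypothetical.

```python
# Hypothetical toy inputs.
# item -> ORES quality class; None marks a missing prediction.
scores = {"Q1": "A", "Q2": "C", "Q3": None}

# (item, property, class) triples taken from P31 / P279 / P361 statements.
relations = [
    ("Q1", "P31", "Q5"),
    ("Q1", "P279", "Q5"),   # same item/class pair via a second property
    ("Q2", "P31", "Q5"),
    ("Q2", "P361", "Q9"),
    ("Q3", "P31", "Q5"),    # dropped: Q3 has no ORES prediction
]


def join_items_scores_classes(scores, relations):
    """Join items to classes and scores, counting each item once per class."""
    # A set of (item, class) pairs removes duplicates across properties.
    pairs = {
        (item, cls)
        for item, _prop, cls in relations
        if scores.get(item) is not None
    }
    return sorted((item, cls, scores[item]) for item, cls in pairs)


joined = join_items_scores_classes(scores, relations)
```

Here `Q1` contributes to `Q5` once even though it is linked through both P31 and P279, and `Q3` is filtered out because its score is missing.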

Next steps:

(2) classes x scores (wide data representation: this might not be necessary, depends upon the decision in (3)) →
(3) decide upon a clustering procedure →
(4) cluster (either Apache Spark MLlib or an in RAM R procedure from the Analytics Clients).

Hi Goran, thx for the update! What would you cluster by? What additional information could we join in? (I was thinking about some user and/or edit data like last edited, number of unique users, number of edits etc. that could give meaningful clusters.)

Maybe let's quickly talk about this in our 1:1?

@Manuel

Maybe let's quickly talk about this in our 1:1?

Of course.

What would you cluster by?

Well, I guess in the beginning it would only be a matrix of (1) Wikidata classes x (2) the counts of ORES A, B, C, D, E scored items per class. That would be the most straightforward exploration of the distribution of ORES quality scores across the classes, and it would help us pile up at least some of those half million classes together in (hopefully) meaningful groups : )
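
As a rough illustration of that matrix, the sketch below counts A to E items per class on toy data. It is written in plain Python with hypothetical names (`class_score_matrix`, `rows`) and is not the code used in the analysis.

```python
from collections import Counter, defaultdict

GRADES = ("A", "B", "C", "D", "E")


def class_score_matrix(rows):
    """Build {class: [count_A, count_B, count_C, count_D, count_E]}."""
    counts = defaultdict(Counter)
    for _item, cls, grade in rows:
        counts[cls][grade] += 1
    # A Counter returns 0 for grades that never occur in a class.
    return {cls: [c[g] for g in GRADES] for cls, c in counts.items()}


# Hypothetical (item, class, ORES grade) rows.
rows = [
    ("Q1", "Q5", "A"),
    ("Q2", "Q5", "C"),
    ("Q2", "Q9", "C"),
    ("Q4", "Q5", "C"),
]

matrix = class_score_matrix(rows)
```

Each class becomes one row of five integers, which is the wide representation a clustering procedure would take as input.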

What additional information could we join in? (I was thinking about some user and/or edit data like last edited, number of unique users, number of edits etc. that could give meaningful clusters.)

All that you are saying makes sense, except that I would not try to solve a more complicated problem (ORES scores + additional information on Wikidata classes --> clusters) before the already very complicated problem (ORES scores --> clusters) is solved. As I hope to be able to explain in our 1:1 today, clustering 472,035 Wikidata classes across five simple integer observations (A, B, C, D, E) already presents a challenge. So my suggestion would be to start small.

@GoranSMilovanovic: Agreed! Let's make one step after the other! :)

@Manuel

(1) classes x scores (wide data representation) → we have this data representation now

(2) Let's make a choice of a clustering algorithm, candidate no.1: K-means in Apache Spark's MLlib.

@Manuel As we agreed in our 1:1 today:

  • prioritizing Exploratory Data Analysis/Hypothesis Generation over clustering;
  • let's first see what insights we can get before making any modeling assumptions.
Manuel renamed this task from Generate inputs for joint sensemaking session about ORES quality score distributions across the Wikidata classes to Generate inputs for sensemaking session about ORES quality score distributions across the Wikidata classes. Jul 1 2021, 7:29 AM
Manuel renamed this task from Generate inputs for sensemaking session about ORES quality score distributions across the Wikidata classes to Generate inputs for 1st sensemaking session about ORES quality score distributions across the Wikidata classes.
Manuel updated the task description. (Show Details)

@Manuel

Please take a look at the following report if you find some time before our 1:1 at 14:30 CET today:

I will walk you through this material and help you single out the most important findings (they are not numerous, as you will see).

I will also try to find some time for post hoc analyses following our 1:1.

@Manuel

Here is my current take on

ideas about possible next steps (towards a better understanding of the current distribution of the ORES quality scores across Wikidata’s classes)

  • Gather potential explanatory variables and model their influence upon ORES scores, e.g. number of edits per WD class, proportion of human vs bot edits, how frequently the class items were revised...
  • Separate quality assessment for (a) classes that were predominantly edited by bots vs classes that were predominantly edited by human editors, and maybe (b) classes that predominantly result from mass imports vs "spontaneously grown" classes?
  • Describe clusters of Wikidata classes by higher-level classes in the ontology (i.e. what is found in their P31/P279 paths towards entity), but this might be tricky to obtain
  • I was never able to figure out precisely the size and characteristics of the ORES training set for Wikidata; however, I wonder if training separate quality models (via boosted trees, as in ORES, or otherwise) for different large Wikidata classes would make more sense than training one model to predict the quality of just any item in some general framework

@Ladsgroup @Tobi_WMDE_SW Could you please help me find the exact training set used to tune ORES for Wikidata items? I would need to develop a detailed understanding of the feature engineering process in the first place.

I mean exactly the dataset produced by the campaign found on https://ores-support-checklist.toolforge.org/ and from there https://labels.wmflabs.org/stats/wikidatawiki/81, where we see that data collection started in 2018, with 7079 labels expected and 5200 collected so far. If there is a newer training labels dataset, I would also appreciate knowing where it lives. Thank you.

@Manuel

For our 1:1 this morning, an updated report, and as discussed in our previous 1:1:

  • section 2.5 ORES quality in Human (Q5),
  • section 2.7 The distribution of ORES scores in the remaining Wikidata classes (Wikidata - (Astronomical Object + Scholarly Article)): including only classes w. >= 1000 items
  • section 2.8 The distribution of ORES scores in the remaining Wikidata classes (Wikidata - (Astronomical Object + Scholarly Article)): including only classes w. < 1000 items

  • Next step: ORES per class in Human vs Bots Statistics.

@Manuel

  • A new dataset is produced, encompassing the following fields:
    • class: a Wikidata class
    • num_items: number of items in the class (via instanceOf, subclassOf, or partOf)
    • avg_score: the average ORES score in this class (A = 5, B = 4, C = 3, D = 2, E = 1)
    • med_score: the median ORES score in this class
    • sum_reuse: the sum of WDCM re-use statistics for all items in this class (the class "total reuse")
    • avg_reuse: the average of WDCM re-use statistics for all items in this class (the class "mean reuse")
    • med_reuse: the median of WDCM re-use statistics for all items in this class (the class "median reuse")
    • num_reused: the number of re-used items in this class
    • last_revision: the timestamp of the latest revision made on any item in this class
    • human_edits: the total number of human edits made on the items in this class
    • bot_edits: the total number of bot edits made on the items in this class.

All statistics are based on the latest available Wikidata JSON dump snapshots in hdfs and the latest snapshot of the wmf.mediawiki_history table.
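
To make the score fields concrete, here is a minimal sketch of how num_items, avg_score, and med_score could be derived for one class using the A = 5 ... E = 1 coding given above. The helper `summarize_class` and the input grades are hypothetical; the actual dataset was computed from the HDFS snapshots, not with this code.

```python
from statistics import mean, median

# Numeric coding of ORES quality classes, as stated in the field list.
GRADE_TO_SCORE = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}


def summarize_class(grades):
    """Summarize the ORES grades of all items in a single class."""
    scores = [GRADE_TO_SCORE[g] for g in grades]
    return {
        "num_items": len(scores),
        "avg_score": mean(scores),
        "med_score": median(scores),
    }


# Hypothetical class with four scored items.
example = summarize_class(["A", "C", "E", "E"])
```

The average and the median can differ noticeably in skewed classes, which is one reason to keep both fields.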

The previously encountered discrepancy in the number of classes present in the data quality (ORES) dataset and the Human vs Bot edits dataset is resolved.

The dataset .csv is large and will be shared via Google Drive.