
Finding items on Wikidata that should be merged
Open, Normal, Public

Description

There are many items on Wikidata that should be merged. Help us find the most likely candidates so editors have an easier time going through them. This can then be used, for example, as input for Magnus' merge game.

Info:
If you are interested in working on this at the Wikimedia-Hackathon-2018, @Ladsgroup can help you with any questions :-)

Event Timeline

samuwmde created this task. Feb 19 2016, 4:39 PM
Lydia_Pintscher removed johl as the assignee of this task. Feb 19 2016, 5:10 PM
Lydia_Pintscher added a project: Wikidata.
Lydia_Pintscher removed a subscriber: hoo.
Lydia_Pintscher renamed this task from Finding items that should be merged (general problem -- find missing connections): to Finding items that should be merged. Feb 19 2016, 5:30 PM
Lydia_Pintscher updated the task description.
Lydia_Pintscher updated the task description.

Thanks for the links, folks!

Lydia_Pintscher triaged this task as Normal priority.
Halfak moved this task from Untriaged to Ideas on the Scoring-platform-team board.
johl removed a subscriber: johl. Jun 14 2016, 1:27 PM
Lydia_Pintscher renamed this task from Finding items that should be merged to Finding items on Wikidata that should be merged. Apr 7 2018, 11:30 AM
Bmueller updated the task description. May 16 2018, 12:14 PM
Bmueller added a subscriber: Ladsgroup.
Lahi added a subscriber: Lahi. May 18 2018, 1:15 PM
mforns added a subscriber: mforns. May 23 2018, 3:18 PM

Hey!

As @Ladsgroup knows, I worked on this task during the BCN Hackathon.
It was super-interesting and I learned a lot about Wikidata :]
Thanks for the opportunity!
Here's a summary of what I did, the issues I had, and next steps:

  • After a while reading docs and understanding the basics, I wrote a small bash script to extract the Wikidata items from the dump in /mnt/data/xmldatadumps/public/wikidatawiki/entities/20180514/wikidata-20180514-all.json.gz, abridge their contents (limiting them to: id, type, labels and sitelinks), and finally split them into 1M-line files, to be processed in HDFS/Hadoop in a distributed way. The script is:
nice -n19 ionice -c2 -n7 sh -c "zcat /mnt/data/xmldatadumps/public/wikidatawiki/entities/20180514/wikidata-20180514-all.json.gz | head -n -1 | tail -n +2 | sed 's/,$//' | jq -c 'select(.type == \"item\") | {id, labels: .labels | [keys[] as \$k | [\$k, .[\$k].value]], sitelinks: .sitelinks | [keys[] as \$k | [\$k, .[\$k].title]]}' | split -l 1000000 - ~/wikidata_items_abridged_20180514/part_"
  • Then, I compressed each file separately (Hadoop can only distribute computation over compressed files if each file is compressed separately) and moved them to HDFS: /user/mforns/wikidata_items_abridged_20180514. Actually, I only moved 5 of the 49 files, to avoid computing over the whole data set while developing. But the rest are ready in stat1005:/home/mforns/wikidata_items_abridged_20180514 and can be copied over any time (a sketch of these commands follows below).
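A minimal sketch of that compress-and-upload step, assuming gzip and the standard hdfs dfs CLI (directory names as above):

# Compress each 1M-line part on its own, so Hadoop can split the work per file.
gzip ~/wikidata_items_abridged_20180514/part_*

# Copy the compressed parts into HDFS.
hdfs dfs -mkdir -p /user/mforns/wikidata_items_abridged_20180514
hdfs dfs -put ~/wikidata_items_abridged_20180514/part_*.gz /user/mforns/wikidata_items_abridged_20180514/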
  • I also wrote a Spark/Scala script that reads the item files in HDFS and processes them to find duplicate candidates. The logic identifies items that have identical labels for at least one language, or identical sitelinks for at least one site. Labels or sitelinks of different languages/sites are not compared. As this is executed in the cluster using Spark RDDs (resilient distributed datasets), the algorithm can compare all Wikidata items against each other and output a graph, where the vertices are item IDs (Q12345) and an edge means two items have identical labels/sitelinks. The weight of the edge corresponds to the number of label/sitelink matches between the two items. Here's the code:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession

// An item is (id, labels: language -> value, sitelinks: site -> title).
type Item = (String, Map[String, String], Map[String, String])

def parseItems(
    sourceDirectory: String,
    spark: SparkSession
): RDD[Item] = {
    val schema = StructType(Seq(
        StructField("id", StringType, nullable = false),
        StructField("type", StringType, nullable = false),
        StructField("labels", ArrayType(ArrayType(StringType)), nullable = false),
        StructField("sitelinks", ArrayType(ArrayType(StringType)), nullable = false)
    ))
    val items = spark.read.schema(schema).json(sourceDirectory + "/*").rdd
    items.map(r => (
        r.getString(0),
        r.getSeq(2).asInstanceOf[Seq[Any]].map(e => e.asInstanceOf[Seq[String]]).map(e => e(0) -> e(1)).toMap,
        r.getSeq(3).asInstanceOf[Seq[Any]].map(e => e.asInstanceOf[Seq[String]]).map(e => e(0) -> e(1)).toMap
    ))
}

val items = parseItems("/user/mforns/wikidata_items_abridged_20180514", spark)

// Flatten each item into (language-or-site, text, item id) tuples,
// keeping only labels/sitelinks whose text is longer than 2 characters.
val expressions = items.flatMap { item =>
    (
        item._2.map(label => (label._1, label._2, item._1)) ++
        item._3.map(sitelink => (sitelink._1, sitelink._2, item._1))
    ).filter(e => e._2.size > 2)
}

// Group by (language-or-site, text) and keep only groups with more than one item.
val expressionGroups = (expressions
    .keyBy(e => (e._1, e._2))
    .groupByKey
    .map(g => (g._1, g._2.map(_._3).toSeq.sortBy(id => id)))
    .filter(g => g._2.size > 1))

// Emit one edge per pair of items sharing a given label or sitelink...
val explodedEdges = expressionGroups.flatMap(g => g._2.combinations(2))

// ...and weight each edge by how many labels/sitelinks the pair shares.
val weightedEdges = explodedEdges.keyBy(e => e).groupByKey.map(g => (g._1, g._2.size))

// Keep only pairs that match on at least two labels/sitelinks.
val edges = weightedEdges.filter(e => e._2 > 1)

// Write the weighted edge list as tab-separated text.
edges.map(e => e._1(0) + "\t" + e._1(1) + "\t" + e._2).saveAsTextFile("/user/mforns/duplicate_candidates")

The output looks like this (you can access it in HDFS under /user/mforns/duplicate_candidates):

Q7545947	Q7545948	4
Q2581746	Q3779054	2
Q32850943	Q32851055	2
Q32498252	Q804060	2
Q4451724	Q4451776	5
...

Finally, I wrote a Python script to read that output on a single machine and calculate the graph's connected components. I haven't tested it, but here it is:

import networkx as nx
import sys

G = nx.Graph()

with open(sys.argv[1], 'r') as input_file:
    for line in input_file:
        # The Spark output is tab-separated: item1 <TAB> item2 <TAB> weight.
        v1, v2, w = line.rstrip('\n').split('\t')
        G.add_edge(v1, v2, weight=int(w))

for component in nx.connected_components(G):
    print(component)

This should return all groups of items that are likely to be duplicates (same-label/sitelink duplicates, that is).
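For reference, the last two steps could be run like this (hdfs dfs -getmerge concatenates the Spark output parts locally; find_duplicate_groups.py is just a placeholder name for the script above):

# Merge the Spark output parts into a single local TSV file.
hdfs dfs -getmerge /user/mforns/duplicate_candidates duplicate_candidates.tsv

# Compute and print the connected components (groups of likely duplicates).
python3 find_duplicate_groups.py duplicate_candidates.tsv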

Issues

If you look at the duplicate_candidates files, you can quickly spot false positives. I found two types:

  • Disambiguation pages: they have the same label as the pages they disambiguate, and thus get flagged as duplicates, but they are not. To fix this, we should look into the statements section of the item's data. However, that section was not in the abridged version of the data I was using, so I didn't work on this (see the extraction sketch after this list).
  • Different locations with the same name: for example, Q19468507 and Q19468544 have identical labels, but are different streets in the Netherlands. To fix this we would also need to look into statements (e.g. the postal code).
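One possible way to handle the disambiguation-page case at extraction time, sketched here and untested: drop items whose P31 (instance of) claims include Q4167410 (assuming that is the "Wikimedia disambiguation page" class). Only the jq filter changes; the rest of the pipeline stays the same.

# Skip items that are instances of Q4167410 (Wikimedia disambiguation page).
zcat /mnt/data/xmldatadumps/public/wikidatawiki/entities/20180514/wikidata-20180514-all.json.gz \
  | head -n -1 | tail -n +2 | sed 's/,$//' \
  | jq -c 'select(.type == "item")
           | select([.claims.P31[]?.mainsnak.datavalue.value.id?] | index("Q4167410") | not)
           | {id, labels: .labels | [keys[] as $k | [$k, .[$k].value]],
              sitelinks: .sitelinks | [keys[] as $k | [$k, .[$k].title]]}' \
  | split -l 1000000 - ~/wikidata_items_abridged_20180514/part_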

Next steps

  • Reimport all the data without abridging it. It's not too big for the Hadoop cluster to handle. However, it must be split and compressed in chunks.
  • Modify the Scala/Spark code to consider statements (maybe also descriptions?).
  • If we reach a level where there are few enough false positives, we could productionize this and let it run every week, with each new Wikidata dump?

Cheers!

@mforns Nice work – thank you for sharing! Another exclusion that should really be applied to avoid false positives: pairs of items linked by P1889 (different from). This will eliminate only a much smaller number of false positives, but a set of very important ones: items that also look identical to humans who aren't paying close attention!
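A rough sketch of how that subtraction could be done with the same tooling (untested; the file names are placeholders, and P1889 targets are read from mainsnak.datavalue.value.id like any other item-valued claim): first extract all P1889 pairs from the dump, then drop candidate edges whose pair appears in that list.

# Extract (item, different-from item) pairs from the dump as tab-separated lines.
zcat /mnt/data/xmldatadumps/public/wikidatawiki/entities/20180514/wikidata-20180514-all.json.gz \
  | head -n -1 | tail -n +2 | sed 's/,$//' \
  | jq -r 'select(.type == "item")
           | .id as $id
           | .claims.P1889[]?.mainsnak.datavalue.value.id?
           | select(. != null)
           | [$id, .] | @tsv' > different_from_pairs.tsv

# Drop candidate edges whose pair is marked as "different from" (in either order).
awk -F'\t' 'NR==FNR { skip[$1 FS $2] = 1; skip[$2 FS $1] = 1; next }
            !(($1 FS $2) in skip)' different_from_pairs.tsv duplicate_candidates.tsv > filtered_candidates.tsv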

@MichaelSchoenitzer_WMDE
Oh, cool. Yea, definitely useful. Thanks!

Harej moved this task from Ideas to Epic on the Scoring-platform-team board. Apr 3 2019, 4:21 AM