
Deduplicate triples when loading the wikibase RDF dumps into hive
Closed, Resolved (Public)

Description

As a user of the Wikidata triples database available in Hive, I want all the triples to be unique so that analyses are more accurate.

Because of how the RDF dumps are generated, they may contain duplicates. After discussing with @JAllemandou, we agreed to deduplicate early, when importing the data.

AC:

  • all the triples are unique

Event Timeline

dcausse renamed this task from Deduplicate tiples when loading wikibase RDF dataset into hive to Deduplicate triples when loading the wikibase RDF dumps into hive. Jul 12 2021, 7:17 AM
CBogen triaged this task as High priority. Jul 22 2021, 1:39 PM

Change 707051 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Deduplicate wikidata triples

https://gerrit.wikimedia.org/r/707051

Change 707051 merged by jenkins-bot:

[wikidata/query/rdf@master] Deduplicate wikidata triples

https://gerrit.wikimedia.org/r/707051

Joseph will suggest an optimization for this task when he is back. For now, a simple .distinct() is applied to the Spark DataFrame to facilitate analysis of the Wikidata dumps.
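
For readers who want to see what row-level deduplication means for triples, here is a minimal plain-Python sketch. It is not the code from the merged patch: the Triple type, its field names, the helper functions, and the sample rows are all hypothetical and chosen only for illustration. It keeps one copy of each identical (subject, predicate, object) row, which is the effect .distinct() has on a Spark DataFrame, and adds a check that mirrors the acceptance criterion.

```python
from typing import Iterable, List, NamedTuple


class Triple(NamedTuple):
    # Field names are illustrative assumptions, not the actual Hive schema.
    subject: str
    predicate: str
    object: str


def deduplicate(triples: Iterable[Triple]) -> List[Triple]:
    """Keep the first occurrence of each triple, preserving input order."""
    seen = set()
    unique = []
    for triple in triples:
        if triple not in seen:
            seen.add(triple)
            unique.append(triple)
    return unique


def all_unique(triples: Iterable[Triple]) -> bool:
    """Acceptance check: no triple appears more than once."""
    items = list(triples)
    return len(items) == len(set(items))


# Hypothetical sample rows; the third repeats the first exactly.
sample = [
    Triple("wd:Q42", "rdfs:label", '"Douglas Adams"@en'),
    Triple("wd:Q42", "wdt:P31", "wd:Q5"),
    Triple("wd:Q42", "rdfs:label", '"Douglas Adams"@en'),
]

deduped = deduplicate(sample)
```

Unlike this sketch, Spark's .distinct() makes no guarantee about row order; preserving the input order here is only a convenience for reading the result.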

Change 708282 had a related patch set uploaded (by DCausse; author: DCausse):

[wikimedia/discovery/analytics@master] bump rdf-spark-tools to 0.3.77

https://gerrit.wikimedia.org/r/708282

Change 708282 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] bump rdf-spark-tools to 0.3.77

https://gerrit.wikimedia.org/r/708282