Deduplicate triples when loading the wikibase RDF dumps into hive
Closed, Resolved · Public

Assigned To

Authored By

	dcausse
	Jul 12 2021, 7:17 AM

Description

As a user of the Wikidata triples database available in Hive, I want all the triples to be unique so that analyses are more accurate.

Due to how the RDF dumps are generated, they may contain duplicates. After discussing with @JAllemandou, we agreed to do the deduplication early, when importing the data.

AC:

all the triples are unique
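
One way to check this acceptance criterion on a sample of rows is to count how often each triple occurs and report any that occur more than once. The sketch below is only an illustration in plain Python: the helper name `duplicated_triples` and the three-field (subject, predicate, object) tuple layout are assumptions, not the schema of the Hive table.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Assumed layout for illustration: (subject, predicate, object).
TripleRow = Tuple[str, str, str]


def duplicated_triples(triples: Iterable[TripleRow]) -> Dict[TripleRow, int]:
    """Return every triple that occurs more than once, with its occurrence count."""
    counts = Counter(triples)
    return {triple: n for triple, n in counts.items() if n > 1}


sample = [
    ("wd:Q42", "wdt:P31", "wd:Q5"),
    ("wd:Q42", "wdt:P31", "wd:Q5"),  # duplicate of the first row
    ("wd:Q42", "wdt:P21", "wd:Q6581097"),
]

# The criterion holds only when this dictionary is empty.
duplicates = duplicated_triples(sample)
```

An empty result means all triples in the sample are unique.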

Details

	Subject	Repo	Branch	Lines +/-
	bump rdf-spark-tools to 0.3.77	wikimedia/discovery/analytics	master	+2 -2
	Deduplicate wikidata triples	wikidata/query/rdf	master	+56 -2


Related Objects

Status	Assigned	Task
Resolved	AKhatun_WMF	T286436 Deduplicate triples when loading the wikibase RDF dumps into hive
Open	None	T289753 Optimize deduplication of triples when loading into wikibase RDF dumps
Open	None	T289754 Triple level deduplication

Event Timeline

dcausse created this task. Jul 12 2021, 7:17 AM

Restricted Application added a subscriber: Aklapper. Jul 12 2021, 7:17 AM

dcausse renamed this task from Deduplicate tiples when loading wikibase RDF dataset into hive to Deduplicate triples when loading the wikibase RDF dumps into hive. Jul 12 2021, 7:17 AM

Maintenance_bot added a project: Wikidata. Jul 12 2021, 7:45 AM

MPhamWMF moved this task from Incoming to Analysis on the Wikidata-Query-Service board. Jul 12 2021, 3:30 PM

CBogen triaged this task as High priority. Jul 22 2021, 1:39 PM

Change 707051 had a related patch set uploaded (by AKhatun; author: AKhatun):

[wikidata/query/rdf@master] Deduplicate wikidata triples

https://gerrit.wikimedia.org/r/707051

gerritbot added a project: Patch-For-Review. Jul 23 2021, 4:35 AM

Change 707051 merged by jenkins-bot:

[wikidata/query/rdf@master] Deduplicate wikidata triples

https://gerrit.wikimedia.org/r/707051

Maintenance_bot removed a project: Patch-For-Review. Jul 26 2021, 11:10 AM

Joseph will suggest an optimization for this task when he is back. For now, a simple .distinct() is applied to the Spark DataFrame to facilitate analysis of the Wikidata dumps.
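
To make the semantics of that .distinct() step concrete, here is a minimal plain-Python sketch of row-level deduplication, where two rows count as duplicates only when every column matches. It is not the merged patch: the `Triple` type, its field names, and `distinct_triples` are hypothetical, and it runs on an in-memory list instead of a Spark DataFrame.

```python
from typing import Iterable, List, NamedTuple


class Triple(NamedTuple):
    # Hypothetical column names; the actual table schema is not stated in this task.
    subject: str
    predicate: str
    object: str


def distinct_triples(rows: Iterable[Triple]) -> List[Triple]:
    """Keep the first occurrence of each row, comparing all columns."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


rows = [
    Triple("wd:Q42", "wdt:P31", "wd:Q5"),
    Triple("wd:Q42", "wdt:P31", "wd:Q5"),  # exact duplicate, dropped
    Triple("wd:Q42", "wdt:P21", "wd:Q6581097"),
]
unique_rows = distinct_triples(rows)
```

Unlike this sketch, Spark's DataFrame.distinct() makes no guarantee about the order of the returned rows, and it requires a shuffle across the cluster, which may be why a later optimization is mentioned.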

AKhatun_WMF claimed this task. Jul 26 2021, 11:24 AM

Change 708282 had a related patch set uploaded (by DCausse; author: DCausse):

[wikimedia/discovery/analytics@master] bump rdf-spark-tools to 0.3.77

https://gerrit.wikimedia.org/r/708282

gerritbot added a project: Patch-For-Review. Jul 27 2021, 12:25 PM

Change 708282 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] bump rdf-spark-tools to 0.3.77

https://gerrit.wikimedia.org/r/708282

dcausse mentioned this in rWDAN290c90f6afee: bump rdf-spark-tools to 0.3.77. Jul 27 2021, 12:52 PM

dcausse added a project: Discovery-Search (Current work). Jul 27 2021, 12:58 PM

dcausse moved this task from Incoming to Waiting on the Discovery-Search (Current work) board.

Maintenance_bot removed a project: Patch-For-Review. Jul 27 2021, 1:10 PM