Import wikidata RDF dump to hadoop
Closed, Resolved · Public
Assigned To

	• JAllemandou

Authored By

	• dcausse
	Dec 19 2019, 10:02 AM

Description

We currently have no easy way to run large-scale analysis on the Wikidata graph. WDQS and Blazegraph are not suited to this scenario; Hadoop seems to be a better fit. After discussing with @JAllemandou, we believe that a simple Parquet file of quads might be sufficient for now.
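
To make the "file with quads" idea concrete, here is a minimal sketch of what one row of such a table could look like and how a simple aggregate might run over it. The column names (context, subject, predicate, object), the `Quad` type, the `parse_statements` helper, and the sample statements are all illustrative assumptions, not the schema or code of the actual converter. Writing the rows out as Parquet (for example with Spark) is left out so the sketch stays standard-library only.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Quad(NamedTuple):
    """One row of a hypothetical quad table (column names are assumptions)."""
    context: str    # graph or source the statement belongs to
    subject: str
    predicate: str
    object: str


def parse_statements(lines: Iterable[str], context: str) -> Iterator[Quad]:
    """Turn simple 'subject predicate object .' lines into Quad rows.

    Only handles single-line, whitespace-separated statements; a real
    Turtle dump needs a proper RDF parser.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.endswith("."):
            line = line[:-1].rstrip()  # drop the statement terminator
        subject, predicate, obj = line.split(None, 2)
        yield Quad(context, subject, predicate, obj)


# Illustrative sample statements using prefixed names.
sample = [
    'wd:Q42 wdt:P31 wd:Q5 .',
    'wd:Q42 rdfs:label "Douglas Adams"@en .',
    'wd:Q64 wdt:P31 wd:Q515 .',
]

quads = list(parse_statements(sample, context="wikidata-dump"))

# Example of a graph-wide aggregate that is awkward on a query service
# but trivial on a flat table: statement count per predicate.
predicate_counts = Counter(q.predicate for q in quads)
```

With the data in this flat shape, questions such as "how many statements use each predicate" become plain group-by operations, which is the kind of workload Hadoop tooling handles well.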

Details

	Subject	Repo	Branch	Lines +/-
	Add WikidataTurtleDumpConverter to rdf-spark-tools	wikidata/query/rdf	master	+356 -10

Event Timeline

• dcausse created this task. · Dec 19 2019, 10:02 AM

Restricted Application added a project: Wikidata. · Dec 19 2019, 10:02 AM

Restricted Application added a subscriber: Aklapper.

Daniel_Mietchen subscribed. · Feb 8 2020, 1:44 PM

• dcausse added a project: Discovery-Search (Current work). · Feb 19 2020, 10:12 AM

Change 570324 had a related patch set uploaded (by DCausse; owner: Joal):
[wikidata/query/rdf@master] Add WikidataTurtleDumpConverter to rdf-spark-tools

https://gerrit.wikimedia.org/r/570324

gerritbot added a project: Patch-For-Review. · Feb 20 2020, 8:44 PM

Change 570324 merged by jenkins-bot:
[wikidata/query/rdf@master] Add WikidataTurtleDumpConverter to rdf-spark-tools

https://gerrit.wikimedia.org/r/570324

Maintenance_bot removed a project: Patch-For-Review. · Mar 24 2020, 2:10 PM

• dcausse closed this task as Resolved. · Mar 24 2020, 2:50 PM

• dcausse assigned this task to JAllemandou.

• dcausse moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board.