
Create commons/wikidata dataset for MediaSearch
Open, High, Public

Description

As a search engineer, I want a dedicated dataset of the Wikidata entities referenced from Commons, so that they do not have to be requested from Wikidata directly.

Commons and Wikidata RDF data are available in a Hive table.

Create a Spark job in wikidata/query/rdf/rdf-spark-tools that pulls every Wikidata item linked from a MediaInfo item through property P180 (depicts) or P6243 (digital representation of), and extracts the following data for each item:

  • item
  • labels
  • aliases
  • descriptions
  • P31 (instance of)
  • P171 (parent taxon)
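
The selection described above can be sketched in plain Python over in-memory triples, to make the intended logic concrete. This is only an illustration under several assumptions: the real job would be a Spark job reading the Hive RDF tables, the prefixed names (sdc:, wd:, wdt:, rdfs:label, skos:altLabel, schema:description) stand in for the full URIs stored there, the function names and record layout are hypothetical, and the identifiers and strings in the sample data are placeholders rather than real entity data.

```python
# Predicates that select an item: a mediainfo entity links to it via P180 or P6243.
LINK_PREDICATES = {"wdt:P180", "wdt:P6243"}
# Single-valued text fields per language (assumed layout: lang -> text).
TEXT_FIELDS = {"rdfs:label": "labels", "schema:description": "descriptions"}
# Statements copied into the record as lists of item ids.
STATEMENT_FIELDS = {"wdt:P31": "P31", "wdt:P171": "P171"}


def referenced_items(commons_triples):
    """Return the items that any mediainfo entity links to via P180 or P6243."""
    return {obj for _subj, pred, obj in commons_triples if pred in LINK_PREDICATES}


def new_record(item):
    """Empty record holding the fields requested in this task (hypothetical layout)."""
    return {"item": item, "labels": {}, "aliases": {}, "descriptions": {},
            "P31": [], "P171": []}


def build_entities(commons_triples, wikidata_triples):
    """Collect the requested fields for every referenced item, one record per item."""
    records = {item: new_record(item) for item in referenced_items(commons_triples)}
    for subj, pred, obj in wikidata_triples:
        record = records.get(subj)
        if record is None:
            continue  # item is not referenced from Commons: skip it
        if pred in TEXT_FIELDS:
            lang, text = obj
            record[TEXT_FIELDS[pred]][lang] = text
        elif pred == "skos:altLabel":
            lang, text = obj
            record["aliases"].setdefault(lang, []).append(text)
        elif pred in STATEMENT_FIELDS:
            record[STATEMENT_FIELDS[pred]].append(obj)
    return records


# Placeholder data, not real entity content.
commons = [
    ("sdc:M1", "wdt:P180", "wd:Q1"),
    ("sdc:M2", "wdt:P6243", "wd:Q2"),
    ("sdc:M2", "wdt:P170", "wd:Q3"),  # other property: does not select Q3
]
wikidata = [
    ("wd:Q1", "rdfs:label", ("en", "toy label")),
    ("wd:Q1", "skos:altLabel", ("en", "toy alias")),
    ("wd:Q1", "schema:description", ("en", "toy description")),
    ("wd:Q1", "wdt:P31", "wd:Q10"),
    ("wd:Q2", "wdt:P171", "wd:Q20"),
    ("wd:Q3", "rdfs:label", ("en", "ignored")),
]
records = build_entities(commons, wikidata)
```

In Spark the same shape would likely become a join between the distinct objects of the P180/P6243 triples on the Commons side and the subjects of the Wikidata triples, followed by a group-by on the item; that mapping is a suggestion, not a description of existing code.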

Example for Q42:

{
TODO
}

The resulting dataset should be available in a hive table for downstream operators.
Hive table: discovery.mediasearch_entities
HDFS folder: hdfs:///wmf/data/discovery/mediasearch_entities
Schedule: should probably be rebuilt as soon as the commons mediainfo RDF dump is processed

AC:

  • a new spark job in wikidata/query/rdf/rdf-spark-tools
  • a new dag in airflow to schedule this new job