As a search engineer, I want a dedicated dataset of the Wikidata entities referenced from Commons so that requests do not have to be made to Wikidata directly.
Commons and Wikidata RDF data are available in a Hive table.
Create a Spark job in wikidata/query/rdf/rdf-spark-tools that pulls all Wikidata items linked from a MediaInfo item via property P180 (depicts) or P6243 (digital representation of), with the following data:
Example for Q42:
{ TODO }
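The selection step above can be sketched in plain Python over in-memory triples. This only illustrates the intended logic and is not the Spark job itself. The `Triple` type, the function names and the URI prefixes are assumptions made for this example and should be checked against the actual RDF tables. The per-entity fields to export are still to be defined, so the sketch stops at selecting the matching triples.

```python
from typing import Iterable, List, NamedTuple, Set

# Assumed URI prefixes (hypothetical for this sketch; verify against the real dumps).
WDT = "http://www.wikidata.org/prop/direct/"
WD = "http://www.wikidata.org/entity/"

# Properties that link a MediaInfo item to a Wikidata item, as listed in this task.
LINKING_PROPERTIES = ("P180", "P6243")


class Triple(NamedTuple):
    """Hypothetical in-memory stand-in for one row of the RDF triples table."""
    subject: str
    predicate: str
    obj: str


def referenced_items(commons_triples: Iterable[Triple]) -> Set[str]:
    """Return the Wikidata item IDs (e.g. "Q42") targeted by P180 or P6243."""
    wanted = {WDT + prop for prop in LINKING_PROPERTIES}
    items = set()
    for triple in commons_triples:
        # Keep only statements whose value is a Wikidata item URI.
        if triple.predicate in wanted and triple.obj.startswith(WD + "Q"):
            items.add(triple.obj[len(WD):])
    return items


def select_entity_triples(
    wikidata_triples: Iterable[Triple], items: Set[str]
) -> List[Triple]:
    """Keep only the Wikidata triples whose subject is a referenced item."""
    wanted_subjects = {WD + item for item in items}
    return [t for t in wikidata_triples if t.subject in wanted_subjects]
```

In Spark, the first function would roughly correspond to a filter on the predicate followed by a distinct on the object, and the second to a join of the Wikidata triples against that set of item URIs.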
The resulting dataset should be available in a Hive table for downstream operators.
Hive table: discovery.mediasearch_entities
HDFS folder: hdfs:///wmf/data/discovery/mediasearch_entities
Schedule: should probably be rebuilt as soon as the Commons MediaInfo RDF dump has been processed
AC:
- a new Spark job in wikidata/query/rdf/rdf-spark-tools
- a new DAG in Airflow to schedule this new job