As a CirrusSearch maintainer, I want MediaSearch to use a dedicated dataset built from Wikidata that does not rely on the existing Wikidata search APIs, so that I can improve either one without affecting the other.
Sub-tickets will be created as needed, but the plan is roughly:
- import the Commons MediaInfo dump to HDFS
- Spark job that joins Commons and Wikidata and outputs a dedicated dataset for concept lookups
- determine the mapping, possibly experimenting with better techniques (not one field per language) to support multiple languages
- custom Elasticsearch query to do query expansion and rewrite
- adapt MediaSearch to replace its use of the Wikidata search API with query expansion
- optional but would be good to have: provide completion for Wikidata items using this same dataset instead of the Wikidata completion API
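
As a rough illustration of the join step, here is a minimal sketch in plain Python rather than Spark. The record shapes (`depicts`, `labels`), the output fields (`item_id`, `commons_usage`) and the choice to keep only items referenced from Commons are assumptions made for the example, not the actual dump schema or an agreed design.

```python
from collections import Counter


def build_concept_dataset(mediainfo, wikidata_items):
    """Join Commons statements with Wikidata labels, one row per concept."""
    # Count how many Commons files reference each Wikidata item.
    # set() avoids counting the same item twice for a single file.
    usage = Counter(
        item_id
        for entity in mediainfo
        for item_id in set(entity["depicts"])
    )
    rows = []
    for item in wikidata_items:
        count = usage.get(item["id"], 0)
        if count == 0:
            continue  # assumption: drop concepts never used on Commons
        rows.append({
            "item_id": item["id"],
            "labels": item["labels"],  # hypothetical shape: {lang: [label, alias, ...]}
            "commons_usage": count,
        })
    # Most used concepts first, ties broken by item id for stable output.
    rows.sort(key=lambda r: (-r["commons_usage"], r["item_id"]))
    return rows


# Hypothetical sample records standing in for the two dumps.
mediainfo = [
    {"id": "M1", "depicts": ["Q146"]},
    {"id": "M2", "depicts": ["Q146", "Q144"]},
]
wikidata_items = [
    {"id": "Q146", "labels": {"en": ["cat"], "fr": ["chat"]}},
    {"id": "Q144", "labels": {"en": ["dog"], "fr": ["chien"]}},
    {"id": "Q5", "labels": {"en": ["human"]}},
]
dataset = build_concept_dataset(mediainfo, wikidata_items)
```

Keeping all languages of a concept in one row is one possible starting point for a mapping that avoids one field per language; the real Spark job would express the same join over DataFrames.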
AC:
- The MediaSearch query builder no longer uses the Wikidata search API
- A single request is made to Elasticsearch per search
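
To make the expansion step of the plan more concrete, the sketch below simulates in Python the kind of rewrite the custom query might perform. To meet the single-request criterion the rewrite itself would have to run inside Elasticsearch; this only shows a possible shape of the result. The function name `expand_query`, the field names `text` and `statement_keywords`, the `P180=<item>` value format and the weights are all assumptions for illustration.

```python
def expand_query(user_query, concept_matches):
    """Build a bool query combining a text match with matched concepts.

    concept_matches is a list of (item_id, weight) pairs, standing in for
    the result of a lookup against the dedicated concept dataset.
    """
    # Plain full-text clause on a hypothetical "text" field.
    should = [{"match": {"text": {"query": user_query}}}]
    for item_id, weight in concept_matches:
        # One clause per matched concept, boosted by its weight.
        should.append({
            "term": {
                "statement_keywords": {
                    "value": f"P180={item_id}",  # assumed "depicts" statement format
                    "boost": weight,
                }
            }
        })
    return {"bool": {"should": should, "minimum_should_match": 1}}


# Hypothetical lookup result for the query "cat".
expanded = expand_query("cat", [("Q146", 2.0)])
```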