Re-use T293878: [L] Gather labeled data relevant to synonyms and build a dataset to update MediaSearch's index.
Implementation pointers at T293878#7825473.
Data
- confirm sampling of search queries based on traffic, e.g., https://github.com/marfox/image-search
- curate queries
- look into languages
- make sure enough queries activate all code paths, e.g., Wikidata query expansion on labels / aliases / depicts
- include queries that activate code paths in different ways, e.g., for "Martin Luther King Jr" all 4 words correspond to 1 Wikidata item, but "toucan in amazon rainforest" has 2 different items in its 4 words (actually not sure how we handle this in the backend)
- include queries that, if used on Wikipedias, would give us "fair use" results, e.g., "bruce springsteen live" might give this image, which is not on Commons but is on enwiki
- ensure positive and negative image samples by setting a reasonable top-K cutoff, e.g., the top 50 results are more likely to include negatives than the top 5
- ensure at least N positive/negative samples per query
- confirm optimization metric, e.g., nDCG - focus on precision@K and avoiding zero-results (i.e., recall)
- what can zero-results tell us?
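
As a starting point for the metric and sampling checks above, here is a minimal Python sketch. It assumes a hypothetical data layout that is not defined anywhere in this task: each query maps to its ranked results as binary labels (1 = relevant image, 0 = not relevant), and a query with no results maps to an empty list. All function names are illustrative, and nDCG is normalized against the labeled results only, which is itself an assumption.

```python
import math
from typing import Dict, List

# Hypothetical layout: query -> ranked binary labels (1 = relevant, 0 = not relevant).
Labels = Dict[str, List[int]]


def precision_at_k(labels: List[int], k: int) -> float:
    """Fraction of the top-k slots filled by relevant results (missing slots count as misses)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(labels[:k]) / k


def dcg_at_k(labels: List[int], k: int) -> float:
    """Discounted cumulative gain over the top-k results, with a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(labels[:k]))


def ndcg_at_k(labels: List[int], k: int) -> float:
    """DCG divided by the DCG of the ideal reordering of the same labeled results."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


def zero_result_rate(data: Labels) -> float:
    """Share of queries that returned no results at all."""
    if not data:
        return 0.0
    return sum(1 for labels in data.values() if not labels) / len(data)


def under_sampled(data: Labels, n: int) -> List[str]:
    """Queries with fewer than n positive or fewer than n negative labeled samples."""
    return sorted(
        query
        for query, labels in data.items()
        if sum(labels) < n or len(labels) - sum(labels) < n
    )
```

Running `under_sampled` on the curated set with the chosen N would flag queries that need a larger top-K, and `zero_result_rate` gives a single number to track alongside precision@K and nDCG.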