Adding synonyms to MediaSearch (see T258053 [1]) has greatly improved recall (i.e. the number of relevant results returned) for non-English languages. For example, searching in Irish for "ialtóg" (bat) without synonyms gives 331 results, while searching with synonyms gives ~3800 results.
Our existing labeled data shows a minor bump in search performance when synonym search is included, but because the labeled data is mostly for English search terms, it's unlikely to capture the big difference synonyms make to non-English searches. We'd like to capture the improved recall in our labeled data, with maybe a few thousand non-English query/image/rating datapoints. The simplest way to do this is:
- do a search with https://commons.wikimedia.org/w/index.php?search=YOUR_SEARCH_TERM&ns6=1&uselang=YOUR_LANGUAGE&mediasearch_synonyms
- copy/paste the URLs of some good/bad matches (ignore indifferent ones, they're not very useful) into https://media-search-signal-test.toolforge.org/bulk.html
- tag your data with "synonyms"
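To save typing when working through a list of terms, the search URLs from the first step could be generated with a small script. This is only a sketch: the function name and the term list are hypothetical, and it assumes YOUR_LANGUAGE is a language code such as "ga" for Irish. The parameters themselves are copied from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://commons.wikimedia.org/w/index.php"


def build_synonym_search_url(search_term: str, language: str) -> str:
    """Build a MediaSearch URL with synonyms enabled (hypothetical helper)."""
    # Same parameters as the URL in the steps above; urlencode
    # percent-encodes non-ASCII search terms as UTF-8.
    params = urlencode({"search": search_term, "ns6": 1, "uselang": language})
    # mediasearch_synonyms is a flag without a value, so it is appended as-is.
    return f"{BASE_URL}?{params}&mediasearch_synonyms"


if __name__ == "__main__":
    # Illustrative (term, language code) pairs; replace with the real list.
    terms = [("ialtóg", "ga")]
    for term, lang in terms:
        print(build_synonym_search_url(term, lang))
```

Picking and rating the good/bad matches still has to be done by hand in the bulk tool.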
We're not sure yet how to decide which search terms to use. Some options:
- perhaps use https://trends.google.com/trends/yis/2021/ to find popular ones
- Google image search queries (more specific, sort by top instead of rising)
[1] There's currently a problem with response times when synonyms are enabled (see T293106), but let's ignore that for the purposes of this ticket.