
Wrong search result for MediaSearch in Commons, used in Wikistories
Closed, Invalid · Public · BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

What happens?:
Some of the search results include:

None of them has anything to do with Surakarta City, and nothing in their titles, descriptions, or metadata bears any resemblance to "Surakarta" or "City" ("Kota" in Indonesian).

I happened to find this bug while creating a Wikistory on Indonesian Wikipedia:
https://www.mediawiki.org/wiki/Topic:Wyh8f0obblz9wdd9

I suspect this is because https://id.wikipedia.org/wiki/Surakarta is colloquially known as "Solo" for short. But how did this information seep into the search results?

What should have happened instead?:

  • Nothing unrelated to "Kota Surakarta" should be displayed
  • Images whose title, description, or category (including subcategories) matches the search term should carry greater weight and be displayed at the top; images with no such match should be pushed to the very end of the search results.
  • Ideally, images should be sorted by usage across projects: greater usage = better-quality image (hopefully). Other considerations for extra weight would be: Picture of the Day status, a title exactly matching the search term, the age of the file, multiple occurrences of the search terms in the title, description, and categories, and whether the image is in the top category or deep in the subcategories.
  • Images in the category whose name exactly matches the search term should be displayed

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.:

Event Timeline

Restricted Application added a subscriber: Aklapper.

Thanks for raising this issue. Media Search matches on the following fields (in descending order of the weight we give each field):

  • any "depicts" statements an image has
  • the image title
  • the image category name (not subcategories)
  • the image captions in the user's interface language
  • if there's a redirect to the image, then the title of the redirect
  • the text on the page

We search for the search terms the user has entered, plus any synonyms we can find, giving the synonyms half the weight of the user-entered search terms. Note that the scores for individual search terms depend not only on the weights we've given to the fields, but also on how often a search term appears within the fields we're searching, and how often it appears anywhere on Commons (see en:Okapi BM25 if you want to find out more).
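A minimal sketch of the scheme described above, not Media Search's actual implementation: each field carries a weight, synonyms count at half the weight of user-entered terms, and each term is scored per field with the Okapi BM25 formula. The numeric field weights, the `k1` and `b` constants, and all function and key names are illustrative assumptions; only the field ordering and the 0.5 synonym factor come from the explanation above.

```python
import math

# Hypothetical weights; only their descending order reflects the list above.
FIELD_WEIGHTS = {
    "depicts": 6.0,
    "title": 5.0,
    "category": 4.0,
    "caption": 3.0,
    "redirect_title": 2.0,
    "text": 1.0,
}
SYNONYM_FACTOR = 0.5  # synonyms get half the weight of user-entered terms


def bm25_term_score(tf, field_len, avg_field_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Okapi BM25 score of one term in one field (k1 and b are common defaults)."""
    # Rare terms (low doc_freq) get a higher inverse document frequency.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates and is normalised by field length.
    norm = tf + k1 * (1 - b + b * field_len / avg_field_len)
    return idf * tf * (k1 + 1) / norm


def score_document(doc, terms, synonyms, stats):
    """Sum weighted BM25 scores over all fields for terms and synonyms.

    doc:   field name -> list of tokens
    stats: {"doc_count": int, "doc_freq": {term: int}, "avg_len": {field: float}}
    """
    weighted_terms = [(t, 1.0) for t in terms]
    weighted_terms += [(s, SYNONYM_FACTOR) for s in synonyms]
    total = 0.0
    for field, field_weight in FIELD_WEIGHTS.items():
        tokens = doc.get(field, [])
        if not tokens:
            continue
        for term, term_weight in weighted_terms:
            tf = tokens.count(term)
            if tf == 0:
                continue
            total += field_weight * term_weight * bm25_term_score(
                tf, len(tokens), stats["avg_len"][field],
                stats["doc_count"], stats["doc_freq"].get(term, 0),
            )
    return total
```

With everything else equal, a synonym match scores exactly half of a direct match. Because the BM25 part also depends on how rare a term is and how short the field is, a synonym hit in a short title can still outscore a direct hit in long page text.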

Then once we have all the results, we rescore the top ~8k results, giving an extra boost to any files with templates like Quality image, Valued image, etc. The weights for the search terms are a result of optimising for "good" results coming before "bad" results from a training dataset of 14k search results in 20 languages.
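The rescoring step could look roughly like the sketch below. It assumes a multiplicative boost per template and made-up boost values; the reply above names the templates and the ~8k window but not how the boost is applied, so treat the arithmetic as a placeholder.

```python
# Hypothetical boost values; the source names the templates but not the numbers.
TEMPLATE_BOOSTS = {"Quality image": 1.2, "Valued image": 1.1}
RESCORE_WINDOW = 8000  # "top ~8k results"


def rescore(results, window=RESCORE_WINDOW):
    """Boost the top `window` results by template, then re-sort that slice.

    results: list of (filename, score, templates), sorted by score descending.
    Results beyond the window keep their original score and order.
    """
    head, tail = results[:window], results[window:]
    boosted = []
    for name, score, templates in head:
        factor = 1.0
        for template in templates:
            factor *= TEMPLATE_BOOSTS.get(template, 1.0)
        boosted.append((name, score * factor, templates))
    boosted.sort(key=lambda r: r[1], reverse=True)
    return boosted + tail
```

Limiting the boost to a window keeps the second pass cheap: only the best first-pass matches are re-sorted, so a template can reorder good results but cannot pull a very weak match to the top.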

Just going through some search results for "kota surakarta":

  • The first image matches on "depicts" (see in the "structured data" tab), title, category, caption (because the user's interface language for this search is English), and page text
  • The second image matches title, category name, and the title of a redirect to the file
  • The third image matches title, category name, and page text
  • The fourth image matches the search term and a synonym in the title, search term and a synonym in category names, and the page text
  • The first obviously wrong image we can see is File:Solar_Solo.jpg - this matches a synonym ("solo") on the title, category and redirect
  • Similarly the chainsaw image File:Solo-645.jpg matches a synonym on title and category
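To see why the two unrelated files show up, here is a rough sketch that reports which fields a query term or synonym hits. The tokenizer is a crude stand-in (lowercase, split on `_`, `-`, `.` and whitespace), not the analysis chain the real search uses, and the function name is hypothetical. Only the file names are taken from the list above.

```python
def matching_fields(doc, terms, synonyms):
    """Return {field: [(token, kind), ...]} for fields hit by a term or synonym."""
    hits = {}
    for field, text in doc.items():
        # Crude tokenizer: treat "_", "-" and "." as word separators.
        cleaned = text.lower().replace("_", " ").replace("-", " ").replace(".", " ")
        tokens = cleaned.split()
        found = [(t, "term") for t in terms if t in tokens]
        found += [(s, "synonym") for s in synonyms if s in tokens]
        if found:
            hits[field] = found
    return hits


terms = ["kota", "surakarta"]
synonyms = ["solo"]  # the alias that was cached at the time of the report
for title in ["Solar_Solo.jpg", "Solo-645.jpg"]:
    print(title, matching_fields({"title": title}, terms, synonyms))
```

Both titles match only through the synonym, which is consistent with the walk-through above: neither file contains "kota" or "surakarta" anywhere in its title.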

Adding synonym searching has massively improved our search "recall" (the number of results returned) for non-English languages, especially less widely-spoken ones. Before we introduced it, a search for (for example) "íaltog" returned very few results; now the search system knows that "íaltog" is the Irish for "bat", and so can return images with "bat" in the title, or in the category "bats". It may be that synonym searching has reduced our "precision" (the proportion of the total results that are good matches), and that the data we used to build our model is insufficient to show this. We can try to gather more data and experiment with different weights for synonyms. Any more examples like this that you might have will be very helpful, but please keep in mind that we're balancing making things better for some use cases against making them worse for others.
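The trade-off can be made concrete with a small sketch. It uses the standard information-retrieval definitions, where recall is the fraction of all relevant files that were returned (rather than a raw result count); the file names and relevance labels are invented for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result list against a relevant set."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four files are genuinely relevant to the query.
relevant = {"a.jpg", "b.jpg", "c.jpg", "d.jpg"}

without_synonyms = ["a.jpg"]
with_synonyms = ["a.jpg", "b.jpg", "c.jpg", "x.jpg", "y.jpg", "z.jpg"]

print(precision_recall(without_synonyms, relevant))  # (1.0, 0.25)
print(precision_recall(with_synonyms, relevant))     # (0.5, 0.75)
```

In this toy case synonyms triple recall while halving precision, which is the kind of shift the paragraph above describes.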

In the meantime, we have also recently imported a large new dataset based on links between Wikidata and Commons images, which we hope will improve our precision. Note that it won't necessarily exclude the "bad" images you have found, but it should ensure that better images appear sooner in the list of results. We're hoping to start using this new dataset in Media Search within the next couple of weeks.

P.S. We haven't found that whether an image is used on-wiki is a good signal for scoring how good a match an image is. A good illustration of why it hasn't turned out to be useful as a primary search signal is this file - it's used in lots of pages, including the Albert Einstein and Arthur Eddington pages on many wikis, but is not a good match for a search for "Albert Einstein" or "Arthur Eddington". It might turn out to be useful as a secondary signal in the same way that the templates are, but we haven't evaluated that so far.

As you seem to have discovered, a lone "Solo" was set as an English alias.
You have since removed it, but it is still in the synonyms cache (for 1 day, so this should no longer be an issue tomorrow).
After the cache clears (and assuming aliases on Wikidata remain what they are now), "solo city" (as exact phrase match) is going to be the synonym that'll be used.
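The one-day delay is what a simple time-to-live cache produces. The sketch below is an assumption about the general mechanism, not the actual synonyms cache: the class, its `fetch` callback, and the injectable `clock` are hypothetical names, and only the one-day lifetime comes from the comment above.

```python
import time

ONE_DAY = 24 * 60 * 60  # cache lifetime mentioned above, in seconds


class SynonymCache:
    """Cache synonym lookups and refetch them once an entry is older than `ttl`."""

    def __init__(self, fetch, ttl=ONE_DAY, clock=time.time):
        self._fetch = fetch      # callable: term -> list of synonyms
        self._ttl = ttl
        self._clock = clock      # injectable so expiry can be simulated
        self._entries = {}       # term -> (stored_at, synonyms)

    def get(self, term):
        now = self._clock()
        entry = self._entries.get(term)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh: upstream edits are not visible yet
        synonyms = self._fetch(term)
        self._entries[term] = (now, synonyms)
        return synonyms
```

Until an entry expires, removing an alias upstream has no effect on search, which matches the reporter seeing no change on the first day and better results on the next.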

Thanks for the detailed explanation, Sannita. And yes, matthias, I didn't find an immediate change yesterday, so I checked again today, and the result is way better now. Feel free to close this.