
Improve/fix multilingualism on Commons' search
Closed, Invalid (Public)


A few community members have shared difficulties interacting with the new Wikimedia Commons search (MediaSearch), especially regarding translation/multilingualism features to help find media files.

For example, when searching for "apple", the fruit, on Commons, the results in English are very similar to the results using the haswbstatement:P180=Q89 (depicts = apple). However, when doing the same search in other languages, such as German, Spanish, French, Portuguese, and others, the results are quite poor, even though they all have labels on Wikidata. Is there a way to improve or fix this?

Event Timeline

matthiasmullie subscribed.

AFAICT, nothing is really wrong here. There are a couple of things at play.

The example links for searches in other languages don't specify a language.
MediaSearch will use the user's interface language (usually/often English) as context to find Wikidata items, and figure out how much such an item is worth.
When searching for foreign words in an English context, we may not find the items that carry those labels in another language at all. And when we do find them (via a system of fallback languages), the items likely won't have much weight.
So: Wikidata items with a matching label in a language that is not the user's Commons interface language will either not be found or not be given much weight.
If we explicitly override the language for these searches, we see that the results immediately improve drastically: German, Spanish, French & Portuguese
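As a rough illustration of what "explicitly override the language" can look like, here is a minimal sketch that builds MediaSearch URLs with an explicit language. It assumes the override is done through MediaWiki's `uselang` URL parameter and that the search term goes in a `search` parameter; the helper name `media_search_url` and the example terms are made up for illustration and are not taken from the original links.

```python
from urllib.parse import urlencode

# Assumed entry point for MediaSearch on Commons.
BASE_URL = "https://commons.wikimedia.org/wiki/Special:MediaSearch"


def media_search_url(term, lang=None):
    """Build a MediaSearch URL, optionally forcing the interface language.

    Hypothetical helper: `uselang` is MediaWiki's generic interface-language
    override; that the original links used it is an assumption.
    """
    params = {"search": term}
    if lang is not None:
        params["uselang"] = lang
    return f"{BASE_URL}?{urlencode(params)}"


# Illustrative terms for "apple" (assumed, not copied from the task).
searches = {"de": "Apfel", "es": "manzana", "fr": "pomme", "pt": "maçã"}

for lang, term in searches.items():
    print(media_search_url(term, lang))
```

Without the `lang` argument the URL carries no language hint, so the search would fall back to the user's interface language, as described above.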

Another important factor is "the rest of the content": titles, wikitext etc. We don't know what language they're in so they're all treated the same.
The overwhelming majority of content is in English, though. This means that other (non-English) terms that occur in those fields are going to be rarer than their English equivalents.
Occurrences of rarer words are going to have a larger weight than occurrences of common words.
Since this is all just 1 big blob of text which can contain any language (sometimes even multiple), we can't treat them any differently.
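To make the "rarer words weigh more" effect concrete, here is a toy sketch using a BM25-style inverse document frequency (the formula Lucene's BM25 similarity uses). Whether Commons search scores these fields exactly this way is an assumption, and the document counts below are invented purely for illustration.

```python
import math


def idf(total_docs, docs_with_term):
    """BM25-style inverse document frequency: rarer terms score higher."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


# Hypothetical corpus sizes, not real Commons statistics.
TOTAL_DOCS = 1_000_000
DOCS_WITH_ENGLISH_TERM = 50_000  # e.g. "apple" appearing in titles/wikitext
DOCS_WITH_GERMAN_TERM = 500      # e.g. "Apfel" appearing in the same fields

common_weight = idf(TOTAL_DOCS, DOCS_WITH_ENGLISH_TERM)
rare_weight = idf(TOTAL_DOCS, DOCS_WITH_GERMAN_TERM)

print(f"weight of common (English) term: {common_weight:.2f}")
print(f"weight of rare (non-English) term: {rare_weight:.2f}")
```

Under these assumed numbers, a plain-text match on the rarer non-English term gets a noticeably larger weight than a match on the English term, which is the imbalance described above.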

So basically: the terms were not searched in the correct language context (the search assumed a different language than the one the terms are in), so the relevant items are not worth much. Those matches then have to compete with other matches that already have higher-than-usual scores, because the other-language terms are rarer than their English equivalents.