
[M] Create user-facing documentation for Media Search on Commons
Closed, Resolved · Public

Description

This task is to create user-facing documentation for MediaSearch on Commons so that users understand how ranking works.

This should be a basic overview of what's included and how it contributes to the ranking.

The target audience is Wikimedians who are contributing files and want to make sure their files are ranked highly in search.

Event Timeline

CBogen renamed this task from Create documentation for Media Search on Commons to [M] Create user-facing documentation for Media Search on Commons.Sep 23 2020, 4:36 PM
CBogen updated the task description.

Below is an attempt at explaining what kind of data is used, how & why, without getting too technical.
If anything is too detailed or not detailed enough, or plain unclear, LMK and I'm happy to try to accommodate.
Feel free to edit in any way, and post wherever it may be relevant.


Commons multimedia search

TL;DR: to maximize the chances of files being found:

  • Add a relevant title
  • Add relevant captions in as many languages as possible, describing what the file is about
  • Add a detailed description, describing what the file is about and whatever other context is relevant
  • Add the file to the relevant categories
  • Add all depicts statements that you think your file is a relevant representation of (but no more)

Here's a simplified overview of the kind of data that is used, and in what way it contributes to finding files:

Full text search

How

This is traditional text-based search: if a text contains the words being searched for, the file matches.
The ranking is influenced in 2 ways:

  1. Frequency of terms

The search algorithm will try to estimate how relevant a result is based on the frequency of the search terms.

Simplified:

  • the more often the search terms occur in a document, the more relevant it appears to be (e.g. if a document mentions "Mona Lisa" more than another, it's likely more relevant)
  • the more often the search term occurs across all documents, the less relevant that term will be (e.g. common words like "does" will not contribute much to the score because so many documents contain that word)

For a "Mona Lisa" search term, this helps us discover that the "Mona Lisa" article (184 mentions of the term) is likely a better result than the "Louvre museum" article (7 occurrences).
The problem on Commons, however, is that this frequency-based relevance often doesn't mean as much: these are not long articles but short descriptions, so terms tend to occur no more than once or twice, and there is little other content to compare against.
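The frequency-based scoring described above can be sketched as classic TF-IDF. The formula and numbers below are illustrative only, not the exact algorithm MediaSearch runs in production:

```python
import math

def tf_idf(term, doc, corpus):
    """Illustrative TF-IDF: term frequency in this document, discounted
    by how common the term is across the whole corpus (smoothed IDF)."""
    tf = doc.count(term)                                  # frequency in this doc
    docs_with_term = sum(1 for d in corpus if term in d)  # document frequency
    idf = math.log(1 + len(corpus) / (1 + docs_with_term))
    return tf * idf

corpus = [
    ["mona", "lisa"] * 5 + ["painting"],   # "Mona Lisa" article: many mentions
    ["louvre", "museum", "mona", "lisa"],  # "Louvre museum" article: one mention
    ["painting", "museum"],
]
# More occurrences of the search term -> higher score for the same term
assert tf_idf("mona", corpus[0], corpus) > tf_idf("mona", corpus[1], corpus)
```

Note how the first assertion captures exactly the "Mona Lisa vs. Louvre" comparison above; with one-line Commons descriptions, both documents would have a term frequency of 1 and the scores would barely differ.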

  2. Position of terms

Files have multiple text fields (title, caption, description, categories), all of which are used differently. They each contribute to the final relevance score, but in different ways.
While a description can be long (and even contain multiple languages and "insignificant" (in terms of relevance to the search term) information), titles are usually short & highly specific.
Captions are somewhere in between, but have the added benefit of being multilingual. Even categories count!

Wikitext descriptions tend to be considered most important, but they contain so much information that significant terms often don't stand out as much there.
E.g. details like the author, the place or date that a media file was created or what museum it belongs to or what license it is published under - while important - are often not the terms that people will search for. Even significant parts of a description are simply "context", not the main subject.
Descriptions contain an awful lot of information that's often very important to even be able to find the file in the first place, but it's often hard to make out exactly what the file is about based on the terms in the description alone.

Additional data that describes things in a more succinct way (e.g. titles, captions, categories) will usually not contain more information than a description already does, but is often focused on more highly specific information, which essentially helps determine what's important in a media file.
This basically means that - when searching for "Mona Lisa" - something that contains "Mona Lisa" in the description alone will usually matter less than one that also includes that term as part of the title and/or caption, and is added to (one of) the Mona Lisa categories.
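The idea of weighing fields differently can be sketched as follows. The field names match the ones discussed above, but the weights and the substring-matching logic are made up for illustration; they are not the actual MediaSearch configuration:

```python
# Hypothetical per-field weights: short, specific fields like the title
# count for more than a long, noisy description.
FIELD_WEIGHTS = {"title": 3.0, "caption": 2.0, "category": 1.5, "description": 1.0}

def score(query, file_fields):
    """Sum of per-field matches, each scaled by that field's weight."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        if query.lower() in file_fields.get(field, "").lower():
            total += weight
    return total

description_only = {"description": "A photo of the Mona Lisa at the Louvre."}
well_described = {
    "title": "Mona Lisa",
    "caption": "The Mona Lisa by Leonardo da Vinci",
    "category": "Mona Lisa",
    "description": "A photo of the Mona Lisa at the Louvre.",
}
# Matching in title + caption + category + description beats description alone
assert score("Mona Lisa", well_described) > score("Mona Lisa", description_only)
```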

Optimize

This is no argument for simply repeating the same information all over the place. The additional information in descriptions, for example, might also be important for being found under other search terms (e.g. "Da Vinci").
This is also no argument for stuffing every field with as much information as possible, because that will lower the frequency-based relevance scores (as described above).
And trying to optimize placement for search terms that are not highly relevant is useless anyway - users will skip right past those results.

In short: the simplest way to optimize is simply to add as much data as possible in the way that they are intended. This is exactly what we try to optimize the search algorithm for.
Try to accurately describe the file in as many ways as possible. Add a relevant title, a detailed description, a caption (ideally in multiple languages), add the appropriate categories.

Caveats

The aforementioned full-text search algorithm is very good, but has some issues as well - especially in our context:

  1. Language

In a traditional text search, we likely don't want to see results in languages other than the one we're searching in (because we wouldn't understand them).
That's different on Commons, because we're not really looking for the descriptions anyway - we want the file.
If we're searching for pictures of a "car", ideally we'd also find files described with the Dutch "auto" or the French "voiture".
But unless every image's descriptions/captions have translations for every language, we (sadly) can't find those unless we happen to search in the right language.

An additional issue here is that while some words look the same in multiple languages, they may have different meanings. E.g. "gift" in English vs. German, or "chat" in English vs French.

  2. Synonyms

Similarly, when searching for a "bat", you're not going to find images where they're referred to by their scientific name: "Chiroptera". Or "NYC" and "New York".

  3. Word matches, not concepts

Similarly, a text description might contain a lot more implicit information that we simply cannot capture.
A "british shorthair" is also a "cat" and a "Volvo V40" is a "car", but unless their descriptions also explicitly mention "cat" or "car", they won't be found under those terms.

Statements

Wikidata statements have the potential of solving many of the aforementioned caveats of full text searches: they are multilingual, have aliases, and are linked to all sorts of related concepts.

How

Since the addition of the "Structured data" tab on file pages, it has been possible to attach Wikidata entities to a file, including statements about what the file "depicts."

Given a search term (e.g. "anaconda"), we'll also search Wikidata for relevant entities. In this case, here are some of the top results:
Anaconda (Q483539): town in Montana
Eunectes (Q188622): genus of snakes
Anaconda (Q17485058): Nicki Minaj song

In addition to full text matching, we'll also include results that have a "depicts" statement of (one or multiple of) these entities.
This has the potential of drastically expanding the number of results returned, because entities already cover synonyms (via Wikidata aliases) and language differences (labels & aliases in multiple languages): a file only needs to be tagged with 1 depicts statement, and we'll be able to find it under any of the entity's already-known aliases or translations.
And when translations or aliases get added to those entities later on, files tagged with them will automatically benefit from it by now being discoverable under those terms as well.

Note: not all entities are considered equal. When searching for "iris", users are likely expecting to find multimedia that depicts the genus of plants (Q156901), or maybe the part of an eye (Q178748), but probably not Iris Murdoch, the British writer and philosopher (Q217495).
Based on similarity to the search term & importance/popularity of the entity (as a proxy of how likely it is to be searching for that entity), we'll boost multimedia with certain entities more than others.
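The matching-and-boosting idea above can be sketched like this. The entity IDs are the real Wikidata items from the "anaconda" example, but the boost numbers and data structures are purely illustrative, not the actual MediaSearch scoring:

```python
# Entities matched for the query "anaconda", each with a hypothetical boost
# derived from similarity to the search term and entity popularity.
QUERY_ENTITIES = {
    "Q188622": ("Eunectes, genus of snakes", 0.9),
    "Q483539": ("Anaconda, town in Montana", 0.4),
    "Q17485058": ("Anaconda, Nicki Minaj song", 0.3),
}

def depicts_boost(depicts_ids):
    """Best boost among the query entities this file is tagged with."""
    return max((QUERY_ENTITIES[e][1] for e in depicts_ids if e in QUERY_ENTITIES),
               default=0.0)

snake_photo = {"depicts": ["Q188622"]}   # tagged "depicts: Eunectes"
town_photo = {"depicts": ["Q483539"]}    # tagged "depicts: Anaconda, Montana"
untagged = {"depicts": []}
# A file depicting the snake genus is boosted more than one depicting the town
assert depicts_boost(snake_photo["depicts"]) > depicts_boost(town_photo["depicts"])
assert depicts_boost(untagged["depicts"]) == 0.0
```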

Caveats

Wikidata entities are an excellent signal to help discover additional relevant multimedia:

  • there is less noise (e.g. descriptions often contain false-positives like "iris" being the first name of the photographer, not the subject of the file)
  • they contain a lot more information (aliases & translations) than individual file descriptions ever can
  • they can be enriched in 1 central location

But they are a poor indicator for ranking:

  1. Relative relevance

Statements alone give no sense of their relative importance.
When a file "depicts" multiple things, it's hard to know which of the subjects are most important or relevant.
Are they all equally important, or is one of them the obvious subject and the others irrelevant small background details? If so, which? And are they more relevant than in some other file?
Consider the "pale blue dot" photograph: even though the earth makes up less than a pixel in that image, it's a pretty damn significant feature of the photo.

Statements essentially only have 2 states: something is in the file, or it is not. There is no further detail about just how relevant something is in that file.
The "mark as prominent" rank could help a little, but (1) it is used inconsistently (often only when there are multiple subjects) and (2) it still lacks a lot of nuance (that's still only 3 states).

While depicts statements are tremendously useful in helping surface additional relevant results, it's hard to use them as a ranking signal: textual descriptions often convey the relative importance of subjects better than these simple statements can.

  2. Level of detail

Wikidata has many entities at varying levels of detail.

E.g. bridge (Q12280), suspension bridge (Q12570), Golden Gate Bridge (Q44440) or tourist attraction (Q570116) could probably all be used to describe a picture of the Golden Gate Bridge.
The Golden Gate Bridge (Q44440) statement already implies all of the others, though, via its various related entities.

However, there are examples where it's not this simple.
German Shepherd dog (Q38280) is a subclass of dog (Q144), which is a subclass of pet (Q39201) - in theory, we should be able to find pictures tagged with "German Shepherd dog" when one searches for "pet."
However, some photos tagged as "German Shepherd dog" are likely going to be of working dogs (Q1806324), not pets.

As a result, a case could be made to include statements for more generic terms as well, but that comes with its own set of problems: files of highly specific things are usually not relevant for their more generic terms.
Simply put: someone searching for "suspension bridge" probably isn't interested in an artsy landmark-focused photo of the Golden Gate Bridge. Even though it is a suspension bridge, that is an irrelevant detail for that photo.
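The subclass-chain expansion discussed above can be sketched as a simple traversal. The P279 ("subclass of") relations below reflect the examples in the text, but the data layout and traversal are purely illustrative; a real system would query Wikidata:

```python
# Hypothetical subclass-of (P279) chains, hard-coded for illustration.
SUBCLASS_OF = {
    "Q38280": "Q144",    # German Shepherd dog -> dog
    "Q144": "Q39201",    # dog -> pet
    "Q44440": "Q12570",  # Golden Gate Bridge -> suspension bridge
    "Q12570": "Q12280",  # suspension bridge -> bridge
}

def ancestors(entity):
    """All more-generic concepts implied by an entity's subclass chain."""
    chain = []
    while entity in SUBCLASS_OF:
        entity = SUBCLASS_OF[entity]
        chain.append(entity)
    return chain

# A file tagged "German Shepherd dog" would also surface under "dog" and "pet"...
assert ancestors("Q38280") == ["Q144", "Q39201"]
# ...but each step up the chain is a weaker signal: a match on a distant
# ancestor should be boosted far less than a direct depicts match.
```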

While we are currently working towards being able to include "child concepts", it is essentially another example of why we must be careful in how much weight we give to certain entities; especially when compared to full text search.

Optimize

It's easy to fall into the trap of trying to add as many statements as possible to increase your chances of being found, but it'll end up working against you.

Take, for example, a photograph of a BMW concept car with a dog somewhere in the background, taken at the Berlin Motor Show.

Do not add:

  • sky (irrelevant: that's just the setting)
  • dog (irrelevant: it's not what the picture is about - it contributes nothing to the image)
  • Berlin (irrelevant: while it was taken in Berlin, and depicts a few square meters of Berlin soil, it is not representative of Berlin)
  • car (too generic: no-one searching for a generic "car" will want the image of a highly specific type of car, they'll skip right past it)

Do add:

  • concept car
  • electric car
  • BMW
  • Berlin Motor Show

Remember that it's impossible to optimize for everything. You simply can't score well on all search terms: emphasizing one thing takes emphasis away from another.
While it may help your image be found for more terms (e.g. "dog" in the above example), users will ignore the file if it's not relevant enough for those terms (if they're looking for a "dog", then your car photo is not what they'll want.)
Meanwhile, it does come at the cost of decreased relevance for the terms that actually matter.
We've not only been working on an improved multimedia-focused search algorithm; we're also experimenting with a new UI that puts the multimedia front & center, where it becomes much easier to find relevant results in the blink of an eye (or ignore irrelevant ones).
It is acceptable to add more generic terms, as long as they are a good representation of that generic concept, or if you simply don't know more detail.

Add as many relevant statements as possible, but only relevant statements!
If you have more information about those statements (e.g. aliases or translations in other languages), you might even want to add those as well, to make it more likely that the statement will be matched in the first place.

Thanks @matthiasmullie! I copied this into a google doc here:

https://docs.google.com/document/d/1MWVqEgUOu7IvROwa_oUjvboP2PyQZ_x4-ymUc2HTwi4/

I'll take a look over the next few days and post questions/comments/suggestions there.