Testing search in MediaSearch - Part II

“But what am I going to see?
I don't know. In a certain sense, it depends on you.”
― Stanislaw Lem, Solaris

In Testing search in MediaSearch - Part 1, a top-down analysis was used to identify what needs to be tested in MediaSearch. A comparison analysis was done to show that MediaSearch performs, at a minimum, as well as Special:Search: all Special:Search functionality should be present in MediaSearch, the search should still work reliably, and users' workflows should not be disrupted. Testing the UI served the same purpose: confirming that search is reliable, that search features are easy to use, and that users have a better experience than with Special:Search.

But the most important question remains: how well does the new search meet users' expectations? According to the test outline diagram, other aspects of search need to be explored to answer that question: relevance, full text search vs phrase match search, and searching in different languages.

Screen Shot 2021-06-30 at 4.24.37 PM.png (970×1 px, 292 KB)

Let's start with relevance. Assessing search quality is quite a challenging task (e.g. see Research:Measuring User Search Satisfaction (1)). Search relevance has been creatively described as something that, "like beauty, is in the eye of the beholder" (2):

[...] relevance is subjective, there is no way to return the perfect result set. There are, however, various approaches and tools that can be used to tune the result set for the most optimal results for your users.

"How scoring works in Elasticsearch" (2)

"Run some search queries and see if the results make sense": this commonsense approach would probably work. I started with a search term that cannot fail: cat.

Search term | Special:MediaSearch | Special:Search | MediaSearch full text search ES score (first result)
cat | Screen Shot 2021-06-29 at 7.58.23 PM.png (1×2 px, 2 MB) | Screen Shot 2021-06-29 at 8.27.13 PM.png (1×1 px, 551 KB) | 98.84846 (Link to scores: cat)

Both Special:Search and MediaSearch return only pictures of cats, and the Special:Search results match the MediaSearch results. The ES score (Elasticsearch score) for the first picture is high, so the test is a success.
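
The "do the two result sets match" check can also be expressed as a number. Below is a minimal sketch, assuming the result titles from each search page have already been collected into two ranked lists; the function name overlap_at_k and the file titles are hypothetical and used only for illustration.

```python
def overlap_at_k(results_a, results_b, k=10):
    """Return the fraction of top-k titles shared by two ranked result lists."""
    top_a = set(results_a[:k])
    top_b = set(results_b[:k])
    if not top_a and not top_b:
        # Two empty result lists agree trivially.
        return 1.0
    return len(top_a & top_b) / max(len(top_a), len(top_b))


# Hypothetical titles, not real search results.
special_search = ["File:Cat A.jpg", "File:Cat B.jpg", "File:Cat C.jpg"]
media_search = ["File:Cat B.jpg", "File:Cat A.jpg", "File:Cat D.jpg"]

print(overlap_at_k(special_search, media_search, k=3))
```

A value of 1.0 means the two searches returned the same top-k files (ignoring order); lower values point to results worth a closer manual look.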

However, the next simple search - search term king - is a little more confusing:

Search term | Special:MediaSearch | Special:Search | MediaSearch full text search ES score (first result)
king | Screen Shot 2021-06-29 at 6.04.06 PM.png (1×2 px, 2 MB) | Screen Shot 2021-06-29 at 8.39.41 PM.png (1×1 px, 700 KB) | 82.770256 (Link to scores: king)

Again, there are no differences in the search results between Special:Search and MediaSearch, which is great. But what does the first picture in the results, File:Polistes May 2013-2.jpg, have to do with the search term "king"? On a closer look, the surprising result is not that surprising. It turns out that relevance scoring relies on structured data, and the structured data for File:Polistes May 2013-2.jpg has the statement "monarch", which is a Wikidata item (Q116) that has the label "king":

Screen Shot 2021-06-29 at 8.55.42 PM.png (978×2 px, 259 KB)

Another unexpected item in our search results is the picture of potatoes. The ES score is 58.06765, and it's rather difficult to see the relevance to the search term king. The file name of this picture of seemingly plain potatoes explains why it was included in the results: File:King Edward.jpg. Since the term "king" is in the file name, the picture is naturally returned when you search for "king". Structured data, file names, captions, and descriptions are what the search is based on.

Next to test: full text search and phrase matching search. When we search for Newton's cradle, the results are delightful: only results that explicitly portray Newton's cradle are returned. Irrelevant results from the separate searches for Newton and for cradle are not present.

search term - Newton | search term - cradle | search terms - Newton's cradle
Screen Shot 2021-06-30 at 4.37.47 PM.png (1×2 px, 3 MB) | Screen Shot 2021-06-30 at 4.34.07 PM.png (1×2 px, 2 MB) | animation6_gif.gif (721×1 px, 2 MB)

What if I add quotation marks and search for "Newton's cradle"? Will it limit the search even more (it should)? Yes, searching for "Newton's cradle" returns only 50 results vs 56 results for Newton's cradle, and those 6 extra results are scored lower than the top results.

The next test checks how phrase matching search works for completely unrelated search terms, e.g. pika ballet. Would it find results for pika and for ballet and return them in random order (with low ES scores)? Nope, the search is smart enough to return nothing:

Screen Shot 2021-06-30 at 4.55.58 PM.png (1×2 px, 188 KB)
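
The quoted vs unquoted comparison above (50 vs 56 results) boils down to asking which results appear only in the broader search and where they rank. Here is a rough sketch of that check, assuming both result lists have been collected by hand; the helper only_in_broader and the titles are made up for illustration.

```python
def only_in_broader(broad_results, phrase_results):
    """Return (rank, title) pairs found in the broad search but not in the phrase search."""
    phrase_titles = set(phrase_results)
    return [
        (rank, title)
        for rank, title in enumerate(broad_results, start=1)
        if title not in phrase_titles
    ]


# Hypothetical titles, not real search results.
broad = [
    "File:Cradle 1.jpg",
    "File:Cradle 2.jpg",
    "File:Cradle 3.jpg",
    "File:Loose match.jpg",
]
phrase = ["File:Cradle 1.jpg", "File:Cradle 2.jpg", "File:Cradle 3.jpg"]

extras = only_in_broader(broad, phrase)
print(extras)
```

If the extra results all carry ranks near the end of the broader list, that matches the observation that they are scored lower than the top results.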

Searching in different languages is the last challenging item to test. How well does the search perform in different languages? What should the expected results be? If I search for кот (Russian: cat), should I expect to see only the images that have кот in their structured data or description? Or should the results be the same as for cat since, thanks to Wikidata, the labels for Cat (Q1022892) include many languages? The search results for кот are very, very similar to the search results for cat:

Screen Shot 2021-06-30 at 5.08.28 PM.png (1×2 px, 3 MB)

So far, so good. Testing for chat (French: cat) returns cat images and images that are relevant to "chat" (as a conversation), which is expected (and maybe even desired) behavior. So when the phab task T282583 "Searching in non-English languages doesn't return expected results" was filed, it caught me by surprise. The description of the phab task is short:

For example


contains a lot of images with 'chien' in the title before we get any images of dogs, despite the uselang=fr param. >statement matches are rated more highly than title matches, so that's unexpected

While testing searching in different languages, I had viewed results of suboptimal relevance as acceptable. But logically, when a user sets the UI language in User preferences or via uselang=, the search should give preferential ranking to results that relate to that language! So searching for chien (French: dog) with uselang=fr should return more pictures of dogs in the top results than the same search performed with uselang=en. The issue was promptly fixed, and the table below shows the improvement.

Before the fix - commons wmf.7 | After the fix - commons wmf.9
search term "chien" | search term "chien"
French UI: https://commons.wikimedia.org/w/index.php?search=chien&title=Special:MediaSearch&go=Go&type=image&uselang=fr | The second image has dogs, which is a good improvement.
Screen Shot 2021-05-27 at 11.48.04 AM.png (1×2 px, 3 MB) | Screen Shot 2021-06-09 at 4.30.00 PM.png (1×2 px, 3 MB)
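
To repeat the same search under several UI languages, the French UI link from the table can be generated rather than edited by hand. The snippet below is a small sketch that only mirrors the query parameters visible in that link; the function name mediasearch_url is my own, and nothing here is an official client.

```python
from urllib.parse import urlencode

BASE_URL = "https://commons.wikimedia.org/w/index.php"


def mediasearch_url(term, uselang, media_type="image"):
    """Build a Special:MediaSearch URL for a search term and UI language."""
    params = {
        "search": term,
        "title": "Special:MediaSearch",
        "go": "Go",
        "type": media_type,
        "uselang": uselang,
    }
    # Keep ':' unescaped so the page title reads the same as in the original link.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


for lang in ("fr", "en"):
    print(mediasearch_url("chien", lang))
```

Opening the fr and en links side by side makes it easy to compare how many dog pictures appear in the top results for each UI language.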

Testing search made me aware of the many ways our users might evaluate search results and, based on the testing results, helped me see whether MediaSearch needs to be improved. And it's incredibly rewarding to see an improved search experience, since efficient and reliable search functionality adds a lot to the user experience.

Special thanks to @matthiasmullie and @Cparle for guidance, insights, and specific tips on testing search functionality in MediaSearch.

  1. Research:Measuring User Search Satisfaction, https://meta.wikimedia.org/wiki/Research:Measuring_User_Search_Satisfaction, retrieved on June 28, 2021.
  2. "How scoring works in Elasticsearch", https://www.compose.com/articles/how-scoring-works-in-elasticsearch/, retrieved on June 28, 2021.
Written by Etonkovidova on Jul 1 2021, 2:54 AM.