
[L] Use English labels/aliases of matching wikidata entities to expand the wikitext, title, caption & categories search terms
Closed, ResolvedPublic

Description

When a Dutch user searches for “vleermuis” and we find that it matches entity Q28425 (“bat”), it makes sense to also search for “bat” and “Chiroptera” in the title, wikitext, categories and English caption, because most content just happens to be in English. These English matches could get a slightly lower rank, to better surface own-language matches, which are likely more relevant.

Event Timeline

CBogen renamed this task from Use English labels/aliases of matching wikidata entities to expand the wikitext, title, caption & categories search terms to [L] Use English labels/aliases of matching wikidata entities to expand the wikitext, title, caption & categories search terms. Jul 15 2020, 4:34 PM

Change 643388 had a related patch set uploaded (by Anne Tomasevich; owner: Anne Tomasevich):
[mediawiki/extensions/WikibaseMediaInfo@master] Add English aliases of top related Wikidata items to search query

https://gerrit.wikimedia.org/r/643388

I think I've gotten this to a point where it provides a benefit to the results set, but it's going to need a lot of review and some discussion with @matthiasmullie (in 2021!)

As noted in the patch, through a LOT of manual testing, I have found that the following seems to work well for a variety of search terms:

  • Limiting to the 2 most highly-scored related Wikidata items
  • Limiting to 10 aliases
  • Adding a filter query that contains a phrase MultiMatch query per synonym. I tried adding BoolQueries and playing around with the minimum_should_match option both on the per-synonym queries and on the overall BoolQuery, but this either let in a lot of noise, or was so restrictive that it made it nearly impossible to add anything to the results set based on synonyms. The phrase queries mean that not a lot of results will be added that weren't already there due to the search term (not always, see the opossum example below), but that seems to be a good thing because otherwise a ton of unrelated files might be added (see Diablos Rojos del México example below). Better to focus on boosting files that are more likely to be relevant due to the presence of synonyms via the ranking queries.
  • Adding ranking queries similar to those for the search term, with a couple of differences:
    • Multiplying the field boosts by a factor between 0 and 1. I found that 0.25 was enough to see a positive effect without boosting synonym-related results too much. This is still kind of arbitrary, though, and I'm not sure if it's outweighing statements too much.
    • Using DisMax to add only the highest score from a synonym to the overall score (rather than adding the score for each synonym)
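
A minimal Python sketch of the query shape described in the list above. This is not the code from the patch: the function names, field names, base boosts and item scores are hypothetical, and treating the 10-alias limit as an overall cap (rather than per item) is an assumption. Only `bool`, `multi_match`, `dis_max` and `minimum_should_match` are standard Elasticsearch query DSL.

```python
SYNONYM_BOOST_FACTOR = 0.25  # factor reported in the list above
MAX_ITEMS = 2                # most highly scored related Wikidata items
MAX_ALIASES = 10             # assumption: an overall cap, not a per-item cap


def select_synonyms(scored_items):
    """Pick English labels/aliases from the top-scored items.

    scored_items: list of (score, [label_or_alias, ...]) tuples.
    """
    top = sorted(scored_items, key=lambda item: item[0], reverse=True)[:MAX_ITEMS]
    synonyms = []
    for _score, terms in top:
        for term in terms:
            if term not in synonyms:  # keep order, drop duplicates
                synonyms.append(term)
    return synonyms[:MAX_ALIASES]


def build_query(term, synonyms, field_boosts):
    """Return an Elasticsearch query body as a plain dict."""
    names = list(field_boosts)
    term_fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    synonym_fields = [
        f"{name}^{boost * SYNONYM_BOOST_FACTOR}"
        for name, boost in field_boosts.items()
    ]

    # Filter: a file qualifies if it matches the search term or contains
    # any synonym as an exact phrase.
    filter_clauses = [{"multi_match": {"query": term, "fields": names}}]
    filter_clauses += [
        {"multi_match": {"query": s, "type": "phrase", "fields": names}}
        for s in synonyms
    ]

    # Ranking: the search term scores at full boost. Of the synonyms, only
    # the best-scoring one contributes (dis_max), at a reduced boost.
    ranking = [{"multi_match": {"query": term, "fields": term_fields}}]
    if synonyms:
        ranking.append({
            "dis_max": {
                "queries": [
                    {"multi_match": {"query": s, "fields": synonym_fields}}
                    for s in synonyms
                ]
            }
        })

    return {
        "bool": {
            "filter": [{"bool": {"should": filter_clauses, "minimum_should_match": 1}}],
            "should": ranking,
        }
    }


# Illustrative input only: scores and alias lists are made up.
items = [
    (0.09, ["myotis myotis", "greater mouse-eared bat"]),
    (0.08, ["chiroptera", "bats", "the bats"]),
    (0.07, ["vleermuis"]),  # third item, dropped by MAX_ITEMS
]
query = build_query("vleermuis", select_synonyms(items), {"title": 3.0, "text": 1.0})
```

Keeping the phrase requirement in the filter and the reduced boost in the ranking clauses separates "what gets in" from "how high it ranks", which is the trade-off the list above describes.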
To do
  • Consider adding in user language aliases + English aliases
  • Add to test data and expected queries (probably won't do this unless we decide that adding aliases is indeed worth it)
  • Once we can, change code in the entity fetcher so we're not making extra calls to get the aliases
  • Test with media types other than images
  • I feel like this is going to need so much manual testing to confirm that it's actually beneficial and not detrimental. It's hard to think of test cases where this should make a big difference. For many search terms, the addition of this work should not make a discernible difference.
Questions
  • As the patch currently stands, do we need to worry about synonyms outweighing statements? I wasn't sure how to evaluate this.
  • I've found that the minimum score for synonyms is 0.8 or 1 when I would expect it to be 0 if no instance of the synonym is found (see example data below). Need to figure out why this is happening.
Didelphidae - Gracilinanus microtarsus.JPG
I, the copyright holder of this work, hereby publish it under the following licenses: You may select the license of your choice. English author name string:
article id	23130135
ES score	  36.13622
ES explain 36.13622  Sum of the following:
                     35.13622 Minimum Of:
                         35.13622 {code='exp(_score)', options={}, params={}}" and parameters: {}
                         3.4028235E+38 maxBoost
                     1 Minimum Of:
                         1 {code='exp(_score)', options={}, params={}}" and parameters: {}
                         3.4028235E+38 maxBoost
Notable test cases
  • sand cat: This is a search term that currently produces lots of irrelevant results, but whose Wikidata alias ("felis margarita") produces lots of highly relevant (and, might I add, cute) results. I'll note that this actually seems to be a fairly rare case. The open patch adds a few files to the results set and boosts some of them higher in the results. It's a moderate effect, but it's the best I could do without adding noise to results for other search terms.
  • Didelphidae: This is a Wikidata item with the alias "opossums," which is a much more common term than the taxon name. This is a case where adding in the synonym filter query brings in many additional results (this search term garnered 94 hits before this work and 1125 after). The additional results appear to be quite relevant...almost. Unfortunately, this also brings in files uploaded by users with "opossum" in their username. However, I think this is a separate issue, and we've discussed as a team only searching by username when explicitly requested by the user (via advanced search or something).
  • Diablos Rojos del México: This is a football team in Mexico City for which there are about 30 images in the search results. I had a hard time avoiding the addition of irrelevant results for this search term, particularly when I was working with the best_fields query type for the filter queries. Adding minimum_should_match = 2 didn't help because the Wikidata item has 5 aliases containing common words like "diablos" and "Mexico," which pulled in a ton of unrelated results (Mexican flags, pictures of sights in Mexico City, etc.)
  • Laura Jane Grace: This is a case where the results for the Wikidata item label ("Laura Jane Grace") are pretty good, while the results produced by searching for one of this Wikidata item's aliases ("LJ Grace") are totally irrelevant. Fortunately, with the way things are currently configured in the open patch, none of those irrelevant results are added to the results set for "Laura Jane Grace"
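
The Diablos Rojos del México case can be illustrated with a toy matcher. This is a rough sketch, not how Elasticsearch analyzes text: tokenization here is a plain lowercase split, and the two file descriptions are made-up examples. It only shows why requiring any 2 shared words still admits unrelated files, while a phrase match does not.

```python
def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and split on spaces."""
    return text.lower().split()


def matches_min_should(doc, alias, minimum=2):
    """True if at least `minimum` distinct alias words occur in the doc."""
    return len(set(tokens(alias)) & set(tokens(doc))) >= minimum


def matches_phrase(doc, alias):
    """True if the alias words occur in the doc consecutively and in order."""
    d, a = tokens(doc), tokens(alias)
    return any(d[i:i + len(a)] == a for i in range(len(d) - len(a) + 1))


alias = "Diablos Rojos del México"
unrelated = "Bandera del México en el Zócalo"         # hypothetical file description
related = "Estadio de los Diablos Rojos del México"   # hypothetical file description

# "del" and "méxico" alone are enough to satisfy minimum=2.
noise_admitted = matches_min_should(unrelated, alias)
noise_as_phrase = matches_phrase(unrelated, alias)
hit_as_phrase = matches_phrase(related, alias)
```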

Nice analysis, @AnneT!

  • Multiplying the field boosts by a factor between 0 and 1. I found that 0.25 was enough to see a positive effect without boosting synonym-related results too much. This is still kind of arbitrary, though, and I'm not sure if it's outweighing statements too much.

Do you think it'd make sense to dis_max the scores for existing search term & synonyms so that only the winner of those counts?
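
To make the difference concrete, here is a small sketch with made-up scores. The function names are hypothetical. `tie_breaker` mirrors the option of the same name on Elasticsearch's `dis_max` query, where the final score is the best clause score plus `tie_breaker` times the scores of the other matching clauses.

```python
def additive_score(term_score, synonym_scores):
    """Search term score plus the best synonym score (synonyms dis_max'ed among themselves)."""
    return term_score + max(synonym_scores, default=0.0)


def dis_max_score(scores, tie_breaker=0.0):
    """Only the winner counts; the rest contribute tie_breaker * their sum."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)


term_score = 4.0             # made-up score for the search term
synonym_scores = [1.5, 0.5]  # made-up scores for two synonyms

summed = additive_score(term_score, synonym_scores)
winner_only = dis_max_score([term_score] + synonym_scores)
with_tie_breaker = dis_max_score([term_score] + synonym_scores, tie_breaker=0.5)
```

With a winner-only combination, a file that matches both the term and a synonym scores no higher than one that matches the term alone, so synonyms can no longer stack on top of the original match.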

  • I feel like this is going to need so much manual testing to confirm that it's actually beneficial and not detrimental. It's hard to think of test cases where this should make a big difference. For many search terms, the addition of this work should not make a discernible difference.

I think you might be underestimating the impact this seems to have on non-English languages here :)

  • As the patch currently stands, do we need to worry about synonyms outweighing statements? I wasn't sure how to evaluate this.

I don't know. We should probably A/B test this. Not sure how much the positive impact (mostly in non-English searches) would show, but at least it'd let us know about any negative impact.

  • I've found that the minimum score for synonyms is 0.8 or 1 when I would expect it to be 0 if no instance of the synonym is found (see example data below). Need to figure out why this is happening.

That's an artefact of the score normalization based on amount of terms, whose math is a bit of a hack. I think that has since gotten solved, though.
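
For reference, the numbers in the explain output quoted in the question can be reproduced arithmetically. This is only one reading of that output, not a confirmed diagnosis: it assumes each part is `min(exp(raw_score), maxBoost)`, as the explain text suggests, and it does not account for the 0.8 case, so the normalization mentioned above may well be the real cause.

```python
import math

MAX_BOOST = 3.4028235e38  # value shown as maxBoost in the explain output


def capped_exp(raw_score):
    """min(exp(raw_score), maxBoost), following the explain text."""
    return min(math.exp(raw_score), MAX_BOOST)


# A clause whose raw score is 0 (nothing matched) still contributes exp(0) = 1.
no_match_contribution = capped_exp(0.0)

# The two parts listed in the explain output add up to the reported ES score.
total = 35.13622 + no_match_contribution

# Raw score that would produce the larger part, under the same assumption.
implied_raw_score = math.log(35.13622)
```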

  • Didelphidae: ... Unfortunately, this also brings in files uploaded by users with "opossum" in their username. However, I think this is a separate issue, and we've discussed as a team only searching by username when explicitly requested by the user (via advanced search or something).

Agree that it's probably a separate issue (it is just as much an issue when one would search "opossum" directly), and this will likely be dealt with (to some extent) in T271799.

  • Diablos Rojos del México: This is a football team in Mexico City for which there are about 30 images in the search results. I had a hard time avoiding the addition of irrelevant results for this search term, particularly when I was working with the best_fields query type for the filter queries. Adding minimum_should_match = 2 didn't help because the Wikidata item has 5 aliases containing common words like "diablos" and "Mexico," which pulled in a ton of unrelated results (Mexican flags, pictures of sights in Mexico City, etc.)

Your current patch uses a phrase in the filter. Would it make sense to also only use phrases for scoring synonyms?

Thanks for the feedback and suggestions, @matthiasmullie! I've pushed some updates to my patch (and will be rebasing and refactoring it soon, too).

I've responded to your comments in the code but there's one thing that wasn't covered there:

Your current patch uses a phrase in the filter. Would it make sense to also only use phrases for scoring synonyms?

I tried this and it crashed the page...I saw a search_phrase_execution_exception with the message "Failed to create query, all shards failed" and a cirrussearch-parse-error with the message "ApiUsageException: Query was not understood. Please make it simpler. The query was logged to improve the search system." I remember running into this during my initial work, maybe because I was trying this same thing, and I figured the multiple phrase queries were just too much for Elastic. But that was just my instinct. Do you know why this might be happening?

Regardless, I think the phrase query is far more important for the filter since it ensures that only the most relevant results are included in the results set at all. We might get better rankings if we can figure out how to make this work, but if we can't, I don't think it's that big of a deal.

(copying this over from Slack)

Using this tool to measure media search matches, I tested a random sample of 20,000 unillustrated articles from Arabic and Cebuano Wikipedias before and after adding synonyms to the search query. This led to slightly more Cebuano articles with matches, but over 4 times as many Arabic articles with matches.

Arabic

  • Before: 8.56% with matches (estimated 47,238 total)
  • With synonyms: 37% with matches (estimated 204,184 total)

Cebuano

  • Before: 10.13% with matches (estimated 106,789 total)
  • With synonyms: 11.31% with matches (estimated 119,288 total)
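
A quick arithmetic check of the figures above, as a sketch. The percentages and estimated totals are copied from this comment; the "implied number of unillustrated articles" is derived here by dividing each estimate by its percentage and is not a figure reported in the thread.

```python
RESULTS = {
    "Arabic": {"before_pct": 8.56, "after_pct": 37.0,
               "before_est": 47_238, "after_est": 204_184},
    "Cebuano": {"before_pct": 10.13, "after_pct": 11.31,
                "before_est": 106_789, "after_est": 119_288},
}


def improvement_ratio(row):
    """How many times more articles have matches once synonyms are added."""
    return row["after_pct"] / row["before_pct"]


def implied_population(estimate, pct):
    """Number of unillustrated articles implied by an estimate and its percentage."""
    return estimate / (pct / 100)


for wiki, row in RESULTS.items():
    ratio = improvement_ratio(row)
    population = implied_population(row["before_est"], row["before_pct"])
    print(f"{wiki}: x{ratio:.2f} with synonyms, ~{population:,.0f} unillustrated articles implied")
```

The before and after estimates for each wiki imply roughly the same article count, which suggests both were extrapolated from the same base.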

Change 682569 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/extensions/Wikibase@master] Add 'language' param to pageterms & entityterms prop

https://gerrit.wikimedia.org/r/682569

This is blocked on a change in Wikibase (T282654) that's pending review.
Once that lands, this should also be good to go.

Change 682569 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Add 'language' param to pageterms & entityterms prop

https://gerrit.wikimedia.org/r/682569

Change 643388 merged by jenkins-bot:

[mediawiki/extensions/WikibaseMediaInfo@master] Add English label & aliases of top related Wikidata items to query

https://gerrit.wikimedia.org/r/643388

Etonkovidova added a subscriber: Etonkovidova.

(1) Testing for “vleermuis” - search=vleermuis&title=Special:MediaSearch&uselang=nl&type=image&cirrusDumpQuery

"vleermui"    wikidata item    boost
Q723413       Myotis myotis    P180=Q723413 0.0911; P6243=Q723413 0.1002
Q28425        Chiroptera       P180=Q28425 0.08323; P6243=Q28425 0.09156
Q104870084    VLEERMUIS        P180=Q104870084 0.07398; P6243=Q104870084 0.08138

The number of results doesn't differ much for different UI languages.
Search for vleermuis

UI lang    Number of results
en         337
nl         340
zh         337

(2) Testing for search terms in the comment above - Notable test cases

  • Search for sand cat
    UI lang    Number of results
    en         3,385
    nl         3,386
    zh         3,386
  • Search for Didelphidae
    UI lang    Number of results
    en         7
    nl         7
    zh         16
  • Search for Diablos Rojos del México
    UI lang    Number of results
    en         11
    nl         11
    zh         11
  • Search for Laura Jane Grace
    UI lang    Number of results
    en         29
    nl         29
    zh         29

(1) Testing for “vleermuis” - search=vleermuis&title=Special:MediaSearch&uselang=nl&type=image&cirrusDumpQuery

The number of results doesn't differ much for different UI languages.

While this new profile has been merged, it isn't yet active by default (we still need to test performance).

In order to activate it, you should also include &mediasearch_synonyms in the url, like this:
https://commons.wikimedia.org/w/index.php?title=Special:MediaSearch&search=vleermuis&uselang=nl&type=image&mediasearch_synonyms&cirrusDumpQuery

You should then see that the Elastic query also includes extra "multi_match" filter clauses for the additional terms (in this case: myotis myotis, greater mouse-eared bat, chiroptera, bats & the bats).
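
For repeating this across terms and UI languages, a small helper can build the test URLs. This is only a sketch: the function name is hypothetical, and it assumes the two flags can be passed as bare parameters without a value, exactly as in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://commons.wikimedia.org/w/index.php"


def media_search_url(term, uselang, synonyms=True, dump_query=True):
    """Build a Special:MediaSearch URL, optionally with the synonyms profile."""
    params = {
        "title": "Special:MediaSearch",
        "search": term,
        "uselang": uselang,
        "type": "image",
    }
    # Keep ":" unescaped so the title reads the same as in the URL above.
    parts = [urlencode(params, safe=":")]
    if synonyms:
        parts.append("mediasearch_synonyms")  # bare flag, no value
    if dump_query:
        parts.append("cirrusDumpQuery")       # dumps the query instead of running it
    return BASE_URL + "?" + "&".join(parts)


urls = [media_search_url("vleermuis", lang) for lang in ("en", "nl", "zh")]
```

Dropping `dump_query` gives the regular results page, which is what the result counts need.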


I repeated these searches with the synonyms profile enabled:

Search for vleermuis

UI lang    Number of results
en         395
nl         783
zh         499

(2) Testing for search terms in the comment above - Notable test cases

  • Search for sand cat
    UI lang    Number of results
    en         3,449
    nl         3,403
    zh         3,451
  • Search for Didelphidae
    UI lang    Number of results
    en         471
    nl         472
    zh         474
  • Search for Diablos Rojos del México
    UI lang    Number of results
    en         187
    nl         187
    zh         187
  • Search for Laura Jane Grace
    UI lang    Number of results
    en         29
    nl         29
    zh         29

The most obvious differences here are:

  • vleermuis, the only term here that you wouldn't expect to see in English-dominated content, shows the most pronounced difference between languages, because the search can now also find media that include the term "bat" or "chiroptera"
  • didelphidae suddenly returns many more results (in all languages) because the search can now also find files that have any of these aliases in their title/description/caption: opossum(s), possum(s)
  • Diablos Rojos del México gained many more results, most of which probably come from the "red devils" alias, which brings in many matches that may not actually be good ones (multiple sports teams are called that)
  • results often increase overall, across all languages, because:
    • even when you were already searching in English, you may find additional results via the other English aliases
    • when you're searching for a term in another language, relevant matches in other languages might still be found because of the language fallback system (though likely with a smaller relevance score/boost, potentially low enough for their aliases to be discarded, especially if there are other, better matches in that language)
CBogen added a subscriber: CBogen.

Reopening and putting in Needs QA based on the comment above.

@matthiasmullie should this same task be used to make the profile active when we're ready, or do we need a new task for that?

Either WFM. I'll create (a) new task(s) for additional follow-up, then this one can be closed.

Etonkovidova added a subscriber: Matthias.

Thanks @Matthias for the comment above! I added the test cases to my list of MediaSearch testing notes.