MediaSearch searches are more restrictive than Special:Search
Closed, Resolved (Public)

Description

MediaSearch returns fewer results from the File namespace than Special:Search does.

Note: Special:Search with only the File namespace selected triggers the MediaSearch backend, so in order to see the difference you'll have to select the File namespace and at least one other namespace.

Example from production:

  • MediaSearch for cat that plays returns 398 results
  • Special:Search for cat that plays returns 141710 results (searching File and Institution namespaces - searching Institution on its own returns zero results)

Example from betalabs:

  1. On betalabs, in a private browser window without logging in, enter the search term "rose" in the search field (https://commons.wikimedia.beta.wmflabs.org/w/index.php?search=rose&title=Special:MediaSearch&type=image). Only two results will be displayed:

Screen Shot 2021-11-01 at 12.19.22 PM.png (1×2 px, 498 KB)

  2. Click on "Switch to a Special:Search" - Special:Search will display more results than Special:MediaSearch

Screen Shot 2021-11-01 at 10.39.29 AM.png (1×2 px, 708 KB)

Event Timeline

The problem here is our normalizeFulltextScores code. Essentially, it reduces the score from the text part of the elasticsearch query if we're searching for more than one word[1]. The reason is that the more words you search for at a time, the higher the score you get from elasticsearch, so we reduce it a little so that the fulltext part of the query doesn't swamp the statements part.

In order to do this we multiply the fulltext score by a factor. Elasticsearch doesn't provide a native way to do this, so we have been working around it using logarithms ... but the log of a score is a negative number if the score < 1, and elastic doesn't like negative numbers. We have more workarounds for this, but the upshot is we end up excluding a bunch of results with low scores ... and the only way to avoid that is to give all low-scoring documents the same fulltext score.

I kinda suspect the best option here might be to just remove the normalisation. @matthiasmullie what do you think?


[1] It simply multiplies the score for the fulltext part of the query by 0.8 if the query has more than one word in it.
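To make the trade-off above concrete, here is a minimal numeric sketch in Python. It is not the actual normalizeFulltextScores code or an elasticsearch query: the function names, the sample scores, and the clamping floor of 1.0 are all assumptions chosen for illustration. It only shows why a log-based workaround either drops documents scoring below 1 or flattens them to the same value.

```python
import math

# Hypothetical fulltext scores for three documents (made-up values).
scores = {"doc_high": 5.0, "doc_low_a": 0.6, "doc_low_b": 0.2}


def drop_negative_logs(raw_scores):
    """Illustrative option A: keep only documents whose log-score is >= 0.

    Any document with a raw score below 1 has a negative logarithm,
    so it is excluded from the results entirely.
    """
    return {
        doc: math.log(score)
        for doc, score in raw_scores.items()
        if math.log(score) >= 0
    }


def clamp_before_log(raw_scores, floor=1.0):
    """Illustrative option B: clamp scores to a floor before taking the log.

    No document is dropped, but every score below the floor collapses
    to log(floor), so low-scoring documents become indistinguishable.
    """
    return {doc: math.log(max(score, floor)) for doc, score in raw_scores.items()}


kept = drop_negative_logs(scores)
clamped = clamp_before_log(scores)

print(sorted(kept))          # only the document scoring >= 1 survives
print(clamped["doc_low_a"], clamped["doc_low_b"])  # both low scorers get the same value
```

Under these assumptions, option A corresponds to "excluding a bunch of results with low scores", and option B to "give all low-scoring documents the same fulltext score".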

Change 737911 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/extensions/WikibaseMediaInfo@master] Account for matches with 0 < score < 1 in normalization

https://gerrit.wikimedia.org/r/737911

Cormac described the issue quite accurately: since the new scoring profile (logistic regressions) was introduced, boosts have changed massively.
It used to be safe to ignore scores between 0 and 1, but not anymore, and a lot of field scores now end up being discarded by the normalization hack.

We'll try to fix the problem in the normalization for now (wrapping up the patch).
Once that fix has gone out, we'll remove the normalization altogether (we suspect it no longer has a significant & consistent impact) while keeping an eye on the metrics (to validate that suspicion).

Change 737911 merged by jenkins-bot:

[mediawiki/extensions/WikibaseMediaInfo@master] Account for matches with 0 < score < 1 in normalization

https://gerrit.wikimedia.org/r/737911

Once that fix has gone out, we'll remove the normalization altogether (we suspect it no longer has a significant & consistent impact) while keeping an eye on the metrics (to validate that suspicion)

Split this part up into T296631

... so I guess this can be closed now, right? Or put into waiting for QA?

Checked on commons betalabs - the search results for Special:MediaSearch and Special:Search look identical (as far as I could see). Moving to Verify on Production.