MediaSearch searches are more restrictive than Special:Search
Closed, ResolvedPublic

Assigned To

Authored By

	Etonkovidova
	Nov 3 2021, 5:08 PM

Description

MediaSearch returns fewer results from the File namespace than Special:Search does

Note: Special:Search with only the File namespace selected triggers the MediaSearch backend, so to reproduce this you have to select the File namespace and at least one other namespace.

Example from production:

MediaSearch for cat that plays returns 398 results
Special:Search for cat that plays returns 141710 results (searching File and Institution namespaces - searching Institution on its own returns zero results)

Example from betalabs:

On betalabs, in a private browser window without logging in, enter the search term rose in the search field (https://commons.wikimedia.beta.wmflabs.org/w/index.php?search=rose&title=Special:MediaSearch&type=image). Only two results are displayed:

Screen Shot 2021-11-01 at 12.19.22 PM.png (1×2 px, 498 KB)

Click on "Switch to a Special:Search". Special:Search displays more results than Special:MediaSearch:

Screen Shot 2021-11-01 at 10.39.29 AM.png (1×2 px, 708 KB)

Details

	Subject	Repo	Branch	Lines +/-
	Account for matches with 0 < score < 1 in normalization	mediawiki/extensions/WikibaseMediaInfo	master	+39 -50


Related Objects

Mentioned In: T290853: Refactor Search Handler
Mentioned Here: T296631: Reconsider normalizeFulltextScores implementation

Event Timeline

Etonkovidova created this task.Nov 3 2021, 5:08 PM

Restricted Application added a subscriber: Aklapper.Nov 3 2021, 5:08 PM

Etonkovidova mentioned this in T290853: Refactor Search Handler.Nov 3 2021, 9:17 PM

CBogen assigned this task to Cparle.Nov 4 2021, 3:06 PM

CBogen edited projects, added Structured-Data-Backlog (Current Work); removed Structured-Data-Backlog.

CBogen moved this task from Incoming to Doing on the Structured-Data-Backlog (Current Work) board.

Cparle updated the task description.Nov 5 2021, 5:06 PM

The problem here is our normalizeFulltextScores code. Essentially, it reduces the score from the text part of the elasticsearch query if we're searching for more than one word[1]. The reason is that the more words you search for at a time, the higher the score you get from elasticsearch, so we reduce it a little so that the fulltext part of the query doesn't swamp the statements part.

In order to do this we multiply the fulltext score by a factor. Elasticsearch doesn't provide a native way to do this, so we have been working around it using logarithms ... but the log of a score is a negative number if the score is < 1, and elastic doesn't like negative numbers. We have more workarounds for this, but the upshot is that we end up excluding a bunch of results with low scores ... and the only way to avoid that is to give all low-scoring documents the same fulltext score.

I kinda suspect the best option here might be to just remove the normalisation. @matthiasmullie what do you think?

[1] It simply multiplies the score for the fulltext part of the query by 0.8 if the query has more than one word in it.
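
To make the failure mode concrete, here is a rough Python sketch. It is not the WikibaseMediaInfo code and does not reproduce the actual elasticsearch query. It only assumes that any log-based step is unusable when the log of the score is negative. The function names and the sample scores are made up for illustration; the 0.8 factor is the one from footnote [1].

```python
import math

# From footnote [1]: factor applied to the fulltext score for multi-word queries.
FULLTEXT_FACTOR = 0.8


def intended_score(score: float, word_count: int) -> float:
    """The intended normalization: scale the fulltext score for multi-word queries."""
    return score * FULLTEXT_FACTOR if word_count > 1 else score


def survives_log_step(score: float) -> bool:
    """Hypothetical check: a log-based step only works when log(score) >= 0."""
    return score > 0 and math.log(score) >= 0


# Made-up fulltext scores, two of them in the 0 < score < 1 range.
scores = [0.2, 0.9, 1.0, 3.5]
kept = [s for s in scores if survives_log_step(s)]
dropped = [s for s in scores if not survives_log_step(s)]

print("kept:", kept)
print("dropped:", dropped)
```

Under these assumptions every match with 0 < score < 1 lands in `dropped`, even though a plain multiplication by 0.8 would have kept it with a smaller score.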

Change 737911 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/extensions/WikibaseMediaInfo@master] Account for matches with 0 < score < 1 in normalization

https://gerrit.wikimedia.org/r/737911

gerritbot added a project: Patch-For-Review.Nov 10 2021, 1:11 PM

Cormac described the issue quite accurately: since the new scoring profile (logistic regression) was introduced, boosts have changed massively.
It used to be safe to ignore scores between 0 and 1, but not anymore, and a lot of field scores now end up being discarded by the normalization hack.

We'll try to fix the problem (wrapping up patch) in the normalization for now.
Once that fix has gone out, we'll remove the normalization altogether (we suspect it no longer has a significant & consistent impact) while keeping an eye on the metrics (to validate that suspicion)

Change 737911 merged by jenkins-bot:

[mediawiki/extensions/WikibaseMediaInfo@master] Account for matches with 0 < score < 1 in normalization

https://gerrit.wikimedia.org/r/737911

ReleaseTaggerBot added a project: MW-1.38-notes (1.38.0-wmf.12; 2021-12-06).Nov 16 2021, 1:00 PM

Maintenance_bot removed a project: Patch-For-Review.Nov 16 2021, 1:10 PM

In T294953#7495729, @matthiasmullie wrote:

Once that fix has gone out, we'll remove the normalization altogether (we suspect it no longer has a significant & consistent impact) while keeping an eye on the metrics (to validate that suspicion)

Split this part up into T296631

... so I guess this can be closed now, right? Or put into waiting for QA?

matthiasmullie moved this task from Code Review to Needs QA on the Structured-Data-Backlog (Current Work) board.Nov 29 2021, 8:01 PM

Checked on commons betalabs - the search results for Special:MediaSearch and Special:Search look identical (as far as I could see). Moving to Verify on Production.

Etonkovidova closed this task as Resolved.Dec 8 2021, 10:07 PM

	F34722524: Screen Shot 2021-11-01 at 12.19.22 PM.png
	Nov 3 2021, 5:08 PM

	F34722531: Screen Shot 2021-11-01 at 10.39.29 AM.png
	Nov 3 2021, 5:08 PM
