MediaSearch should not pick up redirects
Closed, Declined (Public)

Description

As a MediaSearch user, I do not want files to appear in search results as a result of redirects, so that I don't receive incorrect results.

This issue was brought up on wiki here.

If you search for [[ https://commons.wikimedia.org/wiki/Special:MediaSearch?type=bitmap&q=Matterhorngletscher | Matterhorngletscher ]], the file File:Aus der Hörnlihütte.jpg appears in the results. "Matterhorngletscher" used to be in the file's name and description but no longer is, so the file should not appear. It seems to show up because of the redirect left behind under the old name, which should not happen.

The goal of this ticket is to remove redirects from the MediaSearch index.

Event Timeline

matthiasmullie added a subscriber: matthiasmullie.

We've been manually assessing thousands of search results, and the data that we have indicates that redirects are a pretty good signal (worse than titles, but better than text).
Ergo: we probably should not remove redirects from the data used by the algorithm just because there's false information in there: that's true for all other fields.
I think the right thing to do in this specific case would be to remove the redirect (given that it's essentially false; you also wouldn't want "dog" to redirect to "Mona Lisa").

Sidenote: in the long run, we *might* be able to address this problem somewhat by setting a threshold, where only results that meet a high enough score (depending on how good the match is and how good the matching signal(s) are) are returned.
This will be another ticket, once we have enough data to inform us what a good threshold would be, and after having verified it wouldn't have the adverse effect of dropping too many valid matches.
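
To make the threshold idea a bit more concrete, here's a minimal sketch. It is not the MediaSearch implementation: the field weights, the `Match` type and `filter_results` are all hypothetical, and only the ordering of the weights (titles better than redirects, redirects better than text) comes from the assessment mentioned above.

```python
from dataclasses import dataclass

# Hypothetical weights. Only the ordering (title > redirect > text) reflects
# the discussion; the actual numbers are made up for illustration.
FIELD_WEIGHTS = {"title": 1.0, "redirect": 0.6, "text": 0.3}


@dataclass
class Match:
    file: str       # file the match belongs to
    field: str      # which field matched the query
    quality: float  # how good the match is, from 0.0 to 1.0


def score(match: Match) -> float:
    """Combine how good the match is with how good the signal is."""
    return FIELD_WEIGHTS[match.field] * match.quality


def filter_results(matches, threshold):
    """Keep each file's best score, drop files below the threshold."""
    best = {}
    for m in matches:
        best[m.file] = max(best.get(m.file, 0.0), score(m))
    kept = [(f, s) for f, s in best.items() if s >= threshold]
    return sorted(kept, key=lambda fs: fs[1], reverse=True)


matches = [
    Match("A.jpg", "title", 0.9),
    Match("B.jpg", "redirect", 1.0),  # matched only through a redirect
    Match("C.jpg", "text", 0.8),
]
print(filter_results(matches, threshold=0.5))
print(filter_results(matches, threshold=0.7))
```

With these made-up numbers, a threshold of 0.5 still lets the redirect-only file through, while 0.7 drops it along with the text-only match. That is the trade-off: a threshold strict enough to hide a stale redirect will also drop some valid matches, which is why we'd need data before picking one.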

Closing as invalid: there's nothing that we can or should do here. Redirects, in general, have proven to be a good source of data, so we can't remove them. This just happens to be an instance where the data is incorrect (much like how many words in a file's description are also fairly irrelevant, but that doesn't mean we shouldn't use descriptions altogether).

So then this needs a policy update on Commons about when to DR redirects. And we need a global cleanup of such wrong redirects to clean up the MediaSearch results.