Problem
When searching for files on a wiki, CirrusSearch constructs a query that targets both the local wiki index and the commons index.
The reason is that files can also be uploaded locally (generally for fair-use images, but also for privacy reasons on private wikis).
To avoid showing duplicate file entries (a file that has the same name on commons and on the local wiki) in the search results, CirrusSearch filters out results from commons that are known to have a duplicate entry on the local wiki.
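As a rough illustration of that filter, the sketch below builds an Elasticsearch-style bool query as a Python dict and excludes commons documents flagged as duplicated on the local wiki. The function name, the `text` match field and the overall query shape are assumptions made for illustration; only the `local_sites_with_dupe` field name comes from this task, and this is not CirrusSearch's actual query-building code.

```python
def build_file_search_query(search_terms, local_wiki):
    """Return a bool query that hides commons files duplicated on local_wiki.

    Hypothetical shape: the "text" field and the query layout are assumptions.
    """
    return {
        "bool": {
            "must": [{"match": {"text": search_terms}}],
            # Commons documents list the wikis that hold a same-named file;
            # drop any document that lists the wiki we are searching from.
            "must_not": [{"term": {"local_sites_with_dupe": local_wiki}}],
        }
    }


query = build_file_search_query("cat", "enwiki")
```

Documents from the local index do not carry the flag in this sketch, so the `must_not` clause only ever removes commons results.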
This filtering relies on data written at index time: any wiki may need to trigger an update to the commons index.
This is not ideal, and it is unclear whether doing this at index time has value:
- only works if the local file is uploaded/updated after the file on commons
- the dupe flag is not cleaned up when no longer necessary
- of all the wikis, enwiki has 79555 identified dupes (only 0.08% of the files in commons), but only 23539 are actually duplicates; in other words, about 70% of those are false positives
- e.g. https://commons.wikimedia.org/wiki/File:Cisti.jpg is marked as dupe for enwiki but this file is not actually part of enwiki
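The false-positive share quoted in the list above can be checked from the two enwiki counts. The snippet below only redoes that arithmetic; the variable names are illustrative.

```python
flagged_dupes = 79555  # commons files flagged as dupes for enwiki
actual_dupes = 23539   # flagged files that really exist on enwiki

# Share of flags that do not correspond to a real local duplicate.
false_positive_rate = (flagged_dupes - actual_dupes) / flagged_dupes
print(f"{false_positive_rate:.1%}")  # prints 70.4%
```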
Solution
Since we do not know how often this filter is effective, it remains unclear whether this behaviour is worth porting to the search update pipeline. Hence, we will disable the filter via a configurable flag and count the duplicates we see in result pages. We will monitor the results, as well as any community feedback, over a period of one month. Our hypothesis is that duplicate results in file searches are rare and can therefore be tolerated. If we are proven wrong, we will turn the filter back on and think of alternative ways of maintaining the local_sites_with_dupe array, for example by using a periodic job.
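A minimal sketch of the two pieces described above, assuming a boolean flag and a result page represented as (wiki, title) pairs. The names `dupe_filter_clause`, `count_duplicates` and the `commonswiki` identifier are hypothetical and do not reflect actual CirrusSearch configuration or metric names.

```python
COMMONS_WIKI = "commonswiki"  # assumed identifier for the commons index


def dupe_filter_clause(local_wiki, filter_enabled):
    """Return the must_not clauses for the dupe filter, or none when disabled."""
    if not filter_enabled:
        return []
    return [{"term": {"local_sites_with_dupe": local_wiki}}]


def count_duplicates(results, local_wiki):
    """Count commons results whose title also appears as a local result.

    results: list of (wiki, title) tuples for one result page.
    """
    local_titles = {title for wiki, title in results if wiki == local_wiki}
    return sum(
        1
        for wiki, title in results
        if wiki == COMMONS_WIKI and title in local_titles
    )


page = [
    ("enwiki", "File:A.jpg"),
    ("commonswiki", "File:A.jpg"),
    ("commonswiki", "File:B.jpg"),
]
duplicates_on_page = count_duplicates(page, "enwiki")  # 1
```

The count only sees duplicates that land on the same result page, which matches the "duplicates we see in result pages" wording but will undercount pairs that are split across pages.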
AC:
- a config option exists that turns off the use of the local_sites_with_dupe filter in Elasticsearch queries
- a metric exists that indicates the number of duplicates in search results
local_sites_with_dupe is no longer maintained and updated
file duplicates are removed from the result set in a similar way to how missing revisions are filtered out