Deepcategory search does not work with MediaSearch on commons
Closed, ResolvedPublic

Description

The example query deepcategory:"Manufacturing by product" works on Special:Search, returning results, but fails on Special:MediaSearch, returning no results.

A brief investigation shows that MediaSearch is not applying the deep category, instead filtering for only the single provided category. A secondary problem is that MediaSearch is not displaying the warning shown on Special:Search, but that may be related to it not properly running the deepcategory in the first place.

AC:
MediaSearch queries that use deepcategory correctly apply the list of categories

Event Timeline

Let's timebox to a day of investigation and decide what to do then.

Change #1143675 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/MediaSearch@master] Forward backend warnings to Special:MediaSearch

https://gerrit.wikimedia.org/r/1143675

I did some testing, and the failure isn't consistent. Reloading the same page multiple times, deepcategory sometimes works in Special:MediaSearch and sometimes doesn't. From monitoring logs while doing these searches, I think what's happening is that the deepcategory query is timing out and the warning is not propagated to the user.

The above patch adds code to propagate the warnings to the user so they at least know what's going on. As to fixing the problem, I'm not really sure. One option we've pondered is to add more options to the deepcat keyword so users can make it less likely to time out, perhaps an option to reduce the category depth. But we don't have a great story around a consistent way to parameterize a keyword like this. Something like deepcat:"4;Manufacturing by product" might be a plausible way to request a lower depth?
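For illustration only, here is a minimal Python sketch of how a value like "4;Manufacturing by product" could be split into a depth and a category name. The function name `parse_deepcat_value`, the `;` separator handling, and the clamping behaviour are all assumptions made for this sketch, not existing CirrusSearch code; the default of 5 is taken from the current depth limit mentioned in this task.

```python
def parse_deepcat_value(value, default_depth=5, max_depth=5):
    """Split a hypothetical 'DEPTH;Category name' keyword value.

    Returns (depth, category). If there is no numeric prefix, the whole
    value is treated as the category name and the default depth is used.
    """
    head, sep, tail = value.partition(";")
    if sep and head.strip().isdecimal():
        # Never allow a user-supplied depth to exceed the configured maximum.
        depth = min(int(head), max_depth)
        return depth, tail.strip()
    # No usable prefix: leave category names that contain ';' untouched.
    return default_depth, value.strip()


print(parse_deepcat_value("4;Manufacturing by product"))
print(parse_deepcat_value("Manufacturing by product"))
```

One design question this leaves open is how to treat category titles that legitimately begin with a number followed by a semicolon; a real implementation would need an escaping rule for that case.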

Change #1143675 merged by jenkins-bot:

[mediawiki/extensions/MediaSearch@master] Forward backend warnings to Special:MediaSearch

https://gerrit.wikimedia.org/r/1143675

This looks to now work as expected. Both of the example links give the same warning message.

As mentioned above, the particular category used here is right on the edge of the timeout. Sometimes the query results in a warning about too many categories, and sometimes it results in a warning about the category query timing out. When the query times out, the Special:Search link still returns results while the Special:MediaSearch link does not, but that is only due to namespace choices. The Special:Search query returns results from the category namespace that are in the named category, while there are no files for MediaSearch to return in that parent category.

One option I've briefly investigated is changing the way our deepcat query runs. Currently the query is something like (pseudo-code) sort(all_results, by=depth)[:limit]. It has the benefit of returning the top-n nodes closest to the source category, but the downside is that there is no possibility of an early exit: the db has to visit the full subset of the category graph.

An alternate query would be something like all_results[:limit]. This will return the first-n nodes that the graph db visits. It would have no particular guarantees about which subset of categories is returned, but it would keep the guarantee that all returned categories are within 5 steps of the source category. It also likely means that different backend servers will return a different subset of categories, so repeating the same deepcat query may get different results. The upside is that this can exit early and should avoid timeouts on even the largest categories.

I'm not sure which tradeoff is better though?
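To make the tradeoff concrete, here is a toy Python sketch of the two strategies over an in-memory category graph. It is not the actual graph db query: the `walk`, `closest_n`, and `first_n` names, the dict-based graph, and the `stats` counter are all assumptions for illustration. Note that this toy walker is breadth-first, so both strategies happen to return the same categories here; a real graph db gives no such ordering guarantee. The point of the sketch is the difference in how many nodes get visited.

```python
from collections import deque
from itertools import islice


def walk(graph, source, max_depth=5, stats=None):
    """Lazily yield (category, depth) pairs reachable within max_depth."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, depth = queue.popleft()
        if stats is not None:
            stats["visited"] = stats.get("visited", 0) + 1
        yield node, depth
        if depth < max_depth:
            for child in graph.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append((child, depth + 1))


def closest_n(graph, source, limit, max_depth=5, stats=None):
    """sort(all_results, by=depth)[:limit]: must visit the whole subgraph."""
    all_results = list(walk(graph, source, max_depth, stats))
    all_results.sort(key=lambda pair: pair[1])
    return [name for name, _ in all_results[:limit]]


def first_n(graph, source, limit, max_depth=5, stats=None):
    """all_results[:limit]: stops as soon as `limit` nodes have been seen."""
    return [name for name, _ in islice(walk(graph, source, max_depth, stats), limit)]


# Hypothetical toy graph: parent category -> list of subcategories.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
sorted_stats, early_stats = {}, {}
print(closest_n(graph, "A", 3, stats=sorted_stats), sorted_stats)
print(first_n(graph, "A", 3, stats=early_stats), early_stats)
```

With a limit of 3 on this five-node graph, the sorted variant visits all five nodes before truncating, while the early-exit variant stops after three.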

Great that this works now. EBernhardson, is there a way to leave things as they are but give the user the option to click a button to load more files (larger depth)? e.g. [click here to load more]

Here is an example whose real-world use cases include: 1. contributors fixing miscategorizations, who use it to spot files that are not microscopic images and subcategorize them accordingly, and 2. users interested in good-quality microscopic images who want to browse them conveniently without digging through countless unsortable nested subcategories. It does not load all files, which is an issue when, for example, you'd like to sort by recency, search further within the results (e.g. with an extra search term), or have scrolled to the end of the search results.

Edit: separate issue is now at T395348

@Prototyperspective : Adding more UI and more ways to navigate categories as part of Search sounds like a reasonable idea, but is outside of the scope of this particular ticket. Could you open another feature request? Thanks!