
The Commons search "deepcategory" operator often does not work (Deep category query returned too many categories)
Closed, Resolved · Public · 1 Estimated Story Points · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Go to Wikimedia Commons and click on its search bar (maybe from a category one intends to populate / check if it's complete)
  • For example, search for Wikimedia -deepcategory:"Videos about Wikimedia", either in the modern wall-of-images search (MediaSearch), where Cat-a-lot does not work and no error is shown, or in the special search, where Cat-a-lot does work and the error below is shown

What happens?:
In the special search, this error is shown instead of any results: "A warning has occurred while searching: Deep category query returned too many categories"

What should have happened instead?:
The deepcategory operator works just fine in other cases, but there are problems with large categories. Instead of showing no results, it should show as many as possible.

For example, there could be a maximum number of files to scan or a maximum number of categories. I think it would ultimately be very useful if it worked without limiting the number of subcategories to scan and showed some option if the number is large (it could be set to a low maximum by default). It could also keep checking against subcategories until it reaches a subcategory with a very large number of files or surpasses the file-count threshold, and calculate the default from that.

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):
Firefox

Event Timeline

The operator works but AFAIK negation is currently not supported.

@Aklapper That is not true. It does work well here for example (which has been very useful to find files missing in the cat) and this category contains thousands of files.

Please reopen, and if this case really is specific to issues with the negation of the deepcategory parameter, it could be made a subtask of that issue. However, that was only an example; the deepcategory operator also fails on other categories with the error "A warning has occurred while searching: Deep category search timed out. Most likely the category has too many subcategories", which could be a separate issue.

That is not true. It does work well here

Ah, thanks!

Please reopen

You did that yourself already :)

Gehel set the point value for this task to 1. Jul 15 2024, 3:39 PM
Gehel subscribed.

The Search Platform team will spend some time investigating. There have been some issues with Dumps lately, which might have an influence here (the category graph is loaded from dumps). Another potential issue is that the category sub-graph is too large in this case and we bail out early for performance reasons (deep category search is a best-effort service that might not do an exhaustive search of categories).

I can't reproduce with the given example in the description: File:Nut_Grab.jpg (page id 29851242) is properly excluded when searching pageid:29851242 -deepcategory:"Animals with nuts". So I suspect that the problem might have been caused by the issues we had with dumps recently. @Prototyperspective, could you confirm, or possibly provide another example file that does not comply with the search query?
For reference (when writing this comment) the list of categories identified by deepcategory:"Animals with nuts" is:

  • Animals with nuts
  • Animals eating nuts
  • Animals eating peanuts
  • Curculio (larval damage)
  • Animals eating hazelnuts
  • Animals eating walnuts
  • Birds eating nuts
  • Sciurus vulgaris eating walnuts
  • Sciurus vulgaris eating hazelnuts
  • Sciuridae eating peanuts
  • Birds eating peanuts
  • Sciurus carolinensis eating walnuts
  • Curculio nucum (larva)
  • Tamias striatus eating peanuts
  • Sciurus vulgaris eating peanuts
  • Sciurus carolinensis eating peanuts
  • Tamias striatus fed by hand (EIC)

Sorry, bad example; it was probably because the category was new, and it takes a while for the operator to work with a new category. It does exclude the file on my side as well now. It was just an example, though; it also didn't work in many other cases, but there the problem is "A warning has occurred while searching: Deep category search timed out". I think it's best if I edit the issue to make it about this particular cause of it often not working (currently it's only mentioned in a comment). If I notice it failing on a non-new category with another error I'll add that; I think sometimes it didn't work but also didn't show this error, only whitespace. I shouldn't have only put this error in a comment but added it to the issue right away.

Prototyperspective renamed this task from The Commons search "deepcategory" operator often does not work to The Commons search "deepcategory" operator often does not work (Deep category query returned too many categories). Jul 18 2024, 2:58 PM
Prototyperspective updated the task description.
Prototyperspective removed the point value for this task.
Aklapper set the point value for this task to 1. Jul 18 2024, 7:23 PM

In the last seven days, 10,246 search queries with deepcat ran; 428 of them resulted in a "too many categories" error and 47 in a timeout.
Unfortunately there have to be limits somewhere:

  • 256 categories is what we allow at the moment
  • 3 seconds is the timeout after which we fail

We have to ponder the cost vs benefits of increasing these limits.
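For illustration, here is a minimal Python sketch (with an assumed get_subcategories callback; this is not the actual CirrusSearch code) of how a deep category expansion runs into both of those limits: it walks the subcategory graph breadth-first and bails out either when the category budget is exceeded or when the time budget runs out.

```
import time
from collections import deque

def expand_categories(root, get_subcategories, max_categories=256, timeout_s=3.0):
    """Collect root plus all reachable subcategories, subject to both limits."""
    start = time.monotonic()
    seen = {root}
    queue = deque([root])
    while queue:
        if len(seen) > max_categories:
            # corresponds to "Deep category query returned too many categories"
            raise RuntimeError("too many categories")
        if time.monotonic() - start > timeout_s:
            # corresponds to "Deep category search timed out"
            raise TimeoutError("deep category search timed out")
        for sub in get_subcategories(queue.popleft()):
            if sub not in seen:
                seen.add(sub)
                queue.append(sub)
    return seen
```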

There could also be other techniques to greatly increase the 256-category limit (up to a couple of thousand), such as using a terms query on top of a normalized keyword field, but this requires some changes to the analysis config. Moving back to the backlog so that we can decide how to move forward.
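As a rough sketch of that idea (the field name category.keyword_lowercase is an assumption for illustration, not the real mapping): a single terms query against a normalized keyword field sidesteps per-clause analysis and the 1024-clause bool limit, at the cost of the analysis-config change mentioned above.

```
def terms_deepcat_filter(categories):
    """Build one Elasticsearch terms filter instead of one match clause per category."""
    return {
        "terms": {
            # assumed lowercase-normalized keyword sub-field
            "category.keyword_lowercase": [c.lower() for c in categories]
        }
    }
```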

Interesting. However, please keep in mind that:

  • People use this search operator much more rarely if they know it often doesn't work, which reduces those numbers, especially for categories that are likely too large, and also because if it works so unreliably it isn't even considered before or during a search ("how else could I search for what I'm looking for?")
  • More things become possible once this works reliably and also for larger categories, such as excluding certain images in searches, finding missing items for categories, and so on
  • Deepcategory is used by FastCCI (and the Deepcat gadget), which is very useful but broken all the time; this may be due to this issue, and even if not, it could become even more useful if this was fixed – see T367652

Moreover:

  • Instead of it returning no results, please make it return the results from the 256 categories. I don't know why it currently doesn't do this. At the top there could be a note such as "the full category tree could not be included because it contains more than 256 categories".
  • When it comes to server performance costs, I think one would have to consider how the data is stored and retrieved, so that indexing/caching is improved and very large branches are not a problem.
  • One could then think about ways to improve how it scans categories; for example, should categories with no or only a few files count? Wouldn't it be better to exclude one category at level 5 that contains many thousands of files and/or very many subcategories compared to the other branches (maybe it could be listed in the mentioned note) instead of only scanning up to category level 5? I think there could be some kind of auto-detection of which subcategory to exclude and up to which level to scan, which could then be adjusted if needed. Or it could display the included subcategories, sorted by number of files, in a collapsed box at the top so they can be excluded with a click. That's just something for the future and may sound more complicated than it is. For now, I think it would be very useful if it worked with 256 categories instead of showing no results.

Returning empty results was requested as part of T188350; I'm not sure if there was a strong reason against returning partial results, but this is up for discussion.

Gehel triaged this task as Medium priority. Aug 19 2024, 3:42 PM
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

I tried to get an idea of what kind of limits are hard coded into the underlying server, and what kinds of limits we should enforce.

From the Elasticsearch perspective, it looks like the hard limits are quite loose. While there is a limit of 1024 entries in a bool query, we can simply nest 1024 bool queries within another bool query. In testing, Elasticsearch had no direct complaints running 200 bool queries with 500 categories each, giving 100k total categories in a single query. This was tested with P68360.
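A minimal sketch of that chunking approach (this is not the P68360 script itself, and the field name category.lowercase_keyword is assumed): each chunk of categories becomes its own bool/should sub-query, so no single bool ever exceeds the 1024-clause limit.

```
def chunked_deepcat_filter(categories, chunk_size=500):
    """Nest one bool/should per chunk of categories inside an outer bool query."""
    chunks = [categories[i:i + chunk_size] for i in range(0, len(categories), chunk_size)]
    return {
        "bool": {
            "should": [
                {
                    "bool": {
                        "should": [
                            {"match": {"category.lowercase_keyword": cat}} for cat in chunk
                        ]
                    }
                }
                for chunk in chunks
            ]
        }
    }
```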

That did lead to some interesting behaviour from the search clusters, though. I ran it against our eqiad cluster (the busiest one), with a single query at a time (no parallelism) against commonswiki (our largest index). Even though this shouldn't have any effect on the rest of the system, we started seeing some intermittent thread-pool rejections, and p95 latency on more_like queries increased from the typical ~400ms to 3s+. Latency on generic full_text also climbed but stayed under 1s, and since more_like is a full_text query it's hard to say whether general full_text was affected. To verify this was the cause I stopped my script and the errors declined; about 20 minutes later I started it up again and the errors started climbing again. This is not completely conclusive, but correlated enough for me to conclude that the oversized queries caused the issues. This suggests our limits need to be well under 100k categories per query (which would be massive anyway, and also only a fraction of the 15M+ categories on commonswiki). As a curious side note, the per-shard query latency percentiles didn't change: per-shard p95 reported by Elasticsearch stayed at ~300ms while the per-query metric observed by Cirrus climbed to 3s.

The next test will be to find some reasonable limits. The idea is to collect the full set of categories on commonswiki and run 10 queries at a time in parallel, with each query containing a random sample of the available categories. This should hopefully better represent what it might look like if we allow such queries from the internet at large. I will likely start at 1k categories per query and re-run in 1k increments up to 10k.
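A hedged sketch of that test loop (run_deepcat_query is an assumed helper that issues one search against the cluster; this is not the actual benchmark script):

```
import random
from concurrent.futures import ThreadPoolExecutor

def load_test(all_categories, categories_per_query, num_queries, run_deepcat_query):
    """Keep up to 10 deepcat-style queries in flight, each over a random category sample."""
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = [
            pool.submit(run_deepcat_query,
                        random.sample(all_categories, categories_per_query))
            for _ in range(num_queries)
        ]
        return [f.result() for f in futures]
```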

I'm also thinking that if we deploy the expanded deepcat limits we would need to put them behind a poolcounter. Perhaps we generalize the regex poolcounter into an expensive-query poolcounter and put both regex and deepcat behind the same one.

Ran a few tests with a varied number of categories per query at 10 parallel requests (what we allow in the regex poolcounter). Typical latency of queries other than these increases 20-30% while the deepcat queries are running. Of course, in typical operation we (hopefully) wouldn't see someone continuously maxing out the pool counter, but if those are going to be our limits we should understand what happens when a bot exercises them. Querying commonswiki is basically the worst case since it's the largest index.

Percentages shown are the latency effect on p95 of unrelated queries in the given stats bucket. Essentially, simply allowing these expensive queries to run will slow down search for every other use case while they are running.

time period   | # categories | comp suggest | fulltext | morelike
21:20 - 21:50 | 1k           | 10%-15%      | 0%       | 0%
23:20 - 23:40 | 2.5k         | 15%          | 5%-10%   | 5% - 10%
22:50 - 23:15 | 5k           | 20-30%       | 5-15%    | 5-15%
22:10 - 22:40 | 10k          | 50%          | 15%-30%  | 15-30%

Overall, my suggestion would be that somewhere in the 1k-2.5k range would be reasonable to deploy. We could push it higher, but if we did I would prefer to keep a tighter limit on the number of parallel queries. Allowing 10 parallel queries on enwiki (with 7 shards) is probably fine, but on commonswiki (32 shards) they can consume significantly more resources and have more knock-on effects.

Interesting, I thought as well that the 1024-clause limit would apply to nested bool queries (which is probably one reason the limit was set to 256 initially). It means that we can probably safely bump the limit to 1k without even nesting bool queries. I'm not clear why it has such an impact past 2.5k, and I have no clue whether a terms query would perform significantly better; it's less costly for sure since there's no need to analyze and rewrite the query. We could probably test this as well to see the impact?
So perhaps we can at least bump to 1k right now with a simple config change and ponder what to do next based on some testing of the terms query? If the terms query does not show a significant gain compared to nested bool queries we might just stick with the latter.

Change #1070280 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] deepcat: Increase limit to 1k categories

https://gerrit.wikimedia.org/r/1070280

Change #1070281 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Introduce an expensive query pool counter

https://gerrit.wikimedia.org/r/1070281

Change #1070282 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Remove unused Regex pool counter

https://gerrit.wikimedia.org/r/1070282

Change #1070281 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Introduce an expensive query pool counter

https://gerrit.wikimedia.org/r/1070281

Mentioned in SAL (#wikimedia-operations) [2024-09-03T20:24:28Z] <cjming@deploy1003> Started scap sync-world: Backport for [[gerrit:1070281|cirrus: Introduce an expensive query pool counter (T369808)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-03T20:26:42Z] <cjming@deploy1003> ebernhardson, cjming: Backport for [[gerrit:1070281|cirrus: Introduce an expensive query pool counter (T369808)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-03T20:31:15Z] <cjming@deploy1003> Finished scap sync-world: Backport for [[gerrit:1070281|cirrus: Introduce an expensive query pool counter (T369808)]] (duration: 06m 47s)

Change #1070280 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] deepcat: Increase limit to 1k categories

https://gerrit.wikimedia.org/r/1070280

Change #1070282 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Remove unused Regex pool counter

https://gerrit.wikimedia.org/r/1070282

Mentioned in SAL (#wikimedia-operations) [2024-09-30T20:05:34Z] <ebernhardson@deploy2002> Started scap sync-world: Backport for [[gerrit:1070282|cirrus: Remove unused Regex pool counter (T369808)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-30T20:07:59Z] <ebernhardson@deploy2002> ebernhardson: Backport for [[gerrit:1070282|cirrus: Remove unused Regex pool counter (T369808)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-30T20:13:08Z] <ebernhardson@deploy2002> Finished scap sync-world: Backport for [[gerrit:1070282|cirrus: Remove unused Regex pool counter (T369808)]] (duration: 07m 34s)

Amazing to see progress here, which doesn't happen all too often for major issues. Thanks, this is very useful for many diverse applications such as making categories complete and preventing miscategorizations.

However, it still doesn't work on many categories, maybe all of the ones where I previously tried this (and as far as I can see, the change has been deployed by now). For example, when searching for deepcategory:"Videos English" deepcategory:"Videos in Spanish" to try to find files with contradictory language categorization, it shows the "A warning has occurred while searching: Deep category query returned too many categories" error in Special:Search, and no warning but also no files in MediaSearch. I have just submitted a new issue about the missing error message in MediaSearch: T376439. I've also found cases where it does not show an error in Special:Search either.

Moreover, I think deepcategory searches should not fail but should show the results up to the now-increased limit of nested categories and state in the error message which categories have been trimmed off. For example, it also fails when searching the category for music files – example search: deepcategory:"Audio files of music" -deepcategory:"Audio files of music by genre" (link). It would be better if, instead of displaying no files, it displayed probably most of the files plus an info message like "Deep category query returned too many categories so MIDI files of melody settings by Peter Gerloff‎ and Chill-out music from Free Music Archive‎ have been excluded". New separate issue about this here: T376440

I think the biggest concern, and probably the reason why partial results were not shown in the first place, is that it might lead to wrong results for some queries. Especially when negating the keyword, as in -deepcategory:"Large Tree", the partial results could match a file that is actually part of the Large Tree category tree.
We can perhaps show partial results only in some cases, or possibly always by making the error message a bit more explicit about this. Sadly, I'm afraid that showing the list of categories that are not included might not be practical, because the list could be too big to fit into an error message.

Addressed these two points in the separate issue. In short: usually that search operator is not used for exclusion, and when it is used for that, it's usually on a small branch; if not, there are still many cases where the partial results would be very useful, and even your example doesn't mean they are useless – the user may e.g. only need more time to glance over the results to avoid selecting any images of large trees, which the user may need to do anyway since many photos of large trees are not in that cat. The excluded categories could be shown in an auto-collapsed box (preferred), or only shown if there are fewer than 5, to name two solutions.

It now returns incategory search results instead of no search results. This is better than showing no search results, but it is not what this issue is about. I'm adding this note in case people come across this issue and think it's been implemented since there are now search results. It's also problematic that this changed with no error message in MediaSearch – I only found out by searching for deepcategory:"Science", which shows just 1 image because only that many files are located directly in that category rather than in its subcategories (the search results are those of incategory:"Science"). Note that caching could be used to display deepcategory results once people use this more widely, such as via the Deepcat gadget.

Regarding returning incategory results: that's been the default behaviour since 2018, although I agree it's not the most obvious. There are a couple of error cases, and this one is less obvious because MediaSearch isn't showing the warnings the search backend is providing. Querying through Special:Search gives the warning "Deep category search timed out. Most likely the category has too many subcategories". Internally:

  • If the graph query for subcategories returns too many results then the filter is empty
  • If the graph query for subcategories times out then the filter is the source category

It seems like these should be unified to have the same behaviour. Also, MediaSearch should be updated to include the backend search warnings in the UI.
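For illustration only (assumed names, not the actual CirrusSearch code), the two code paths described above amount to something like:

```
class TooManyCategoriesError(Exception):
    pass

class CategoryGraphTimeoutError(Exception):
    pass

def build_deepcat_filter(source_category, expand_subcategories):
    """Sketch of the two fallback paths when expanding a deepcategory filter."""
    try:
        return expand_subcategories(source_category)   # graph query for subcategories
    except TooManyCategoriesError:
        return []                                       # empty filter: no results at all
    except CategoryGraphTimeoutError:
        return [source_category]                        # behaves like incategory: on the source
```

Unifying the behaviour would mean picking one of these two fallbacks for both error cases.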