
Explore how to make a category search that includes sub-categories.
Open, Needs Triage · Public

Description

To encourage and empower edit reviewing and patrolling based on subjects of interest, the Integrated Filters project includes Category search filtering tools (T163433). The usefulness of this potentially powerful function is significantly reduced, however, by a known limitation in how categories are organized: contrary to the behavior of many familiar categorization schemes, wiki categories do not contain the content of their sub-categories. So, paradoxically, the broader the category a user searches on, the fewer page results they are likely to find, because the broadest categories often contain only other categories and no actual articles.

To get around this limitation, we will explore the feasibility of a search that "crawls" down into at least a few of its immediate sub-categories, to find changes that pertain to the articles they contain. At this point, this task is an engineering research project. Questions to explore include:

  • How might this be done?
  • How many sub-levels can be crawled before performance is a problem?
  • Is there, perhaps, some way to adjust the number of levels crawled based on how populous the categories are?

Once we've explored these and other questions, we'll decide whether a scheme to achieve our goal is practical and useful, and how to represent this unusual functionality to users.
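
To make the "crawl" question concrete, here is a minimal sketch of a depth-limited walk over a category tree. It is only an illustration under stated assumptions: the `SUBCATS` and `ARTICLES` dictionaries, the `crawl` function, and the article names are hypothetical toy data, not MediaWiki code or real category contents.

```python
from collections import deque

# Hypothetical toy data: category -> direct subcategories / direct articles.
SUBCATS = {
    "Rock music groups": ["Rock music groups by genre"],
    "Rock music groups by genre": ["Punk rock groups", "Dance-punk musical groups"],
    "Punk rock groups": ["Anarcho-punk groups"],
}
ARTICLES = {
    "Rock music groups": ["List of rock groups"],
    "Punk rock groups": ["Band A"],
    "Dance-punk musical groups": ["Band B"],
    "Anarcho-punk groups": ["Band C"],
}


def crawl(root, max_depth):
    """Return the articles in `root` and in subcategories up to `max_depth` levels down."""
    seen = {root}                # guards against category cycles
    queue = deque([(root, 0)])
    found = set()
    while queue:
        cat, depth = queue.popleft()
        found.update(ARTICLES.get(cat, []))
        if depth == max_depth:
            continue             # do not descend any further
        for sub in SUBCATS.get(cat, []):
            if sub not in seen:
                seen.add(sub)
                queue.append((sub, depth + 1))
    return found
```

With this toy data, a depth of 0 or 1 finds only the single article filed directly under the broad category, while a depth of 2 starts to reach the genre categories that actually hold articles. The `seen` set matters because wiki categories can form cycles.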

Event Timeline


I'm afraid this is not very feasible, because categories can be very deeply nested and we have no efficient way to get members of subcategories (and subsubcategories, etc) right now.

Probably the only ways we could do this performantly are:

  • Maintaining a separate store for this purpose, e.g. in Redis (either for all categories or for a limited number).
  • Changing the DB schema (unlikely)


That nesting issue is known. That's why most category-searching tools have a parameter to limit the search depth. Beyond that depth, filtering is irrelevant. Would that help?

jmatazzoni renamed this task from Improve Special:RecentChangesLinked to filter categories and sub-categories to Explore how to make a category search that includes sub-categories. · Jun 16 2017, 10:44 PM
jmatazzoni updated the task description.


CategoryFinder is implemented as a post-processing step: https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/specials/SpecialRecentchanges.php;d4f51b7881e801362eac46516f0f7e312a8066d8$853

Either with that (or something similar) or with the Redis-based solution, we'll have the issue of limit mismatches, because AFAIK a search over descendant categories cannot be done in a single query.

For example, assume there are thousands of edits in the recentchanges table to articles nested within Category:History of the United States. You request 500 results on Special:RecentChanges.

Because the DB cannot implement this in a single query, you get 500 results (including edits to articles that are not within that category at all). Then, CategoryFinder runs, and gives you a small fraction of that. You then either:

  1. Get that small fraction as your final result set, or
  2. Have to keep re-running the DB query until you get 500 or exhaust the table.
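
The second option can be sketched roughly as follows. This is a simplified illustration, not the CategoryFinder code: `fetch_filtered`, the list slicing that stands in for a DB query, and the 1-in-10 match rate are all assumptions made up for the example.

```python
def fetch_filtered(changes, in_category, limit, batch_size):
    """Collect up to `limit` matching changes by re-querying in batches.

    `changes` stands in for the recentchanges table (newest first) and
    `in_category` for the post-processing category check.
    """
    results = []
    offset = 0
    queries = 0
    while len(results) < limit and offset < len(changes):
        batch = changes[offset:offset + batch_size]  # stands in for one DB query
        queries += 1
        offset += len(batch)
        results.extend(c for c in batch if in_category(c))
    return results[:limit], queries


# Hypothetical data: 1 in 10 recent changes falls inside the category tree.
changes = list(range(1000))
in_tree = lambda c: c % 10 == 0

# Option 1: a single query of 20 rows yields only 2 matches.
single_pass = [c for c in changes[:20] if in_tree(c)]

# Option 2: keep querying until 5 matches are found (takes 3 queries here).
page, queries = fetch_filtered(changes, in_tree, limit=5, batch_size=20)
```

The trade-off is visible even in the toy case: option 1 returns a short page, while option 2 returns a full page at the cost of an unpredictable number of queries, up to a full scan of the table when matches are rare.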

The same issue applies if it's implemented with Redis. To clarify, the Redis idea is to have a Redis set for each category, containing all articles directly or indirectly within that category. This would be maintained after each edit, similar to what GettingStarted does (although GettingStarted does not handle nesting).
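
As a rough sketch of the set-per-category idea, the snippet below uses plain Python sets in place of Redis sets so that it runs without a server. The `PARENTS` map, `add_page`, and `filter_changes` are hypothetical names for illustration; the comments note which Redis set command each step would correspond to.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: category -> direct parent categories.
PARENTS = {
    "Anarcho-punk groups": ["Punk rock groups"],
    "Punk rock groups": ["Rock music groups by genre"],
    "Rock music groups by genre": [],
}

# Stands in for one Redis set per category.
members = defaultdict(set)


def ancestors(cat):
    """Return `cat` plus every category above it (cycle-safe)."""
    seen, stack = set(), [cat]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(PARENTS.get(c, []))
    return seen


def add_page(page, cat):
    """Record that `page` was filed under `cat`, updating all ancestor sets."""
    for c in ancestors(cat):
        members[c].add(page)  # would be SADD in Redis


def filter_changes(changed_pages, cat):
    """Keep only changes to pages directly or indirectly within `cat`."""
    return [p for p in changed_pages if p in members[cat]]  # would be SISMEMBER
```

This sketch only handles additions. Removing a page from a category, or moving a category under a different parent, would require recomputing the affected sets, which is part of what makes the maintenance cost hard to estimate.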

The query/limit issue is not a deal-breaker, but we have to decide how to handle it.

We've had similar issues with Flow.

Also, we need to confirm with Ops about Redis persistence (from T158239: Improve GettingStarted data storage strategy, "Persistency on redis (and replication) has always been guaranteed as best-effort, which doesn't seem to be great in this case.") We may need a new instance.

jmatazzoni claimed this task.

So it sounds like we've thought about this and see it as kind of a big deal to fix. Is that fair? @Mattflaschen-WMF, @Catrope, I'm closing this, right? Or is there something else you want to think about? (Please re-open if you think we have another idea to explore.)

Yes, I would say doable, but there are both product (500/limit stuff above, plus whether to limit nesting) and technical issues that make it kind of a big deal.

jmatazzoni added a subscriber: Pginer-WMF.

I'm reopening this to ask about some possible strategies, or work-arounds, for this. @Catrope and @Mattflaschen-WMF, would any of these approaches be possible? (I don't yet know how the UX would work for these, but hypothetically):

  • When we list categories in the categories menu (T163433), would it be possible to also show the page and subcategory counts next to the name, as happens on category pages all over the wiki? E.g.: Rock Music Groups (1 p, 51 c). Once users got used to this, they might start realizing that categories such as this one really don't have articles in them.
  • What if we just always searched the category the user specifies as well as all categories one level down? I.e., if I searched the category Rock music groups by genre, I'd see changes made to articles in any of the 51 genre subcategories. Still too much?
  • Alternatively, would something like the following be possible? Imagine the user searches the broad categories Rock Music Groups and Punk Rock Groups. We return the small number of relevant article results that might appear, but also display a box at the top of the page saying:

The categories you searched contain the following subcategories, which you may wish to add to your search (check box to add):

Punk Rock Groups

  • [ ] Punk rock groups by nationality (50 C), [ ] Anarcho-punk groups (1 C, 127 P), [ ] Christian punk groups (1 C, 49 P), [ ] Dance-punk musical groups (82 P), etc.

Rock Music Groups by Genre

  • [ ] American rock music groups by genre (23 C), [ ] Australian rock music groups by genre (14 C), [ ] British rock music groups by genre (22 C), etc.

> When we list categories in the categories menu (T163433), would it be possible to also show the page and subcategory counts next to the name, as happens on category pages all over the wiki? E.g.: Rock Music Groups (1 p, 51 c). Once users got used to this, they might start realizing that categories such as this one really don't have articles in them.

Yes, but not sure if that communicates the subcategory issue to the user (may depend if they're a power user).

> What if we just always searched the category the user specifies as well as all categories one level down? I.e., if I searched the category Rock music groups by genre, I'd see changes made to articles in any of the 51 genre subcategories. Still too much?

We need to figure out how many categories we can practically combine into one query. This is relevant both here and, more generally, to the statement "This means that when multiple categories are added to a search, each broadens the search, since all Category filters relate to one another via an OR."

https://en.wikipedia.org/wiki/Category:People_by_nationality has 251 sub-categories. There are probably non-obscure categories with more.

Even if we accept a large number of categories explicitly specified by the user (perhaps they can judge for themselves what is too slow, and reduce the number), this automatic option could potentially cause unintuitive slowdown.

> Alternatively, would something like the following be possible? Imagine the user searches the broad categories Rock Music Groups and Punk Rock Groups. We return the small number of relevant article results that might appear, but also display a box at the top of the page saying:

Yes, but see above about limits (could also have UI issues with hundreds of categories).

> We need to figure out how many categories we can practically combine into one query

They can be done in one SQL query just by adding to the cl_to clause, but I don't know when we'll start seeing issues. Also, I think this will cause the same issue (and need the same solution) as T168501: When querying for multiple tags, and more than one is in the edit, duplicate results.
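
To illustrate the duplicate-result concern, here is a small self-contained sketch using SQLite. The schema is heavily simplified (only `rc_id`, `rc_cur_id`, `cl_from`, and `cl_to`), the rows are made up, and `DISTINCT` is shown as one possible way to collapse duplicates, not necessarily the solution chosen for T168501.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the MediaWiki tables; most columns omitted.
    CREATE TABLE recentchanges (rc_id INTEGER PRIMARY KEY, rc_cur_id INTEGER);
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    INSERT INTO recentchanges VALUES (1, 100), (2, 200), (3, 300);
    -- Page 100 is in both searched categories; page 300 is in neither.
    INSERT INTO categorylinks VALUES
        (100, 'Punk_rock_groups'),
        (100, 'Anarcho-punk_groups'),
        (200, 'Punk_rock_groups');
""")

cats = ["Punk_rock_groups", "Anarcho-punk_groups"]
placeholders = ", ".join("?" for _ in cats)

# Plain join with several categories in the cl_to clause:
# change 1 is returned once per matching category.
joined = conn.execute(
    "SELECT rc_id FROM recentchanges "
    "JOIN categorylinks ON cl_from = rc_cur_id "
    f"WHERE cl_to IN ({placeholders}) ORDER BY rc_id",
    cats,
).fetchall()

# Same query with DISTINCT: each change appears once.
deduped = conn.execute(
    "SELECT DISTINCT rc_id FROM recentchanges "
    "JOIN categorylinks ON cl_from = rc_cur_id "
    f"WHERE cl_to IN ({placeholders}) ORDER BY rc_id",
    cats,
).fetchall()
```

Note that duplicates also interact with the limit problem discussed earlier: if the limit is applied before duplicates are collapsed, a page of 500 rows can contain fewer than 500 distinct changes.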