
[epic] Subcategory searching
Closed, Resolved · Public

Description

Problem:

Expanding a category into its subcategories often produces a set that is too large to be used effectively as the basis for search results.

Suggestion:

We set a limit on how many subcategories we support. If that limit is hit, we inform the user that the category was too unspecific, which is why we can't return anything. When counting the subcategories, loops should be detected and not counted.
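A minimal sketch of the counting described above, assuming the category tree is available as an in-memory mapping from each category to its direct subcategories. The names `count_subcategories` and `TooManySubcategories` and the graph layout are hypothetical and only illustrate the idea; they are not the actual implementation.

```python
from collections import deque


class TooManySubcategories(Exception):
    """Hypothetical error: the category is too unspecific to search."""


def count_subcategories(graph, root, limit):
    """Count distinct subcategories below `root`, visiting each only once.

    `graph` maps a category name to a list of its direct subcategories
    (an assumed data layout). Categories reached again through a loop or
    a second path are skipped, so they are not counted twice.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, ()):
            if child in seen:
                continue  # loop or duplicate path: do not count again
            seen.add(child)
            if len(seen) - 1 > limit:  # the root itself is not a subcategory
                raise TooManySubcategories(
                    f"'{root}' has more than {limit} subcategories"
                )
            queue.append(child)
    return len(seen) - 1
```

For example, with `graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"]}` the loop from B back to A and the second path to D are ignored, so `count_subcategories(graph, "A", 10)` counts B, C and D once each.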

Reasoning / Background info:

  • We don't want to return incomplete results
  • We need to set a limit
    • Elastic has a max of 1,024 conditions (categories) that it can have in any query
      • i.e., if we're searching for 1,000 categories, there are 1,000 conditions
      • Elastic searches categories breadth first and then depth
    • But the search string itself also takes up some of those conditions
      • i.e., the more complex the query, the fewer conditions (categories) remain that we can search for
    • One option for paring things down: look in the database first for categories that have no pages associated with them (empty) and leave them out of the query, so that we can hopefully return more useful results.
  • We want to notify users if the category hit the limit (WMDE UI component)
    • exclude loops in the category tree
    • exclude empty categories from the result list
    • have a deep cat keyword for search
    • The search query building probably needs to be a combination of API and curl
  • How many empty categories are there?
    • enwiki - 1.5million total categories, 400K are empty (~25% are empty)
  • Can empty categories be excluded?
    • Yes, kind of. We can do daily or weekly dumps of the categories for searching on. It takes about 3 hours to update the database and it's unusable until it's complete. We don't have the ability to do real time database updates.
  • Links to the WMDE catwatch project:
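The clause budget described in the list above comes down to simple arithmetic. The sketch below assumes that the number of conditions used by the search string is already known; `remaining_category_budget` and its parameters are hypothetical names, not part of any existing API.

```python
MAX_CLAUSES = 1024  # maximum number of conditions per query, as quoted above


def remaining_category_budget(query_clauses, max_clauses=MAX_CLAUSES):
    """Return how many category conditions still fit into one query.

    `query_clauses` is the number of conditions already used by the
    search string itself (assumed to be known to the caller).
    """
    if query_clauses < 0:
        raise ValueError("query_clauses must not be negative")
    return max(0, max_clauses - query_clauses)
```

A search string that already uses 24 conditions would leave room for 1,000 categories; a string that uses all 1,024 leaves none.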

Action items:

  • We need to set a limit.
    • Let's start with using 800 categories as a limit
      • 256 per category is the implemented default, since you can also use deepcat multiple times in the same query
    • (New) We also implemented a depth limit of 5 to get more reasonable results
  • Apply empty category filter (ignore them)
  • Apply the limit of category counts
  • Use the daily database dump to search on
    • Users can’t use the search while the dump is loading
      • Stas will investigate whether we can minimize the time during which users can't search because the database is locked
  • Create an API and cURL combination for the keyword creation
    • WMF will do this work and it should take a couple of weeks
      • goal is to be done by end of January 2018
    • the keyword will have a UI component
      • WMDE will complete the frontend work to expose it
  • Handover between the Discovery team and Technical Wishes team
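As an illustration of how the action items above could fit together, here is a hedged sketch of a breadth-first expansion that applies the category limit, the depth limit, and the empty category filter in one pass. The function name, the `graph` and `page_counts` inputs, and the choices to always include the starting category and to keep traversing through empty categories are assumptions made for this example, not a description of the deployed code.

```python
from collections import deque


def expand_deepcat(graph, page_counts, root, max_categories=256, max_depth=5):
    """Return (categories, limit_hit) for a breadth-first expansion of `root`.

    graph:       category -> list of direct subcategories (assumed layout)
    page_counts: category -> number of pages in it (assumed to come from
                 the periodic database dump)
    """
    seen = {root}
    result = [root]  # assumption: the starting category is always searched
    queue = deque([(root, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth limit reached: do not descend further
        for child in graph.get(current, ()):
            if child in seen:
                continue  # loop in the category tree
            seen.add(child)
            queue.append((child, depth + 1))
            if page_counts.get(child, 0) == 0:
                continue  # empty category: traversed, but not searched
            if len(result) >= max_categories:
                return result, True  # limit hit: caller should notify the user
            result.append(child)
    return result, False
```

The second return value tells the caller whether the limit was hit, so that a UI can show a notice instead of silently presenting a cut-off list.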

Event Timeline


@Smalyshev Some open questions :)

  • What is the error that is returned, if the category limit is hit?
  • Can you do multiple deepcategory searches in the same query? E.g. "deepcategory:A deepcategory:B"
    • If yes, will the same category limit error be returned when A and B both have 700 subcategories?

What is the error that is returned, if the category limit is hit?

The error is not implemented yet. Right now it just cuts the list off. I'll add the error message in the query soon.

Can you do multiple deepcategory searches in the same query?

You can use multiple keywords, I think, but one keyword right now is one category.

will the same category limit error be returned when A and B both have 700 subcategories?

Right now the limit applies to a single instance of the keyword, so if you repeat it, you can get over the limit (which will probably still fail the query if it reaches the ElasticSearch limit). Not sure if there's a way to limit multiple invocations of the same keyword. Maybe we will need a more systemic clause counting facility then. Will look into it.
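One possible shape for such a clause counting facility, purely as a sketch: a single budget object is shared by every keyword instance in a query, and each instance reserves its clauses before being added. `ClauseBudget` and `ClauseBudgetExceeded` are hypothetical names and do not refer to existing code.

```python
class ClauseBudgetExceeded(Exception):
    """Hypothetical error: the query as a whole needs too many clauses."""


class ClauseBudget:
    """Tracks clauses used across all keyword instances in one query."""

    def __init__(self, max_clauses=1024):
        self.max_clauses = max_clauses
        self.used = 0

    def reserve(self, count):
        """Reserve `count` clauses, or raise without changing the total."""
        if self.used + count > self.max_clauses:
            raise ClauseBudgetExceeded(
                f"{self.used + count} clauses requested, "
                f"limit is {self.max_clauses}"
            )
        self.used += count
```

With one shared budget, two keywords that each expand to 700 categories would fail on the second `reserve(700)` call, before the query is sent, instead of failing inside the search backend.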

Smalyshev claimed this task.

I think this is mostly done. Remaining sub-issues can be handled independently.