Often, the search for subcategories produces result sets that are too big to be used effectively as a basis for search results.
We set a limit on how many subcategories we support. If that limit is hit, we inform the user that the category was too unspecific, which is why we can't return anything. When calculating how many subcategories there are, loops should be detected and not counted.
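The counting rule above (a limit, with loops detected and not counted) could be sketched as follows. This is only an illustration under assumptions: the `children` mapping is a hypothetical in-memory stand-in for the real category store, the names `collect_subcategories` and `TooManySubcategories` are made up for this example, and counting the root category toward the limit is a choice of this sketch rather than something decided in this task.

```python
from collections import deque


class TooManySubcategories(Exception):
    """Raised when a category tree exceeds the supported limit."""


def collect_subcategories(root, children, limit):
    """Return root plus all of its subcategories, walking breadth first.

    `children` maps a category name to its direct subcategories
    (hypothetical stand-in for the real category store).
    """
    seen = {root}              # categories already counted
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child in seen:
                continue       # loop or shared subcategory: count it once
            seen.add(child)
            if len(seen) > limit:
                # The caller can turn this into the "too unspecific" message.
                raise TooManySubcategories(root)
            queue.append(child)
    return seen


# Small example tree; "Physics" links back to "Science", forming a loop.
tree = {
    "Science": ["Physics", "Chemistry"],
    "Physics": ["Optics", "Science"],
    "Chemistry": ["Optics"],
}
```

Calling `collect_subcategories("Science", tree, limit=10)` returns the four distinct categories despite the loop, while `limit=3` raises `TooManySubcategories`, so no incomplete result is ever returned.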
= Reasoning / Background info:
- We don't want to return incomplete results
- We need to set a limit
-- Elastic has a max of 1,024 conditions (categories) that it can have in any query
--- i.e., if we're searching for 1,000 categories, there are 1,000 conditions
--- Elastic searches categories breadth first and then depth
-- But the search string itself also adds conditions that count toward that maximum
--- i.e., the more complex the query is, the fewer conditions (categories) remain that we can search for
-- One option for paring things down: first look in the database for categories that have no pages associated with them (empty) and exclude them from the query results, so that we can hopefully return more useful results.
- We want to notify users if the category hits the limit (WMDE UI component)
-- exclude loops in the category tree
-- exclude empty categories from the result list
-- have a deep cat keyword for search
-- The search query building probably needs to be a combination of API and curl
- How many empty categories are there?
-- enwiki: 1.5 million total categories, 400K of which are empty (~25%)
- Can empty categories be excluded?
-- Yes, kind of. We can do daily or weekly dumps of the categories to search on. It takes about 3 hours to update the database, and the database is unusable until the update is complete. We don't have the ability to do real-time database updates.
- Links to the WMDE catwatch project:
-- user facing docs: https://www.mediawiki.org/wiki/Manual:CategoryMembershipChanges
-- job that detects the changes: https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/jobqueue/jobs/CategoryMembershipChangeJob.php
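The condition budget described above can be written out as simple arithmetic. This is a sketch, not the actual query builder: the function names are hypothetical, and how many conditions a given search string uses is an input that would have to come from the real query, which is not specified here.

```python
MAX_CONDITIONS = 1024  # per-query maximum from the notes above


def remaining_category_budget(search_conditions, max_conditions=MAX_CONDITIONS):
    """Conditions left for categories once the search string has taken its share."""
    return max(max_conditions - search_conditions, 0)


def fits_in_query(category_count, search_conditions):
    """True if every category can get its own condition in the query."""
    return category_count <= remaining_category_budget(search_conditions)
```

For example, if the search string used 24 conditions (an assumed figure), 1,000 conditions would remain for categories, so a tree of 1,001 categories would no longer fit.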
= Action items:
- We need to set a limit.
-- ~~Let's start with using 800 categories as a limit~~ 256 per category is the implemented default, since you can also use deepcat multiple times in the same query
-- (New) We also implemented a depth limit of 5 to get more reasonable results
- Apply empty category filter (ignore them)
- Apply the limit of category counts (800)
- Use the daily database dump to search on
-- Users can’t use the search while the dump is loading
--- Stas will investigate whether we can minimize the time during which users cannot search because the database is locked
- Create an API and cURL combination for the keyword creation
-- WMF will do this work, and it should take a couple of weeks
--- goal is to be done by end of January 2018
-- the keyword will have a UI component
--- WMDE will complete the frontend work to expose it
- Handover between the Discovery team and Technical Wishes team
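A minimal sketch of how the category limit, the depth limit, and the empty-category filter from the action items might combine. Everything here is illustrative: `deepcat_categories`, `children`, and `page_counts` are hypothetical names and in-memory stand-ins, and it is an assumption of this sketch that empty categories are still traversed (their subcategories may contain pages) but do not count toward the limit.

```python
from collections import deque

MAX_CATEGORIES = 256  # implemented default from the action items
MAX_DEPTH = 5         # implemented depth limit from the action items


def deepcat_categories(root, children, page_counts,
                       max_categories=MAX_CATEGORIES, max_depth=MAX_DEPTH):
    """Return non-empty categories within max_depth of root, or None if too many.

    `children` maps a category to its direct subcategories and
    `page_counts` maps a category to its number of pages (both hypothetical).
    """
    seen = {root}                   # guards against loops in the tree
    queue = deque([(root, 0)])
    result = []
    while queue:
        category, depth = queue.popleft()
        if page_counts.get(category, 0) > 0:   # skip empty categories
            result.append(category)
            if len(result) > max_categories:
                return None         # caller notifies the user instead
        if depth == max_depth:
            continue                # do not descend below the depth limit
        for child in children.get(category, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return result
```

Returning `None` rather than a truncated list keeps with the goal of never returning incomplete results; the UI component can use it to show the notification.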