
Project Typeahead queries with less than four letters return random unrelated results
Closed, Resolved (Public)

Description

I was originally poking at T360701 (Autocomplete project proposals in Maniphest search form don't always offer "any" and "not" options), but at some point I wanted a more self-contained test case.

In my local test instance, I created two projects named LuaSandbox and Wikibase-Lua.

When typing Lua into a Projects field in the Maniphest search, I'd expect prefix search (self::PHASE_PREFIX) to find LuaSandbox and then token search to find Wikibase-Lua.
And that's what my local instance does for http://phorge.localhost/typeahead/class/?class=PhabricatorProjectDatasource&q=lua
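The expected two-phase behavior can be modeled in a few lines. This is only a rough Python sketch of the idea, not the actual PHP datasource code: the function names, the tokenization rule (split on non-alphanumeric characters), and the in-memory project list are all assumptions made for illustration.

```python
import re

# The two projects from the local test instance.
PROJECTS = ["LuaSandbox", "Wikibase-Lua"]


def tokenize(name):
    # Assumed tokenization: lowercase, split on anything non-alphanumeric.
    return [t for t in re.split(r"[^0-9a-z]+", name.lower()) if t]


def prefix_phase(query, names):
    # Phase 1: the whole project name starts with the typed text.
    q = query.lower()
    return [n for n in names if n.lower().startswith(q)]


def token_phase(query, names):
    # Phase 2: every typed token is a prefix of some token of the name.
    q_tokens = tokenize(query)
    return [
        n for n in names
        if all(any(t.startswith(q) for t in tokenize(n)) for q in q_tokens)
    ]


def typeahead(query, names):
    # Prefix matches come first; token matches are appended without duplicates.
    results = prefix_phase(query, names)
    for name in token_phase(query, names):
        if name not in results:
            results.append(name)
    return results


print(typeahead("lua", PROJECTS))
```

With the input lua, the prefix phase yields LuaSandbox and the token phase adds Wikibase-Lua, which matches what the local instance returns.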

Our production instance only behaves correctly for the prefix search.
Afterwards, the token search throws 100 random projects at us.
Comparing the executed SQL queries shows that production takes a completely different turn than my local instance.
It basically looks like production skips the PhabricatorProjectDatasource::withNameTokens() query _completely_ (I assumed that maybe $tokens is null?), so $projs = $this->executeQuery($query); is executed with no parameters in PhabricatorProjectQuery.

Comparing the involved source code files, I found the custom downstream changes https://phabricator.wikimedia.org/rPHAB9e9b2d958736c0fc39776a284890e3c78cd98095 and https://phabricator.wikimedia.org/rPHABb5bcd9123b502249d0d4931a558efa594ed7b984, made for T150965: phabricator close to saturate its database connections.
That code removes any input token with fewer than 4 letters, like lua, and then runs a query with no conditions for projects.
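To make the failure mode concrete, here is a small Python model of what the downstream filter appears to do. It is a sketch under assumptions, not the real PHP: only the strlen($token) > 3 check, the early return without a constraint, and the limit of 100 come from this task; the function name, the min_length parameter, and the sample project names are hypothetical.

```python
def token_query(raw_tokens, names, min_length=4, limit=100):
    # Model of the downstream filter: keep only tokens of at least
    # min_length characters (equivalent to strlen($token) > 3 by default).
    filtered = [t for t in raw_tokens if len(t) >= min_length]

    if not filtered:
        # Models the early "return $this;": no constraint is added,
        # so the query matches every project up to the limit.
        return names[:limit]

    # Otherwise, every remaining token must occur in the project name.
    return [
        n for n in names
        if all(t.lower() in n.lower() for t in filtered)
    ][:limit]


# Hypothetical project list for illustration.
names = ["LuaSandbox", "Wikibase-Lua", "Unrelated-A", "Unrelated-B"]

print(token_query(["lua"], names))                # unconstrained: all projects
print(token_query(["lua"], names, min_length=2))  # only the two Lua projects
```

The three-letter token is dropped, the filtered list is empty, and the query falls through to "all projects", which is the behavior seen in production.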

Details

Related Changes in GitLab:
Title: Adjust numerous custom search query length restrictions
Reference: repos/phabricator/phabricator!105
Author: aklapper
Source Branch: T407003
Dest Branch: wmf/stable

Event Timeline

Aklapper triaged this task as Medium priority.

The downstream changes were put in place because of long running global fulltext searches.
The Global Search (in my limited understanding) queries all typeahead sources, and Projects is only one of those sources.

So I'd say putting a custom patch just for Projects is not the best approach.

In upstream rPHAB61ce56ff I restricted the number of search tokens to 61. This is deployed on our Wikimedia instance.
If global fulltext search was the reason for this code, then a one-line change to lower the return value of PhutilSearchQueryCompiler::getMaxQueryTokens() makes more sense.
That might be alright now that we moved e.g. the Phatality search from the Global Search to the Maniphest Search in T398305: Query Maniphest Advanced Search instead of Phabricator Primary Global Search.
However, I have no clue which other external stuff may query our global search with lots of search tokens.

In any case, the current downstream check if (count($filtered) > 5) makes no sense in a project typeahead context (as opposed to a global fulltext search context), because the typeahead search interprets the whole input as project.name LIKE 'fooo barr kdlf lkdj wejh ktjg bleh%' in SQL. No tokenization.

And the current if (strlen($token) > 3) { $filtered[] = $token; } is a problem when it leaves no tokens at all, because the code then ends up at return $this; with the query unchanged.
FYI if we really wanted to properly drop query terms based on short length, then the right place to do that is in PhutilSearchQueryCompiler::tokenizeQuery() (but do note the use of phutil_utf8_is_cjk(), plus would have to take into account the substring operator ~ here).

So in summary, I think we should

  • either reduce or remove the custom if (strlen($token) > 3) { $filtered[] = $token; } check to mitigate the problem of useless search results (e.g. so that they can only appear for single-letter typeahead input; that may still trash fulltext search for a single letter, which is acceptable),
    • or introduce code in PhutilSearchQueryCompiler to drop very short tokens
  • potentially remove the custom if (count($filtered) > 5) condition, which special-cases only project search for no reason (though it does work: e.g. when entering seven tokens in the "Name" field on /project/query/advanced/, the last ones are ignored, as the SQL query shows)
  • likely lower the general threshold for the number of search tokens in freetext search in PhutilSearchQueryCompiler::getMaxQueryTokens() from 61 to something arbitrary like 16 (Logstash will tell us how often searches break with an error).
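The compiler-side options in this list could look roughly like the following. This is a Python sketch of the idea only, not a patch against PhutilSearchQueryCompiler: the minimum token length of 2, the choice to exempt CJK tokens and ~ substring tokens from the length check, the approximate CJK ranges, and all names here are assumptions for illustration.

```python
MAX_QUERY_TOKENS = 16  # the "something arbitrary like 16" from the list above
MIN_TOKEN_LENGTH = 2   # hypothetical threshold for "very short" tokens


class TooManyTokens(ValueError):
    """Raised when a query has more tokens than the configured maximum."""


def is_cjk(text):
    # Rough approximation of a CJK check (not phutil_utf8_is_cjk() itself):
    # Hiragana/Katakana, CJK Unified Ideographs, Hangul syllables.
    return any(
        0x3040 <= ord(c) <= 0x30FF
        or 0x4E00 <= ord(c) <= 0x9FFF
        or 0xAC00 <= ord(c) <= 0xD7AF
        for c in text
    )


def compile_tokens(query):
    tokens = query.split()

    # Fail loudly on overly long queries instead of silently truncating.
    if len(tokens) > MAX_QUERY_TOKENS:
        raise TooManyTokens(
            "query has %d tokens, maximum is %d"
            % (len(tokens), MAX_QUERY_TOKENS)
        )

    kept = []
    for token in tokens:
        # Assumption: substring-operator tokens and CJK tokens are kept
        # regardless of length; other tokens must meet the minimum.
        if token.startswith("~") or is_cjk(token) or len(token) >= MIN_TOKEN_LENGTH:
            kept.append(token)
    return kept


print(compile_tokens("a lua ~x 水"))
```

Here the single Latin letter is dropped while lua, the ~x substring token, and the one-character CJK token survive. Whether an empty result should then mean "no results" or "no constraint" is a separate decision that the caller has to make explicitly.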

After doing some changes, I'd like to revisit T360701 which may not be fully fixable as we may have >102 #Wikibase# projects so search result pagination interferes, as kinda described by Evan in https://secure.phabricator.com/T8510#180681.

Current behavior is both sad and funny:

Screenshot From 2025-10-10 21-11-06.png (444×673 px, 21 KB)

I considered allowing a different number of search tokens depending on whether the user is logged in, but it's impossible to reliably set/pass $this->getViewer() in all PhutilSearchQueryCompiler constructors to PhutilSearchQueryCompiler::getMaxQueryTokens() according to my local testing.

This got deployed today.
I verified the fix: https://phabricator.wikimedia.org/typeahead/class/?class=PhabricatorProjectDatasource&q=lua now shows 4 results instead of 102.