Autocomplete on exact matches is overly case sensitive
Open, Needs Triage, Public

Description

Test case from en.m.wikipedia.org: first, run an autocomplete search for exa. The top result is Metric prefix (redirect from exa), and EXA doesn't make the cut. Then run an autocomplete search for EXA. The top result is EXA, and Metric prefix now doesn't make the cut.

The general problem here is that the Elasticsearch completion suggester currently cannot give any additional score for an exact match. We instead depend on a MySQL lookup for exact matches, but that lookup is case sensitive, which results in the behavior above.

Proposal is to:

  • Append a reserved character to the end of all titles in the completion suggester index. Adding an extra character at the end of all titles should have no effect on the existing autocomplete queries, but this should be double-checked.
  • Add an additional query to our existing search that completes the user's query with this reserved character appended. This query will only match full titles and should receive a significant score boost over the other issued completion queries.
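
The two steps above can be sketched in plain Python without touching Elasticsearch. This is only a toy model of the idea, not the CirrusSearch implementation: the marker character, the boost value, the function names, and the use of lower() to stand in for the suggester's case folding are all assumptions made for illustration.

```python
# Toy model of the proposal. Everything here is hypothetical:
# the marker, the boost, and lower() as a stand-in for the analyzer.
RESERVED = "\u001f"  # assumed reserved character (ASCII unit separator)


def build_index(titles):
    """Store each title with a case-folded key that ends in the marker."""
    return [(title, title.lower() + RESERVED) for title in titles]


def complete(index, query, exact_boost=10.0):
    """Run the normal prefix query plus the marker-terminated query."""
    folded = query.lower()
    scores = {}
    for title, key in index:
        # Existing completion query: plain prefix match, unaffected by
        # the marker because the marker sits at the very end of the key.
        if key.startswith(folded):
            scores[title] = 1.0
        # Additional query: user input plus marker, which can only match
        # a full title, so it gets the extra score.
        if key.startswith(folded + RESERVED):
            scores[title] = scores.get(title, 0.0) + exact_boost
    # Highest score first; ties broken by title for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))


if __name__ == "__main__":
    idx = build_index(["EXA", "Exa", "Example", "Exact sequence"])
    print(complete(idx, "exa"))
    print(complete(idx, "EXA"))
```

In this model both exa and EXA return the same ordering, with every case variant of the full title ranked above the longer prefix matches.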

Concerns:

  • I wouldn't be surprised if we have pages like aaa, aAA, aAa, AAa, etc. Yielding a large set of case variants of the user's query might be confusing, or simply odd.

We can test this, prior to writing any code, by using Elasticsearch scripted reindexing. We can have the script append our reserved character to every title during a reindex (probably on relforge, after using a snapshot to copy the index over), and then issue manually adjusted queries.
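
As a rough sketch of what such a scripted reindex request could look like, the snippet below builds a body for the Elasticsearch _reindex API with a Painless script. The index names, the title field, and the marker are placeholders, not the real completion suggester mapping, which would need the script adapted to its actual fields.

```python
import json

# All names below are hypothetical placeholders for illustration.
MARKER = "\u001f"  # assumed reserved character


def reindex_body(source_index, dest_index, field="title", marker=MARKER):
    """Build a _reindex request body that appends the marker to a field."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "script": {
            "lang": "painless",
            # Append the marker to the chosen field of every document.
            "source": f"ctx._source.{field} = ctx._source.{field} + params.marker",
            "params": {"marker": marker},
        },
    }


if __name__ == "__main__":
    body = reindex_body("titlesuggest_src", "titlesuggest_marked")
    # This JSON would be sent as the body of POST /_reindex.
    print(json.dumps(body, indent=2))
```

Passing the marker through params rather than embedding it in the script source keeps the script text identical across runs if a different character is tried.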