While playing with suggesty we identified obvious precision problems. Some of them are easily fixable, but those related to the scoring formula are difficult to address without more data (page view statistics).
1. Lower score for suggestions based on redirects that do not start with the same letter (easy fix).
The strategy used today tries to display the title instead of the redirect (when the redirect is "similar"):
- Albert Enstein will suggest Albert Einstein
- Girafe will suggest Giraffe
But it's quite confusing when the typo is in the first letter:
- O will suggest Australia because Ostralia is a close redirect to Australia
- Er will suggest Iraq because Eraq is a close redirect to Iraq
The proposed fix is to remove these redirects from title suggestions and add them to redirect suggestions, which will have lower scores.
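The rule above can be sketched as follows. This is only an illustration: the function names, the use of Levenshtein distance as the "similar" test, and the threshold of 2 are assumptions, not the actual suggesty implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def suggestion_kind(title: str, redirect: str, max_distance: int = 2) -> str:
    """Return "title" if the redirect may be displayed as the title,
    or "redirect" if it should go to the lower-scored redirect suggestions.

    max_distance is a hypothetical threshold for "similar".
    """
    t, r = title.casefold(), redirect.casefold()
    if not t or not r:
        return "redirect"
    similar = edit_distance(t, r) <= max_distance
    same_first_letter = t[0] == r[0]
    return "title" if similar and same_first_letter else "redirect"
```

With this rule, Albert Enstein still maps to the title Albert Einstein, while Ostralia and Eraq are kept as redirect suggestions.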
2. Very high scores
Some pages have very high scores because:
- they are massively linked in footers or headers:
  - IP Address
  - Copyright infringement
- Digimon: unknown reason, will have to investigate.
- List pages: large pages that link to each other
- Some dates: incoming_link
This is mostly due to the initial scoring function. We expect that adding page view statistics to the score formula will help mitigate these problems.
3. Fuzzy weirdness (easy fix)
Today we activate fuzzy suggestions if the user query is >= 3 characters long, but we allow the first character to be a typo. We could limit weirdness by allowing a typo in the first character only if the user has typed more than 4 characters.
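A minimal sketch of the proposed rule, assuming a hypothetical fuzzy_policy helper; the dictionary keys are illustrative names and not settings of the real suggester:

```python
def fuzzy_policy(query: str) -> dict:
    """Decide how fuzzy matching applies to a user query.

    "fuzzy": whether fuzzy suggestions are activated at all.
    "protected_prefix": number of leading characters that must match
    exactly (0 means the first character may itself be a typo).
    Both key names are hypothetical.
    """
    length = len(query)
    fuzzy = length >= 3             # current activation threshold
    if not fuzzy:
        # No fuzziness: every typed character must match exactly.
        return {"fuzzy": False, "protected_prefix": length}
    # Proposed change: tolerate a first-character typo only when the
    # user has typed more than 4 characters.
    protected = 0 if length > 4 else 1
    return {"fuzzy": True, "protected_prefix": protected}
```

Under this policy a short query such as Er stays exact, a 3 or 4 character query is fuzzy but keeps its first character fixed, and only longer queries may have a typo in the first character.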
4. Scoring weirdness
Searching for white house will suggest white house farm murders first and not white house. Again, this is mostly due to our initial scoring algorithm: White house farm murders is a long page (longer than white house), is flagged as a quality page and has a lot of external links.
We hope that page view statistics will help to display more "obvious results".
5. Exact match first (question)
Should exact matches be ranked first? It is extremely difficult to rank the exact match first for all queries without adding new query overhead. With prefix search, ambiguous queries like "goo" do not always return the exact match first, but prefix search returns the exact match more often than the completion suggester does.
If we want to fix this issue, solutions could be:
- include the suggestion length in the score (e.g. discount suggestions based on their length); short suggestions will tend to be displayed more often (extremely hard to predict and tune)
- hack the suggestions by adding an invisible char at the end of each suggestion: searching for cpu will actually search for cpu_ and force all other suggestions to be fuzzy. This will certainly display the exact match first but will dramatically reduce recall.
- post-search re-scoring: if the exact match is returned by the suggester, arbitrarily rank it higher to make sure it's the first result.
I would suggest the 3rd option.
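The post-search re-scoring option could look roughly like the sketch below. The function name is hypothetical, the scores in the usage example are made up, and comparing case-insensitively is an assumption about what counts as an "exact match".

```python
def promote_exact_match(query: str, suggestions: list) -> list:
    """Move exact matches to the front of an already ranked list.

    suggestions is a list of (text, score) pairs sorted by score,
    best first. The relative order of all other entries is preserved.
    Nothing changes if the suggester did not return an exact match.
    """
    key = query.strip().casefold()
    exact = [s for s in suggestions if s[0].casefold() == key]
    others = [s for s in suggestions if s[0].casefold() != key]
    return exact + others


# Usage with made-up scores:
ranked = [
    ("White House Farm murders", 9.1),
    ("White House", 7.4),
    ("White House Down", 5.0),
]
reranked = promote_exact_match("white house", ranked)
```

This only reorders what the suggester already returned, so it adds no extra query, but it cannot help when the exact match is missing from the result set.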
6. Stopwords weirdness (easy fix)
The current stopword filtering strategy is very bad: the completion suggester fails miserably on the to be or not to be test.
Proposed fix: do not remove stopwords from the user query.
The current implementation runs a suggestion request without stopwords, which causes these weird results. We could try not removing stopwords from the user query while still removing them from the index:
- With a page named The Foo Fighters, searching for foo fighters will match.
- With a page named Foo Fighters, searching for The foo fighters won't match; we'll lose this specific use case.
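The asymmetry between index time and query time can be illustrated with a small sketch. The stopword list, the function names and the simple prefix comparison are all assumptions made for the example; the real analysis chain is more involved.

```python
STOPWORDS = {"the"}  # illustrative list, not the real stopword set


def index_key(title: str) -> str:
    """Key stored in the index: lowercased, stopwords removed."""
    tokens = [t for t in title.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)


def query_key(query: str) -> str:
    """Key built from the user query: lowercased, stopwords kept."""
    return " ".join(query.lower().split())


def matches(title: str, query: str) -> bool:
    """Prefix match of the query key against the indexed key."""
    return index_key(title).startswith(query_key(query))
```

Here matches("The Foo Fighters", "foo fighters") is true, while matches("Foo Fighters", "The foo fighters") is false, which is the use case that would be lost.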
7. Punctuation (easy fix)
Searching for U=RI (Ohm's law) won't display the correct search result. This is due to the analysis chain, which has been optimized for recall. Recall seems to have greatly improved, so we can tune this analysis chain to be a bit more "precise".
Proposed fix: switch to a custom whitespace analyzer for plain matches and keep the standard analyzer for plain_stop suggestions.
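To show why the analyzer choice matters for U=RI, here is a rough approximation of the two tokenization behaviours. The regular expression only imitates how a standard analyzer splits on punctuation, and lowercasing in the whitespace variant is an assumption about the "custom" analyzer; neither function is the real analysis chain.

```python
import re


def standard_like_tokens(text: str) -> list:
    """Approximation of a standard analyzer: split on anything that is
    not a word character, then lowercase."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def whitespace_tokens(text: str) -> list:
    """Approximation of a whitespace analyzer with lowercasing:
    punctuation stays inside the token."""
    return text.lower().split()
```

With the standard-like behaviour, U=RI and U RI produce the same tokens, so the punctuation typed by the user cannot influence the match; the whitespace variant keeps u=ri as a single token.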
8. ASCII Folding (easy fix)
Searching for ü will display results in the same order as if you searched for u, which is particularly annoying when you make the effort to type diacritics.
The proposed fix is similar to the previous one: remove ASCII folding from plain suggestions and keep it for plain_stop.
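A small sketch of the intended effect, under two assumptions: folding is approximated by stripping combining marks (real ASCII folding covers more characters), and plain matches are listed ahead of plain_stop matches. The function names are hypothetical.

```python
import unicodedata


def plain_key(text: str) -> str:
    """Key for plain suggestions: diacritics are preserved."""
    return unicodedata.normalize("NFC", text).casefold()


def folded_key(text: str) -> str:
    """Key for plain_stop suggestions: diacritics stripped
    (approximation of ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def suggest(query: str, titles: list) -> list:
    """Return plain (unfolded) prefix matches first, then titles that
    only match once both sides are folded."""
    q_plain, q_folded = plain_key(query), folded_key(query)
    plain = [t for t in titles if plain_key(t).startswith(q_plain)]
    folded_only = [t for t in titles
                   if t not in plain and folded_key(t).startswith(q_folded)]
    return plain + folded_only
```

With titles Uganda, Über and Ürümqi, a query for ü lists the two titles that really start with ü before Uganda, while a query for u lists Uganda first and still finds the others through folding.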