Change Details

While playing with [[ http://suggesty.wmflabs.org/suggest.html | suggesty ]] we identified obvious precision problems. Some of them are easily fixable, but those related to the scoring formula are difficult to address without more data (page view statistics).

**1. Lower score for suggestions based on redirects that do not start with the same letter (easy fix)**

The strategy used today tries to display the title instead of the redirect (when the redirect is //"similar"//):
* **Albert Enstein** will suggest **Albert E__i__nstein**
* **Girafe** will suggest **Gira__ff__e**

But this is quite confusing when the typo is in the first letter:
* **O** will suggest **Australia** because //Ostralia// is a close redirect to //Australia//
* **Er** will suggest **Iraq** because //Eraq// is a close redirect to //Iraq//

The proposed fix is to remove these redirects from title suggestions and add them to redirect suggestions, which will have lower scores.

**2. Very high scores**

Some pages have very high scores because:
* they are massively linked in footers or headers:
** Wikipedia
** IP Address
** Tilde
** Copyright infringement
* Digimon: unknown reason, will have to investigate.
* List pages: large pages that link to each other
* Some dates: incoming_link

This is mostly due to the initial scoring function. We expect that adding page view statistics to the score formula will help mitigate these problems.

**3. Fuzzy weirdness (easy fix)**

Today we activate fuzzy suggestions if the user query is >= 3 characters, but we allow the first character to be a typo. We could limit weirdness by allowing a typo in the first character only if the user has typed more than 4 characters.

**4. Scoring weirdness**

**white house** will suggest **white house farm murders** first and not **white house**. Again, this is mostly due to our initial scoring algorithm.
White house farm murders is a long page (longer than white house), is flagged as a quality page and has a lot of external links. We hope that page view statistics will help display more "obvious" results.

**5. Exact match first (question)**

Should we rank the //exact match// first? It is extremely difficult to rank the exact match first for all queries without adding query overhead. With prefix search, ambiguous queries like "goo" do not return the exact match first, but they return it more often than the completion suggester does. If we want to fix this issue, solutions could be:
# Include the suggestion length in the score (e.g. discount suggestions based on their length), so that short suggestions tend to be displayed more often (extremely hard to predict and tune).
# Hack the suggestions by adding an invisible char at the end of each suggestion: searching for //cpu// will actually search for //cpu_// and force all other suggestions to be fuzzy. This will certainly display the exact match first but will dramatically reduce recall.
# Post-search //re-scoring//: if the exact match is returned by the suggester, arbitrarily rank it higher to make sure it's the first result.

I would suggest the 3rd option.

**6. Stopwords weirdness (easy fix)**

The current stopword filtering strategy is very bad: the completion suggester fails miserably on the **to be or not to be** test. The current implementation runs a suggestion request without stopwords, which causes these weird results.

Proposed fix: do not remove stopwords from the user query, but still remove them from the index:
- With a page named **The Foo Fighters**, searching for **foo fighters** will match.
- With a page named **Foo Fighters**, searching for **The foo fighters** won't match; we'll lose this specific use case.

**7. Punctuation (easy fix)**

Searching for **U=RI** (Ohm's law) won't display the correct search result.
This is due to the analysis chain, which has been optimized for recall. Recall seems to be greatly improved, so we can tune this analysis chain to be a bit more "precise".

Proposed fix: switch to a custom whitespace analyzer for //plain// matches and keep the standard analyzer for //plain_stop// suggestions.

**8. ASCII folding (easy fix)**

Searching for **ü** will display results in the same order as if you searched for **u**, which is particularly annoying when you make the effort to write diacritics.

The proposed fix is similar to the previous one: remove ASCII folding from //plain// suggestions and keep it for //plain_stop//.
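
To illustrate the split proposed in items 7 and 8, here is a minimal Python sketch of two analysis chains. It is not the actual Elasticsearch configuration: `analyze_plain`, `analyze_plain_stop`, `ascii_fold` and the `STOPWORDS` set are hypothetical names invented for this example, the regex tokenizer is only a rough stand-in for the standard analyzer, and the NFKD-based folding only approximates a real ASCII folding filter.

```python
import re
import unicodedata

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of"}


def ascii_fold(text):
    """Rough stand-in for ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze_plain(text):
    """Precise chain: lowercase + whitespace tokenization.

    Punctuation and diacritics are kept, so "U=RI" and "ü" survive intact.
    """
    return text.lower().split()


def analyze_plain_stop(text):
    """Recall-oriented chain: lowercase + folding + word tokenization + stopword removal.

    The \\w+ regex approximates a tokenizer that splits on punctuation.
    """
    tokens = re.findall(r"\w+", ascii_fold(text.lower()))
    return [t for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    for query in ["U=RI", "ü", "The Foo Fighters"]:
        print(query, analyze_plain(query), analyze_plain_stop(query))
```

Under these assumptions, the //plain// chain keeps `u=ri` as a single token and leaves `ü` unfolded, so exact punctuation and diacritics can rank first, while the //plain_stop// chain still matches the folded, punctuation-free forms for recall.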