The overall goal of this task is to improve the quality and relevance of the “Did you mean…” suggestions provided to searchers on Wikipedia and other projects, across as many languages as we can (subject to time limits and language/query data availability).
Specifically, goals include:
* Analyzing the spelling mistakes people make when querying, and providing results based on the corrected spelling; and
* Improving our “Did you mean…” suggestions, which offer search options close to the inferred query intent when a query returns few or no results.
The currently identified approaches include:
* **Method 0:** Mine search logs for query + correction pairs and create an efficient method for choosing candidates to make suggestions for incoming queries based on similarity to the original query, number of results, and query frequency. Only applicable for languages/projects with sufficient search traffic.
* **Method 1:** Mine search logs for common queries and create an efficient method for choosing candidates to make suggestions for incoming queries based on similarity to the original query, number of results, and query frequency. Only applicable for languages with relatively small writing systems (alphabets, abjads, syllabaries, etc.).
* **Method 2:** Use external resources other than search logs (e.g., dictionaries with word frequencies) as a source for spelling corrections, using existing open source proximity/spell checking algorithms. Only applicable to languages with relevant linguistic resources.
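
As a rough illustration of the candidate selection described in Methods 0 and 1, the sketch below ranks logged queries by similarity to the incoming query, number of results, and query frequency. It is not the project's implementation: the names `edit_distance` and `suggest`, the `max_distance` cutoff, the scoring formula, and the sample data are all assumptions made for the example. It also scans every candidate, so a real system would need an index to make the lookup efficient.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def suggest(query, candidates, max_distance=2):
    """Return the best-scoring candidate for `query`, or None.

    `candidates` maps a logged query string to a tuple of
    (query frequency, number of results). The score is a hypothetical
    combination of the three signals, chosen only for illustration.
    """
    best, best_score = None, 0.0
    for cand, (freq, hits) in candidates.items():
        if cand == query or hits == 0:
            continue  # never suggest the query itself or a zero-result query
        dist = edit_distance(query, cand)
        if dist > max_distance:
            continue  # too dissimilar to be a plausible correction
        score = math.log1p(freq) * math.log1p(hits) / (1 + dist)
        if score > best_score:
            best, best_score = cand, score
    return best


# Made-up sample data: query -> (frequency, number of results)
sample_log = {
    "albert einstein": (5000, 1200),
    "albert einstien": (40, 0),
    "alberta": (900, 300),
}
print(suggest("albert einstien", sample_log))
```

Taking logarithms of frequency and result count keeps a single very popular query from dominating the score; whether that weighting is appropriate would need to be checked against real search data.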
Currently identified high-level phases of the project include:
* NLP contractor set up and access (T<TBD>)
* Implement Method 0 for English (T<TBD>)
* Implement Method 1 for 10 languages (T<TBD>)
* Implement Method 2 for CJK languages (T<TBD>)