"Did you mean . . ." search feature to automatically spellcheck all search terms
Open · Public

Description

Author: NathanLarson32767

Description:
It would be helpful for Wikipedia to detect typos similar to Google. For instance, if I
type in "Nathaz Larson" on Google, it completes the search but also shows a hyperlink
saying "Did you mean Nathan Larson?"

This would save users time re-entering mistyped search strings.


Version: unspecified
Severity: enhancement

bzimport added a project: MediaWiki-Search. Via Conduit · Nov 21 2014, 7:04 PM
bzimport added a subscriber: Unknown Object (MLST).
bzimport set Reference to bz974.
bzimport created this task. Via Legacy · Dec 2 2004, 4:56 AM
bzimport added a comment. Via Conduit · May 24 2005, 11:20 PM

avarab wrote:

*** Bug 2236 has been marked as a duplicate of this bug. ***

brion added a comment. Via Conduit · Sep 11 2005, 2:44 AM

*** Bug 3418 has been marked as a duplicate of this bug. ***

bzimport added a comment. Via Conduit · Sep 26 2005, 5:31 PM

muthus_post wrote:

I think we should integrate Aspell with MediaWiki to suggest spellings
in case the string the user typed is *not* an exact match, or has
more than one match. It makes a lot of sense to have something like
Aspell generate spelling suggestions and show them to the user when
those conditions are met.

brion added a comment. Via Conduit · Sep 26 2005, 7:34 PM

The Lucene search includes a fuzzy match feature which does similar; it's currently
disabled for speed reasons.

bzimport added a comment. Via Conduit · Sep 26 2005, 10:42 PM

muthus_post wrote:

If 'speed reasons' means the time to complete a search,
then I'd guess it's not worth disabling. You should enable this feature.

If, on the other hand, you're worried about server usage, then I agree
it's an issue.

I think it's reasonable to assume that someone will be more satisfied if
a search query returns some useful results and takes some time to do so,
rather than turning up nothing quickly. If we can get some relevant
results, it's worth trying for.

brion added a comment. Via Conduit · Sep 27 2005, 2:30 AM

time to complete search ~= server usage

bzimport added a comment. Via Conduit · Oct 7 2005, 7:50 AM

avarab wrote:

*** Bug 3638 has been marked as a duplicate of this bug. ***

bzimport added a comment. Via Conduit · Oct 14 2005, 10:06 AM

gangleri wrote:

Is this the same as:

bug 2486: Automatic wiki page name suggestion similar as "Google Suggest"

bzimport added a comment. Via Conduit · Mar 13 2006, 4:28 PM

Tuukka wrote:

Could we consider some algorithm that uses fewer server resources? For example,
keep a mapping from the "lowercaseletteronly" version of each title to any
matching "Lower-case Letter Onlys".

There could also be such a mapping for What links here, Wiktionary, and Commons
links, so the "no such article" page could tell whether the links actually lead
to something.
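Tuukka's case-folded mapping could be sketched roughly like this (a toy in-memory version with made-up helper names and sample titles, not MediaWiki code; it assumes stripping everything but ASCII letters):

```python
import re
from collections import defaultdict

def normalize(title: str) -> str:
    # Reduce a title to its "lowercaseletteronly" form:
    # "Lower-case Letter Onlys" -> "lowercaseletteronlys"
    return re.sub(r"[^a-z]", "", title.lower())

def build_index(titles):
    # Map each normalized key to every title sharing that key.
    index = defaultdict(list)
    for title in titles:
        index[normalize(title)].append(title)
    return index

titles = ["Lower-case Letter Onlys", "Main Page"]
index = build_index(titles)
print(index[normalize("LOWERCASE letter-onlys")])  # -> ['Lower-case Letter Onlys']
```

Since the key is computed once per title, lookups are a single index probe, which keeps the per-search cost low.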

bzimport added a comment. Via Conduit · Apr 11 2006, 9:48 AM

gangleri wrote:

*** Bug 5533 has been marked as a duplicate of this bug. ***

bzimport added a comment. Via Conduit · May 14 2006, 8:59 PM

vossman77 wrote:

*** Bug 3139 has been marked as a duplicate of this bug. ***

bzimport added a comment. Via Conduit · Jul 11 2006, 5:28 PM

petdog wrote:

Can't you just use the levenshtein function (which has been included in PHP
since 4.0.1) against the list of page titles, and choose the [two] closest
results as the suggested keyword(s)?
I hope that wouldn't be too much work for the servers, because it would
certainly be a very good thing for the users :D

Sorry for my bad English, I hope you understand.

__
petdog@gmail.com
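petdog's approach can be sketched like this (a toy illustration in Python rather than PHP; `suggest`, the sample titles, and the query are made up for the example):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(query, titles, n=2):
    # Pick the n titles closest to the query by edit distance.
    return sorted(titles, key=lambda t: levenshtein(query.lower(), t.lower()))[:n]

print(suggest("Nathaz Larson", ["Nathan Larson", "Nathan Lane", "Main Page"]))
# 'Nathan Larson' ranks first
```

Note that this scans every title per query, which is the cost daniel raises later in the thread, so it only suits small wikis.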

bzimport added a comment. Via Conduit · Jul 11 2006, 6:00 PM

harald wrote:

*** This bug has been marked as a duplicate of 2486 ***

bzimport added a comment. Via Conduit · Jul 11 2006, 9:16 PM

vossman77 wrote:

*** Bug 2486 has been marked as a duplicate of this bug. ***

daniel added a comment. Via Conduit · Jul 11 2006, 9:20 PM

@petdog: run the levenshtein function against all titles in a wiki for every
search? The English Wikipedia has more than a million entries, and several
searches per second... that would never work. It might be nice for a very small
wiki, though; it shouldn't be too hard to write a search plugin that does it.

bzimport added a comment. Via Conduit · Jul 11 2006, 9:54 PM

kop wrote:

Why would it be any harder to put levenshtein function values into a database
and search them than it is to put regular words into a database and search them?
Is the load on the servers so high that you can't levenshtein-ize the user's
input to search against the data?

daniel added a comment. Via Conduit · Jul 11 2006, 9:57 PM

@Karl: that's not how it works. The levenshtein function takes two parameters:
the search term and the title. You can't pre-compute distances for all possible
search terms.

bzimport added a comment. Via Conduit · Jul 11 2006, 10:28 PM

kop wrote:

My bad. Ah well. But you could still use soundex, although you'd have to find
equivalent functions for languages other than English.

Hmm...

While you can't pre-compute distances for all possible search terms, you might
be able to pre-compute distances for search terms that are entered by users and
not found. I'm sure Google is doing something like this, using the user's input
to guide the search process. Alternately, you could just watch what the users
do: if they don't find something at first, then find it later, and the
levenshtein distance between the two queries is small, save the found page as a
suggestion. There's a lot of information supplied to Wikipedia by users'
searches. Research project? Maybe. Unique opportunity, surely.
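A soundex key groups strings that sound alike under one pre-computable code, so lookups become an index probe rather than a pairwise comparison. A simplified sketch of American Soundex (illustrative only, and English-centric as the comment above notes):

```python
def soundex(name: str) -> str:
    # Simplified American Soundex: keep the first letter, map the rest to
    # digit classes, collapse runs of the same code, pad/truncate to 4 chars.
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    name = name.lower()
    out = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # 'h'/'w' do not break a run of the same code
            prev = code
    return (out + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # -> R163 R163
```

A key could be stored per title at save time, so a misspelled query like "Rupert" would retrieve "Robert" from the same bucket in one lookup.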

brion added a comment. Via Conduit · Jul 22 2006, 10:25 PM

*** Bug 6778 has been marked as a duplicate of this bug. ***

bzimport added a comment. Via Conduit · Nov 10 2006, 2:43 PM

Eneas wrote:

What about creating a list of unsuccessful search terms and running new entries
each week/month against the levenshtein function? The results are saved in a table.

When you perform a search with a mistyped search term that matches the list of
unsuccessful searches, you get the pre-computed result of the algorithm.
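Eneas's precomputed table could look roughly like this (a toy sketch; the stand-in `distance` uses Python's difflib rather than a real Levenshtein implementation, and all names and sample data are invented for the example):

```python
import difflib

def distance(a: str, b: str) -> float:
    # Stand-in string distance; the real batch job could use Levenshtein.
    return 1 - difflib.SequenceMatcher(None, a, b).ratio()

# Offline batch job (weekly/monthly): map each logged failed search
# to the nearest existing title, once.
titles = ["Nathan Larson", "Main Page"]
failed_searches = ["nathaz larson", "mian page"]
suggestion_table = {
    term: min(titles, key=lambda t: distance(term, t.lower()))
    for term in failed_searches
}

def did_you_mean(query: str):
    # At search time this is a single table lookup, not a full title scan.
    return suggestion_table.get(query.lower())

print(did_you_mean("Mian Page"))  # -> Main Page
```

The expensive comparison runs only in the batch job, so the per-query cost at search time stays constant.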

brion added a comment. Via Conduit · Jan 18 2007, 10:10 PM

*** Bug 8694 has been marked as a duplicate of this bug. ***

bzimport added a comment. Via Conduit · Jan 30 2007, 10:02 PM

jasonspiro4 wrote:

(In reply to comment #20)

What about creating a list of unsuccessful search terms and running new entries
each week/month against the levenshtein function? The results are saved in a table.

When you perform a search with a mistyped search term that matches the list of
unsuccessful searches, you get the pre-computed result of the algorithm.

Sounds like a fine workaround to me. Site admins: do you think it would work?

(In reply to comment #5)

[...] someone will be more satisfied if a search query returns some useful
results and takes some time to do so, rather than turning up nothing quickly.

I agree. IMO, spell checking is so useful for so many users that even if it
slowed all searches down by twenty-five percent, it would still be worthwhile.
Should I perhaps take a poll of a few fellow Wikipedia readers at my community
college and see what they think?

I am taking the liberty of bumping this bug's priority to High.

Dcoetzee added a comment. Via Conduit · Jun 17 2007, 7:54 AM

Don't go reinventing the wheel. This is a generic query task that could apply to any MySQL database. Lucene is ideal, but I can see why it's disabled.

One relatively straightforward approach is to do the Levenshtein distance thing, but first use a spatial query to narrow down the number of points you have to evaluate. Search terms (and titles) are associated with vectors specifying the frequency of each letter. We then perform a spatial query using a spatial index to find the nearest 50 or so points in this space, and do Levenshtein against each of them for the final ranking.

Unfortunately, the OpenGIS spatial support appears to cover only low numbers of dimensions. Nevertheless, this could be carried out by keeping a separate spatial index "off to the side" outside of MySQL and updating it as titles are updated. Just an idea.
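The letter-frequency narrowing step might be sketched like this (brute-force nearest-neighbor for illustration; a real version would use the spatial index described above, and the sample titles are invented):

```python
import math
import string
from collections import Counter

def letter_vector(s: str):
    # 26-dimensional vector of letter counts, ignoring case and non-letters.
    counts = Counter(ch for ch in s.lower() if ch in string.ascii_lowercase)
    return [counts[ch] for ch in string.ascii_lowercase]

def candidates(query: str, titles, k=50):
    # Keep only the k titles nearest to the query in letter-frequency space;
    # a final ranking pass (e.g. Levenshtein) would then run on just these.
    q = letter_vector(query)
    return sorted(titles, key=lambda t: math.dist(q, letter_vector(t)))[:k]

titles = ["Nathan Larson", "Main Page", "Levenshtein distance"]
print(candidates("Nathaz Larson", titles, k=2))
# 'Nathan Larson' ranks first
```

Transpositions leave the letter-frequency vector unchanged, which is one reason a second, order-sensitive ranking pass is still needed.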

bzimport added a comment. Via Conduit · Jun 17 2007, 8:01 AM

lowzl wrote:

The developers may already be aware of this, but there is a Levenshtein distance user-defined function in C for MySQL: http://joshdrew.com/mysql_levenshtein_udf-1.0.tar.gz

Though as the previous poster points out, it would still be necessary to narrow down the number of entries to consider for a large database like Wikipedia.

bzimport added a comment. Via Conduit · Jun 17 2007, 8:08 AM

ayg wrote:

Does Rainman's Lucene 2.0 extension address this?

bzimport added a comment. Via Conduit · Jun 17 2007, 10:24 AM

rainman wrote:

Simetrical: Not yet. Probably in 2.1.

Derrick: One way of narrowing down the results is using ngram indexes. Ngrams do a fair job of determining whether two strings are statistically similar, and they are well-suited to spelling errors because they preserve the linkage between adjacent letters (I'm not sure if this is the case with letter-frequency-based spatial indexes as well?). So, you could make a query on an ngram Lucene index, fetch the first 50 results, and then do some sophisticated analysis to figure out which word(s) suit the query best.

There are some caveats, however. One is that you want spell-checking to be context-dependent, i.e. the spell checker should always look at the phrase, not individual words. Indexing all phrases is probably not feasible. A possible solution might be to build the ngram index in iterations, adding phrases that could be confused with long words and such. Other caveats are that you might want to favor spellings that correspond to article titles, or to some interwiki link (in which case you also might want a translation), and that you might want to do spell checking even if the search returns some results, and so on.

However, this is still theory. I guess it makes a huge difference when you go from a small system with a few hundred MB of text to a system with tens of GB of text, and want the spell checker always to find the most relevant articles and words.
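The ngram narrowing described above could be sketched with character trigrams and a Dice coefficient (a toy in-memory version with invented sample titles; Lucene's actual ngram index is implemented differently):

```python
def ngrams(s: str, n: int = 3):
    # Pad with spaces so word boundaries contribute their own grams.
    s = f"  {s.lower()} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def dice(a: str, b: str) -> float:
    # Dice coefficient on trigram sets: 1.0 means identical gram sets.
    ga, gb = ngrams(a), ngrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def ngram_candidates(query: str, titles, k=50):
    # Fetch the k most gram-similar titles, then rerank them more carefully.
    return sorted(titles, key=lambda t: dice(query, t), reverse=True)[:k]

titles = ["Nathan Larson", "Main Page", "Levenshtein distance"]
print(ngram_candidates("Nathaz Larson", titles, k=2))
# 'Nathan Larson' ranks first
```

Because a single typo disturbs only the few grams around it, most grams still match, which is why ngrams tolerate spelling errors better than whole-word matching.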

werdna added a comment. Via Conduit · Jun 29 2008, 8:36 AM

*** Bug 14680 has been marked as a duplicate of this bug. ***

brion added a comment. Via Conduit · Aug 4 2008, 5:59 AM

*** Bug 15021 has been marked as a duplicate of this bug. ***

brion added a comment. Via Conduit · Dec 28 2008, 9:54 PM

Note that the "did you mean" thingy is now active in the Lucene search backend used on Wikipedia. Yay!

However there's not yet support in the default MySQL backend.

bzimport added a comment. Via Conduit · Dec 30 2011, 4:18 PM

rainman wrote:

Note that the votes are for enabling this on WMF wikis (which is done), not necessarily to implement it in MySQL.

mxn added a subscriber: mxn. Via Web · Nov 24 2014, 8:51 PM
Nemo_bis added a subscriber: Nemo_bis. Via Web · Jan 16 2015, 11:31 PM
In T2974#36573, @brion wrote:

Note that the "did you mean" thingy is now active in the Lucene search backend used on Wikipedia. Yay!

However there's not yet support in the default MySQL backend.

Is it actually possible to do such a thing with the MySQL backend? That is, is there some standard function/library/whatever to do it? Advanced search features are what Lucene, ElasticSearch etc. are for; it's not unreasonable to rule them out from the scope of the default MySQL search if we'd have to build them ourselves.
