Automatically switch to user's query language if user types characters associated with only one language
Closed, Declined · Public

Description

Quoth the @Ijon: "Here's a relatively easy improvement: if the user types more than one character unambiguously associated with another language (relatively few cases, e.g. Hangul for Korean), automatically switch the language and search in that language."

Neat little idea. Little quality of life improvements like this often don't really move the needle, but they're also often quite simple to implement and contribute a lot to user delight. Worth considering!

Event Timeline

If we want to go beyond a set of rules/regexes, libraries like https://github.com/jprante/elasticsearch-langdetect might help.

debt triaged this task as Medium priority. Mar 14 2016, 11:52 PM
debt subscribed.

Hi @Ijon and @whym - this is great stuff, we'll take a look at it!

Libraries like the one @whym suggested can be great for language detection on larger texts, but terrible for really short texts (like tweets and search queries). We're currently working on detecting the languages of queries and redirecting to another wiki when they get no results, but the only way to get decent results is to limit the scope of languages under consideration, which might not make sense on the portal. Also, queries seem to differ from other texts, so query-specific language detection (i.e., what we've done with TextCat) might be better if you want to go that way.

There are definitely a few character sets that indicate a particular language, like @Ijon suggested—Korean, some Japanese characters, a few Persian characters, Thai, Armenian, Georgian, and other full character sets. There are other individual letters that are specific to particular languages, too. If you look at the list of Cyrillic letters, a few are specific to particular languages. Similarly with the list of Latin letters, there are many vowels with certain diacritics that are very specific to Vietnamese, for example.
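As a rough illustration of the "full character set" case, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `UNAMBIGUOUS_RANGES` table, the function names, and the choice of which Unicode blocks count as unambiguous. Several of these scripts are also used for other languages, so a real table would need review. The two-character minimum follows the "more than one character" wording in the task description.

```python
# Hypothetical table: (first code point, last code point, language code).
# Which blocks are "unambiguous enough" is an assumption, not a vetted list.
UNAMBIGUOUS_RANGES = [
    (0xAC00, 0xD7A3, "ko"),  # Hangul syllables
    (0x1100, 0x11FF, "ko"),  # Hangul Jamo
    (0x3040, 0x309F, "ja"),  # Hiragana
    (0x30A0, 0x30FF, "ja"),  # Katakana
    (0x0E00, 0x0E7F, "th"),  # Thai
    (0x0530, 0x058F, "hy"),  # Armenian
    (0x10A0, 0x10FF, "ka"),  # Georgian
]


def language_for_char(ch):
    """Return the language code for one character, or None if ambiguous."""
    cp = ord(ch)
    for first, last, lang in UNAMBIGUOUS_RANGES:
        if first <= cp <= last:
            return lang
    return None


def suggest_language(query, min_chars=2):
    """Suggest a language only when exactly one language is signalled
    by at least `min_chars` unambiguous characters; otherwise None."""
    counts = {}
    for ch in query:
        lang = language_for_char(ch)
        if lang is not None:
            counts[lang] = counts.get(lang, 0) + 1
    if len(counts) != 1:
        return None  # no signal, or conflicting signals
    lang, n = next(iter(counts.items()))
    return lang if n >= min_chars else None
```

With this table, `suggest_language("ประเทศไทย")` suggests Thai, while a Cyrillic query such as `"Москва"` gets no suggestion because Cyrillic is shared by many languages. Individual language-specific letters (certain Cyrillic letters, Vietnamese vowels with diacritics) would need a separate, per-character table.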

You could do some analysis of traffic and see what languages are most commonly typed in searches and make some probabilistic guesses. So, if 99.99% of Arabic-script queries are in Arabic then maybe switching from English to Arabic is not a bad guess. OTOH, if it's 33% Arabic, 33% Persian, and 33% Urdu, guessing might be worse than doing nothing. And you probably shouldn't switch, e.g., from Persian to Arabic or vice versa, without a really good reason, since the writing systems overlap so much.
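The traffic-based guess could be sketched roughly as follows. The function name, the `threshold` default, and the shape of the input (a dict of per-language query counts for one script) are all hypothetical, and the numbers in the usage note are made up to mirror the examples above, not measured traffic.

```python
def pick_language(lang_counts, current_lang, threshold=0.99):
    """Pick a language to switch to for a query in a shared script.

    lang_counts: hypothetical mapping of language code -> number of
    observed queries in that script (e.g. all Arabic-script queries).
    Returns a language code, or None when guessing seems worse than
    doing nothing.
    """
    total = sum(lang_counts.values())
    if total == 0:
        return None  # no data, so no guess
    # If the user is already on a language that uses this script
    # (e.g. Persian for Arabic script), do not switch away from it.
    if current_lang in lang_counts:
        return None
    best_lang, best_count = max(lang_counts.items(), key=lambda kv: kv[1])
    # Only switch when one language clearly dominates the script.
    if best_count / total >= threshold:
        return best_lang
    return None
```

For example, `pick_language({"ar": 9999, "fa": 1}, "en")` returns `"ar"`, while an even three-way split between Arabic, Persian, and Urdu returns `None`, as does any call where the user is already on one of those languages.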

We could try to do some research or analysis and look for very distinct n-grams for particular languages, possibly based on the TextCat language models we've gotten so far—but that starts getting into a fairly heavyweight process.

You could definitely start with the fairly unambiguous cases.

One last consideration: there is a non-zero number of nerds (there's me—that's 1!) who are very happy to search for "wrong-language" queries on a wiki, especially names of people and places, because it works pretty well! e.g., search enwiki for ประเทศไทย or Москва́.

Moving to the backlog for now - we don't have the capacity on our team to tackle this right now.

One last consideration: there is a non-zero number of nerds (there's me—that's 1!) who are very happy to search for "wrong-language" queries on a wiki, especially names of people and places, because it works pretty well! e.g., search enwiki for ประเทศไทย or Москва́.

I would sign up for this nerdy list too, and that's why I believe the change should not be implemented. Besides the cases where typing in the "wrong" language helps you find what you're looking for, there are also full articles on Wikipedias about characters peculiar to some languages. A person might be interested in looking for those, rather than having the search switched to that language.

Closing this ticket - we've been implementing TextCat language detection in CirrusSearch, which will help users get search results when we detect that they've typed a query in a language different from that of the wiki they are on. Here's one of the most recent releases to production: T142413 and the parent task: T118278.