
EPIC: Improve results when users enter searches in other languages
Closed, Resolved (Public)

Description

Data and anecdotes suggest that a large number of "zero results" searches happen because users entered a query in a language other than the one supported by the wiki they searched on.

We are considering auto-detecting the language of the query and responding accordingly, but the specifics have not yet been decided.

Event Timeline

ksmith raised the priority of this task to High.
ksmith updated the task description.
ksmith added subscribers: Deskana, Aklapper, ksmith.

Language detection on very short strings does not work well. The plugin mentioned here uses the detector made by Cybozu Labs. In my previous job we used this detector for language detection on content we collected. It works very well on content like web news, but its performance on tweets was very bad. They added new profiles trained on tweets (which I never tested), but I'm quite pessimistic about getting good performance on search queries.
Could we do some evaluations first?

Evaluate whether the search query, applied to the proper language wiki, would return results:

  • Take the top-N "no result" search queries
  • Run them on all other wikis
  • Count the number of hits per wiki
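The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `search_hits` is a hypothetical callable standing in for whatever actually runs a query against a given wiki and returns its hit count, the toy index is made-up data, and "count the number of hits" is interpreted here as "count the queries that get at least one result on that wiki".

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def top_zero_result_queries(zero_result_log: Iterable[str], n: int) -> List[str]:
    """Return the n most frequent queries from a log of zero-result searches."""
    return [query for query, _ in Counter(zero_result_log).most_common(n)]


def cross_wiki_hits(
    queries: Iterable[str],
    origin: str,
    wikis: List[str],
    search_hits: Callable[[str, str], int],
) -> Dict[str, int]:
    """For each wiki other than the origin, count the queries that
    return at least one result there."""
    rescued = {wiki: 0 for wiki in wikis if wiki != origin}
    for query in queries:
        for wiki in rescued:
            if search_hits(wiki, query) > 0:
                rescued[wiki] += 1
    return rescued


# Made-up toy index standing in for real search backends (assumption).
TOY_INDEX = {"en": {"berlin", "paris"}, "fr": {"paris"}, "de": {"berlin"}}


def toy_search_hits(wiki: str, query: str) -> int:
    """Hypothetical search function: 1 hit if the toy index knows the query."""
    return 1 if query in TOY_INDEX[wiki] else 0


log = ["berlin", "tokyo", "berlin", "london", "tokyo", "berlin"]
top = top_zero_result_queries(log, 2)
row = cross_wiki_hits(top, "fr", ["en", "fr", "de"], toy_search_hits)
print(top, row)
```

Each call to `cross_wiki_hits` produces one row for a single origin wiki; repeating it for every origin gives one row per wiki.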

This would result in a matrix like:

origin | number of "zero result" queries | en  | fr  | de  | jp | ... | "Zero result" decreased by
fr     | 1387                            | 123 | N/A | 2   | 1  |     | 2%
en     | 293847                          | N/A | 234 | 324 | 29 |     | 4%
de     | 2847                            | 234 | 34  | N/A | 2  |     | 3%
...    | ...

This would allow us to evaluate whether running a "no result" query on another wiki can significantly reduce the number of "no result" searches, and how much that helps to achieve the Q1 goals.
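One way to compute a "decreased by" figure for a single origin wiki is sketched below. It assumes a hypothetical input shape (each zero-result query mapped to its hit counts on the other wikis) and made-up numbers. A query is counted once even if several wikis return results for it, so the per-wiki counts are not simply summed.

```python
from typing import Dict


def zero_result_reduction(hits_by_query: Dict[str, Dict[str, int]]) -> float:
    """Percentage of zero-result queries that get at least one hit
    on any other wiki.

    hits_by_query maps each zero-result query to {wiki: hit count}
    for the other wikis (hypothetical input shape).
    """
    total = len(hits_by_query)
    if total == 0:
        return 0.0
    rescued = sum(
        1
        for hits in hits_by_query.values()
        if any(count > 0 for count in hits.values())
    )
    return 100.0 * rescued / total


# Made-up example data, for illustration only.
example = {
    "berlin": {"en": 3, "de": 5},  # found on two wikis, counted once
    "paris": {"en": 2, "de": 0},
    "tokyo": {"en": 0, "de": 0},
    "xyzzy": {"en": 0, "de": 0},
}
reduction = zero_result_reduction(example)
print(f"{reduction:.0f}%")
```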

We would then know which "fallback wiki" is best for running the query when we encounter "no result":

  • fr: best fallback wiki is en, then de
  • de: best fallback wiki is en, then jp
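Deriving such a ranking from the matrix could look something like the sketch below. The counts are made up purely so that the output matches the two example orderings above; `None` marks the origin's own wiki (the N/A cells), and ties are broken alphabetically as an arbitrary choice.

```python
from typing import Dict, List, Optional


def rank_fallback_wikis(
    matrix: Dict[str, Dict[str, Optional[int]]]
) -> Dict[str, List[str]]:
    """For each origin wiki, order the other wikis by how many of its
    zero-result queries they can answer, best first."""
    ranking = {}
    for origin, row in matrix.items():
        candidates = [
            (wiki, count)
            for wiki, count in row.items()
            if wiki != origin and count is not None
        ]
        # Highest count first; alphabetical order breaks ties.
        candidates.sort(key=lambda pair: (-pair[1], pair[0]))
        ranking[origin] = [wiki for wiki, _ in candidates]
    return ranking


# Made-up counts, for illustration only.
matrix = {
    "fr": {"en": 120, "fr": None, "de": 15, "jp": 3},
    "de": {"en": 200, "fr": 10, "de": None, "jp": 40},
}
ranking = rank_fallback_wikis(matrix)
print(ranking)
```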

Language detection then becomes an "optimization" for choosing the proper "fallback wiki", not the initial condition.

@dcausse: Great analysis, and I think a test along those lines would be worthwhile (assuming it's not too difficult).

However, I would think that single-word language detection would do poorly in terms of finding a single language but excellent at finding a small set of possible languages. By extension, I would expect similar results for short strings. So rather than jumping immediately to searching all other wikis, I would be interested in searching only that small set of "candidate" wikis.

Or maybe language detection libraries aren't designed to return multiple possible languages?

I totally agree: a language detector will help to limit the query to a subset of other wikis. The Cybozu language detector returns a list of candidates, each with a "confidence score" (note that I don't know how it behaves within the Elasticsearch plugin).
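As a rough sketch of how such a candidate list might be turned into a small set of wikis to search: the code below assumes the detector output has already been converted to `(language code, confidence)` pairs and that language codes map directly to wiki codes. The threshold, the cap on candidates, and the example scores are all hypothetical, not values taken from the Cybozu detector or the plugin.

```python
from typing import List, Tuple


def candidate_wikis(
    detections: List[Tuple[str, float]],
    origin: str,
    min_confidence: float = 0.1,
    max_candidates: int = 3,
) -> List[str]:
    """Pick the wikis worth searching, given detector candidates.

    detections: (language code, confidence) pairs in any order.
    origin: the wiki that already returned zero results (skipped).
    """
    ranked = sorted(detections, key=lambda pair: pair[1], reverse=True)
    wikis: List[str] = []
    for lang, confidence in ranked:
        if confidence < min_confidence:
            break  # everything after this is even less likely
        if lang != origin and lang not in wikis:
            wikis.append(lang)
        if len(wikis) == max_candidates:
            break
    return wikis


# Made-up scores for a short, ambiguous query.
detections = [("pt", 0.35), ("es", 0.45), ("en", 0.05), ("it", 0.15)]
from_en = candidate_wikis(detections, origin="en")
from_pt = candidate_wikis(detections, origin="pt")
print(from_en, from_pt)
```

Keeping several candidates rather than only the top one reflects the point made above: a short string may not identify one language reliably, but it can still narrow the search to a few wikis.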

Deskana claimed this task.

I think this task has long since served its purpose. Language detection was significantly improved with TextCat and other efforts. Any future work to improve language detection can be put under another epic. Enough has been done to call this resolved.