I suggest closing T134430 as too much effort and pursuing an easier approach instead. Based on my analysis of several wikis, and on Erik's suggestion that we find one set of languages to use for language detection across the long tail, I suggest that for the next M wikis beyond the "top N" in T121541, down to some cutoff based on size and/or query volume, we use a standard set of language models plus a Wiki-Text model (T121545) built from that wiki.
The Wiki-Text models are not as accurate on query strings as query-based models, but they are easy to generate mostly automatically.
We could deploy a standard list of languages chosen in part for how likely they are to be encountered (English seems to be everywhere) and how distinctive their character set is (text in Greek script is almost always Greek). My current suggested list would be Arabic, Armenian, Chinese, English, Greek, Hebrew, Japanese, Korean, Russian, and Thai, plus a Wiki-Text model for the language of the wiki.
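To make the proposal concrete, here is a minimal sketch of how the per-wiki model list could be assembled. Everything in it is hypothetical: the function name, the ISO 639-1 codes I mapped the languages to, the "query"/"wikitext" labels, the assumption that the standard set uses query-based models, and the choice to skip the extra Wiki-Text model when the wiki's own language is already in the standard set.

```python
# Standard set from the suggestion above, written as ISO 639-1 codes
# (the mapping to codes is an assumption, not part of the proposal):
# Arabic, Armenian, Chinese, English, Greek, Hebrew, Japanese, Korean, Russian, Thai
STANDARD_LANGS = ["ar", "hy", "zh", "en", "el", "he", "ja", "ko", "ru", "th"]


def models_for_wiki(wiki_lang):
    """Return (language, model_kind) pairs to load for one long-tail wiki.

    Assumes the standard set uses query-based models and that the wiki's
    own language is covered by a Wiki-Text model unless it is already in
    the standard set.
    """
    models = [(lang, "query") for lang in STANDARD_LANGS]
    if wiki_lang not in STANDARD_LANGS:
        models.append((wiki_lang, "wikitext"))
    return models
```

For a hypothetical Icelandic wiki, `models_for_wiki("is")` would return the ten standard entries followed by `("is", "wikitext")`.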
I'd suggest a staged rollout to see what kind of feedback we get. If we get reports of misidentified languages, we could add or remove models as necessary. If we don't get any feedback, then either the results are acceptable or no one is using the feature.
Additional features required/desired:
- figure out a way to mix query-based and wiki-text-based models (simple solution: copy wiki-text models to query-based model directory and note which is which in the docs; more complex solution: allow TextCat to take more complex specifications across model directories) [required]
- generic feedback mechanism to allow users to easily rate language detection / results and flag instances where things go wrong. (Need to think about UI and logging: do we translate into all the required languages, or go for generic icons (e.g., smiley face, neutral face, frowny face)? And can we log queries that get poor marks from users to investigate later?) [highly desired]
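For the first item, the "more complex solution" could look roughly like the sketch below: given an ordered list of model directories, pick the first one that has a model for each language and record which kind it came from. This is not TextCat's actual interface. The function name, the `(kind, directory)` pairs, and the `.lm` file extension are assumptions made for illustration.

```python
from pathlib import Path


def resolve_models(languages, model_dirs, ext=".lm"):
    """Map each language to the first directory that holds a model for it.

    model_dirs is an ordered list of (kind, directory) pairs, e.g.
    [("query", "LM-query"), ("wikitext", "LM")]; earlier entries win.
    Languages with no model in any directory are left out of the result.
    """
    resolved = {}
    for lang in languages:
        for kind, directory in model_dirs:
            candidate = Path(directory) / f"{lang}{ext}"
            if candidate.is_file():
                # Keep the kind so docs and logs can say which is which.
                resolved[lang] = (kind, candidate)
                break
    return resolved
```

Listing the query-based directory first means a query-based model is preferred whenever one exists, with the Wiki-Text model as the fallback. The simple solution (copying files into one directory) needs no code, but it loses this provenance unless it is noted in the docs.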