
Enable more of the unambiguous/less ambiguous scripts for language identification
Open, Medium, Public


The languages currently enabled for identification on various Wikipedias are based on what was present in a sample of roughly 2,000 queries. Enabling additional languages not present in the sample can cause false positives and false negatives, so it was generally avoided.

However, some scripts are unambiguous (or at least overwhelmingly likely to be one particular language), so enabling them will create no new false negatives and, though they would apply rarely, could provide additional true positive identification results.

  • Bengali (bn), Greek (el), Hebrew (he), Armenian (hy), Georgian (ka), Korean (ko), Burmese (my), Telugu (te), and Thai (th) have been enabled for some wikis and are generally unambiguous.
  • Other relatively unambiguous language/script pairs for which we have models include Tamil (query-based), and Gujarati & Oriya (wiki-text–based).
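
As a rough illustration of why these scripts are cheap to identify, the sketch below guesses a language purely from the dominant script of a query. It is not the identifier used on the wikis: the `SCRIPT_TO_LANG` table, the `guess_by_script` function, and the 0.8 threshold are all hypothetical, and the codes for Tamil (ta), Gujarati (gu), and Oriya (or) are the standard ISO 639-1 codes rather than values taken from this task.

```python
import unicodedata
from collections import Counter

# Illustrative mapping (an assumption, not project configuration):
# first word of the Unicode character name -> language code.
SCRIPT_TO_LANG = {
    "BENGALI": "bn", "GREEK": "el", "HEBREW": "he", "ARMENIAN": "hy",
    "GEORGIAN": "ka", "HANGUL": "ko", "MYANMAR": "my", "TELUGU": "te",
    "THAI": "th", "TAMIL": "ta", "GUJARATI": "gu", "ORIYA": "or",
}


def guess_by_script(query, threshold=0.8):
    """Return a language code if one mapped script dominates the letters, else None."""
    counts = Counter()
    for ch in query:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, and combining marks
        # Character names start with the script name, e.g. "GREEK SMALL LETTER ALPHA".
        name = unicodedata.name(ch, "")
        counts[name.split(" ")[0]] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    script, n = counts.most_common(1)[0]
    if n / total < threshold:
        return None  # mixed-script query: leave it to the statistical models
    return SCRIPT_TO_LANG.get(script)
```

A query such as "Ελληνικά" would map to `el`, while a Latin-script or mixed-script query returns `None` and falls through to whatever model-based identification is already in place.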

Other languages do not have particularly unambiguous scripts, but are the overwhelmingly likely language when no other examples in that script appear in the sample. For example, if there are no Cyrillic queries in a sample, then future Cyrillic queries are unlikely to occur, but if they do, they are more likely to be Russian, just because Russian is more common in general.

  • Arabic (ar), Hindi (hi), and Russian (ru) are overwhelmingly the best guesses for the rare queries that show up in their respective scripts.
  • Japanese (ja) is an unusual case because hiragana and katakana are unambiguously Japanese, while the kanji are borrowed Chinese characters, so it would require more careful consideration and testing before being enabled.
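
The Japanese case can be sketched the same way. The hypothetical `classify_cjk` function below treats any hiragana or katakana as a strong signal for Japanese and marks queries made only of Han ideographs as ambiguous between Japanese and Chinese. It is a minimal sketch of the distinction described above, assuming Unicode character names are a good enough proxy for script, and it is not a tested detection strategy.

```python
import unicodedata


def classify_cjk(query):
    """Return 'ja' if kana is present, 'ambiguous' if only Han ideographs, else None."""
    has_kana = False
    has_han = False
    for ch in query:
        name = unicodedata.name(ch, "")
        if name.startswith(("HIRAGANA", "KATAKANA")):
            has_kana = True  # kana is used only for Japanese
        elif name.startswith("CJK UNIFIED IDEOGRAPH"):
            has_han = True   # shared by Chinese hanzi and Japanese kanji
    if has_kana:
        return "ja"
    if has_han:
        return "ambiguous"  # would need a model or other evidence to decide
    return None
```

Kanji-only queries are the ones that would compete with Chinese, so they are the part that needs the careful testing mentioned above.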

So the task is to enable the "unambiguous" languages on all nine wikis that have language ID enabled, add Arabic, Hindi, and Russian where they are not currently enabled, and test. Also investigate Japanese, though it may cause inaccurate results in competition with Chinese.