If we want to deploy language detection to maximum effect on wikis besides enwiki, we need to know which languages are most often used there (in failed queries), and limit language detection to "valuable" languages. E.g., on enwiki there aren't that many French queries, and many more queries are incorrectly identified as French than correctly identified, making French detection a net loss there. Obviously, we'd need French on frwiki. We can generally work this out to within a few percent with a sample of 1,000 queries.
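To make the net-loss bookkeeping concrete, here is a minimal sketch, assuming a hand-tagged sample in a TSV; the file name and the columns (query, true_lang, detected_lang) are hypothetical stand-ins for whatever format the tagged samples actually use.

```
# Count, per detected language, how often the detector was right vs. wrong
# on a hand-tagged sample. A language whose wrong identifications outnumber
# its right ones (French on enwiki, in the example above) is a net loss.
from collections import Counter
import csv

def net_value_by_language(path):
    correct, incorrect = Counter(), Counter()
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            detected = row["detected_lang"]
            if detected == row["true_lang"]:
                correct[detected] += 1
            else:
                incorrect[detected] += 1
    for lang in sorted(set(correct) | set(incorrect)):
        verdict = "net gain" if correct[lang] > incorrect[lang] else "net loss"
        print(f"{lang}\t+{correct[lang]}\t-{incorrect[lang]}\t{verdict}")

net_value_by_language("enwiki_sample_tagged.tsv")  # hypothetical file
```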
Work on the top N languages and determine the best mix of languages to use for each of them. Each evaluation set would be ~1,000 zero-results queries from the given wiki, manually tagged by language. Tagging takes half a day to a day if you are familiar with the main language of the wiki, and evaluating a given set of language models against it takes a couple of hours at most; a sketch of that evaluation step follows. (This depends on T121539, to make sure we aren't wasting time on a main language that does not perform well.)
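As an illustration of the evaluation step, here is a sketch that scores candidate language mixes against a tagged sample. It uses the open-source langid.py identifier purely as a stand-in for whatever tools we actually evaluate; the file name, column names, and candidate shortlist are all hypothetical, and raw accuracy is a simplification of the net-gain criterion above.

```
# Try small mixes of candidate languages and pick the one that best matches
# the hand-tagged sample. langid.set_languages() restricts the classifier
# to the given set, which is exactly the "limit detection" idea above.
import csv
from itertools import combinations

import langid

def accuracy_for_mix(rows, mix):
    langid.set_languages(list(mix))
    hits = sum(1 for query, true in rows if langid.classify(query)[0] == true)
    return hits / len(rows)

with open("frwiki_sample_tagged.tsv", encoding="utf-8") as f:  # hypothetical file
    rows = [(r["query"], r["true_lang"]) for r in csv.DictReader(f, delimiter="\t")]

candidates = ["fr", "en", "ar", "ru", "de"]  # hypothetical shortlist for this wiki
best = max(
    (mix for n in range(1, 4) for mix in combinations(candidates, n)),
    key=lambda mix: accuracy_for_mix(rows, mix),
)
print("best mix:", best, "accuracy:", accuracy_for_mix(rows, best))
```

Exhaustively trying mixes of up to three languages is cheap at this sample size; for a longer candidate list, a greedy forward selection would keep the run under the couple-of-hours budget.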
From T118287, the next 20 languages by volume after English are Italian (though known to have many duplicates due to cross-wiki searches), German, Spanish, French, Russian, Japanese, Portuguese, Indonesian, Arabic, Chinese, Dutch, Polish, Czech, Turkish, Persian, Korean, Swedish, Vietnamese, Ukrainian, and Hebrew. (Sorting by "filtered queries" from T118287 instead drops Hebrew in favor of Finnish and gives a slightly different order; the biggest change is Italian, which drops to 9th.)
The estimate is roughly 3/4 of a day per wiki to generate an evaluation set, evaluate it against our current best language identification tools, and select the right mix of languages for that tool set.