Experiment with equalizing training set sizes, since very small training sets may yield less accurate language models. That is, extract much more data for wikis with smaller training sets so their language models are more fine-grained. These languages ended up with fewer than 20K queries to build their language models on: Armenian, Bosnian, Cantonese, Hindi, Latin, Latvian, Macedonian, Malayalam, Mongolian, Serbo-Croatian, Swahili, Tamil, Telugu, Urdu. Some need it more than others—languages with distinctive character sets already do well. (This could be tested with data from T121539 or T121541 using the current best language identification module.)
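A minimal sketch of the equalization step, assuming query data is already grouped by language (the function name, the 20K target, and the toy data are illustrative, not part of the actual pipeline): languages at or above the target are downsampled to it, and languages below it are flagged as needing more extraction.

```python
import random

def equalize_training_sets(queries_by_lang, target_size, seed=0):
    """Downsample each language's queries toward a common target size.

    queries_by_lang: dict mapping language code -> list of query strings.
    Returns (equalized, needs_more): languages with at least target_size
    queries are randomly sampled down to target_size; languages below it
    are kept as-is and listed in needs_more for further data extraction.
    """
    rng = random.Random(seed)  # fixed seed so sampling is reproducible
    equalized, needs_more = {}, []
    for lang, queries in queries_by_lang.items():
        if len(queries) >= target_size:
            equalized[lang] = rng.sample(queries, target_size)
        else:
            equalized[lang] = list(queries)
            needs_more.append(lang)
    return equalized, needs_more

# Toy example: "hy" (Armenian) falls under the 20K target and is flagged.
data = {"en": [f"q{i}" for i in range(30000)],
        "hy": [f"q{i}" for i in range(12000)]}
sets, short = equalize_training_sets(data, target_size=20000)
```

Here `short` would be `["hy"]`, signaling that more Armenian data should be extracted before rebuilding its model.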
Estimate: 4 hours per language to extract and clean data, and to build and test models.