Experiment with equalizing training set sizes, since very small training sets may produce less accurate language models. That is, extract substantially more data for wikis with smaller training sets so that their language models are more fine-grained. The following languages ended up with fewer than 20K queries to build their language models on: Armenian, Bosnian, Cantonese, Hindi, Latin, Latvian, Macedonian, Malayalam, Mongolian, Serbo-Croatian, Swahili, Tamil, Telugu, Urdu. Some need this more than others; languages with distinctive character sets (e.g., Armenian or Tamil) already do well. (This could be tested with data from T121539 or T121541 using the current best language identification module; see the sketch below.)
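As a rough illustration of what the per-language model building could look like, the sketch below constructs a character n-gram profile (Cavnar & Trenkle style) for each language from a file of training queries and flags languages that fall under the 20K-query threshold. The directory layout, the `build_profile` helper, and the 1- to 5-gram / top-3000 settings are assumptions for illustration only, not the actual pipeline or the current language identification module.

```python
# Hypothetical sketch: build character n-gram language-model profiles
# from per-language query files and flag languages whose training sets
# fall below the 20K-query threshold. File layout, threshold handling,
# and n-gram settings are illustrative assumptions.
from collections import Counter
from pathlib import Path

MIN_QUERIES = 20_000   # threshold mentioned above
MAX_NGRAM = 5          # assumed: character 1- to 5-grams
PROFILE_SIZE = 3000    # assumed: keep only the top 3000 n-grams

def ngrams(text: str, n: int):
    """Yield all character n-grams of length n from text."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]

def build_profile(queries):
    """Rank n-grams by frequency; the ranked list is the language profile."""
    counts = Counter()
    for q in queries:
        q = q.strip().lower()
        for n in range(1, MAX_NGRAM + 1):
            counts.update(ngrams(q, n))
    return [g for g, _ in counts.most_common(PROFILE_SIZE)]

# Assumed layout: one UTF-8 file of queries per language, one query per line.
for path in sorted(Path("training-queries").glob("*.txt")):
    queries = path.read_text(encoding="utf-8").splitlines()
    profile = build_profile(queries)
    flag = "  <-- under 20K, candidate for more data" if len(queries) < MIN_QUERIES else ""
    print(f"{path.stem}: {len(queries)} queries, {len(profile)} n-grams{flag}")
```

In an n-gram approach like this, identification ranks candidate languages by how far a query's n-gram profile is out of place relative to each language profile; with sparse training data the tail of the ranked profile is noisy, which is why small training sets can hurt accuracy.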
T121545 is probably easier and may produce decent models for some languages. Together, this task and T121545 could obviate the need for T121544 for some of them.
Estimate: 4 hours per language to extract and clean data, and to build and test models.