Although T118287 shows that query-based models work better than wiki-text-based language models, good query data can be hard to get. So we should check whether Wikipedia-based language models do better than query-based ones for languages where the query training data is poor. (This could obviate the need for T121544 or T121546 in some cases, and could be tested by re-running the data from T121539 or T121541 with the current best language identification module.)
Need list of target languages to work on (based on T121539).
Estimate: 2-4 hours per language to create a corpus, do minimal cleanup, and build & test model(s).
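
A rough sketch of the per-language steps (minimal cleanup, build model, test), assuming the language identification module uses rank-ordered character n-gram profiles in the style of Cavnar & Trenkle. That is an assumption, not something stated in this task. The function names, the `max_n` and `top_k` parameters, and the tiny inline texts are all hypothetical stand-ins for real Wikipedia extracts.

```python
from collections import Counter
import re


def clean(text):
    """Minimal cleanup: lowercase, drop digits and punctuation, squeeze whitespace."""
    text = re.sub(r"[\d_]+|[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def build_profile(text, max_n=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    for word in clean(text).split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(sample_profile, model_profile, penalty):
    """Sum of rank differences; n-grams missing from the model cost a fixed penalty."""
    return sum(
        abs(rank - model_profile[gram]) if gram in model_profile else penalty
        for gram, rank in sample_profile.items()
    )


def identify(text, models, max_n=3, top_k=300):
    """Return the language whose model profile is closest to the sample's profile."""
    sample = build_profile(text, max_n, top_k)
    return min(models, key=lambda lang: out_of_place(sample, models[lang], top_k))


# Hypothetical stand-ins for cleaned Wikipedia text in each target language.
corpora = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
}
models = {lang: build_profile(text) for lang, text in corpora.items()}
```

Testing a model would then mean running `identify` over the labelled query samples from the earlier evaluations and comparing the results against the query-based models.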