At least some of the training sets for languages used with TextCat in T118287 are pretty crappy because they have lots of other languages mixed in. Create larger, manually curated training sets (~20K entries) for languages whose training data is badly contaminated with English and other junk (e.g., Igbo). This could depend on, and be gated by, the results of T121545, T121546, and T121547; it could be tested by re-testing the data in T121539 or T121541 with the current best language identification module.
Note that the pool of people who can review this data is limited because it potentially contains PII. (Unfortunately!!)
From T118287, the next 20 languages by volume after English are Italian (though known to have many duplicates due to cross-wiki searches), German, Spanish, French, Russian, Japanese, Portuguese, Indonesian, Arabic, Chinese, Dutch, Polish, Czech, Turkish, Farsi, Korean, Swedish, Vietnamese, Ukrainian, and Hebrew. (Sorting by "filtered queries" from T118287 replaces Hebrew with Finnish and gives a slightly different order, most notably for Italian, which drops to 9th.)
Estimate: 2-4 days per language, for someone familiar with the language.
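Some of the manual effort might be reduced by a mechanical pre-sort before a reviewer sees the data. The following is only a rough sketch of that idea, not part of TextCat or of any existing tooling: the function names and the tiny English word list are hypothetical, and a real pass would need a much better contamination signal. It drops duplicate queries (relevant to the cross-wiki duplicates noted for Italian) and sets aside queries that look English so that a human can make the final call.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny list of English function words.
# Assumption: a real pass would use a larger, per-language-tuned list.
ENGLISH_HINTS = {"the", "of", "and", "with", "for"}


def normalize(query: str) -> str:
    """Apply NFC, lowercase, and collapse whitespace so near-identical queries dedupe."""
    text = unicodedata.normalize("NFC", query).lower()
    return re.sub(r"\s+", " ", text).strip()


def presort(queries):
    """Split queries into (candidates, flagged) lists for manual review.

    Exact duplicates (after normalization) and empty queries are dropped.
    Queries containing an English hint word are flagged, not discarded.
    """
    seen = set()
    candidates, flagged = [], []
    for raw in queries:
        q = normalize(raw)
        if not q or q in seen:
            continue
        seen.add(q)
        # Runs of Unicode letters only (no digits, underscores, punctuation).
        words = set(re.findall(r"[^\W\d_]+", q))
        if words & ENGLISH_HINTS:
            flagged.append(q)
        else:
            candidates.append(q)
    return candidates, flagged


if __name__ == "__main__":
    sample = ["Ndi Igbo", "ndi  igbo", "history of the world", "  ", "Chinua Achebe"]
    keep, review = presort(sample)
    print(keep)
    print(review)
```

Flagging rather than deleting keeps the curation manual, as the task intends; the script only orders the work. The PII restriction above applies to its input and output just as it does to the raw data.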