Improve the training data by applying language models to it. For example, run all available high-precision language models except French on the French training data. Group the results by language and sort by score (for TextCat, smaller is better). Manually review the results and mark for deletion those that are not French. This should be much faster (though less exhaustive) than T121544, because most of the best-scoring queries will in fact not be French. Review can stop, say, when the incidence of non-French queries drops below half. This should remove the most distinctively English (or German, or whatever) queries from the French training set. Repeat on other languages, retrain all the language models, and if there is useful improvement, repeat the whole process. (depends on T121539 for a reasonable evaluation set, using current best language identification module)
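The grouping-and-sorting step above could be sketched as follows. The scorers here are hypothetical stand-ins for real TextCat models (names and signatures are assumptions, not actual TextCat APIs); only the review-queue logic reflects the proposed process.

```python
# Sketch: build a manual-review queue for the French training set by scoring
# each query with every non-French model and sorting so the most confidently
# non-French queries (smallest TextCat-style score = best match) come first.

def build_review_queue(queries, scorers_by_lang, exclude_lang="fr"):
    """Return (score, lang, query) triples, best-scoring first.

    `scorers_by_lang` maps a language code to a scoring function; the
    `exclude_lang` model is skipped, per the process described above.
    """
    best = []
    for q in queries:
        candidates = [
            (scorer(q), lang)
            for lang, scorer in scorers_by_lang.items()
            if lang != exclude_lang
        ]
        score, lang = min(candidates)  # smaller score = stronger match
        best.append((score, lang, q))
    best.sort()  # most distinctively non-French queries first for review
    return best

# Toy stand-in scorers (NOT real models): pretend the English model
# scores ASCII-only queries low and everything else high.
scorers = {
    "en": lambda q: 10 if q.isascii() else 500,
    "de": lambda q: 400,
}
queue = build_review_queue(["the cat", "château fort"], scorers)
```

A reviewer would then walk the queue from the top, marking non-French entries for deletion, and stop once most of what remains looks genuinely French.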
Need list of target languages to work on (based on T121539).
Estimate: 2-3 days to work out and mostly automate a process on the first language tested.
Very rough preliminary estimate: 4-8 hours per language after that to filter queries and to build and evaluate models. It might be better to work on at least 3-4 important, low-performing languages at once, since improvements to any one can improve results for all (e.g., Romanian stops stealing English queries).