Improve Language Identification Training Data via Application of Language Models to the Training Data
Closed, Declined · Public

Description

Improve training data via application of language models to the training data. For example, run all available high-precision language models except French over the French training data. Group the results by identified language and sort by score (for TextCat, smaller is better). Manually review the results and mark for deletion those queries that are not French. This should be much faster (though less exhaustive) than T121544, because most of the best-scoring queries will in fact not be French. Review can stop, say, when fewer than half of the queries being reviewed are non-French. This should remove the most distinctively English (or German, or whatever) queries from the French training set. Repeat on other languages, retrain all the language models, and if there is useful improvement, repeat the whole process. (Depends on T121539 for a reasonable evaluation set, using the current best language identification module.)
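The filtering pass described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual tooling: the `Scorer` type, `rank_suspects`, `review`, and the `batch_size` parameter are all hypothetical names, each model is assumed to be a plain function returning a cost where smaller is better (as with TextCat), each query is assumed to be grouped under its single best-scoring non-target language, and the manual review step is stood in for by an `is_foreign` callback.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer type: takes a query and returns a cost, where smaller
# means a better match (as with TextCat). Real models would be plugged in here.
Scorer = Callable[[str], float]
Ranked = List[Tuple[float, str]]


def rank_suspects(queries: List[str], models: Dict[str, Scorer],
                  target: str) -> Dict[str, Ranked]:
    """Score each query with every model except the target language's,
    group it under its best-scoring other language, and sort each group
    so the most confidently "foreign" queries come first."""
    others = {lang: m for lang, m in models.items() if lang != target}
    by_lang: Dict[str, Ranked] = defaultdict(list)
    for query in queries:
        # Assumption: a query belongs to the non-target model with the lowest cost.
        score, lang = min((model(query), lang) for lang, model in others.items())
        by_lang[lang].append((score, query))
    return {lang: sorted(items) for lang, items in by_lang.items()}


def review(ranked: Ranked, is_foreign: Callable[[str], bool],
           batch_size: int = 20) -> List[str]:
    """Walk one sorted group in batches, collecting queries the reviewer
    marks as not in the target language. Stop after the first batch in
    which fewer than half of the queries are marked."""
    to_delete: List[str] = []
    for start in range(0, len(ranked), batch_size):
        batch = [q for _, q in ranked[start:start + batch_size]]
        flagged = [q for q in batch if is_foreign(q)]
        to_delete.extend(flagged)
        if 2 * len(flagged) < len(batch):
            break
    return to_delete
```

In practice `is_foreign` would be a person reading each batch, and the returned list would be removed from the training set for the target language before retraining. The batch size of 20 is arbitrary; a smaller batch stops sooner but is noisier.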

Need list of target languages to work on (based on T121539).

Estimate: 2-3 days to work out and mostly automate a process on the first language tested.

Very rough preliminary estimate: 4-8 hours per language after that to filter queries, and to build and evaluate models. It might be better to work on at least 3-4 important, low-performing languages at once, since improvements to any one can improve results for all (e.g., Romanian stops stealing English queries).

TJones created this task. Dec 15 2015, 5:52 PM
TJones updated the task description.
TJones raised the priority of this task to Needs Triage.
TJones added a project: CirrusSearch.
TJones added a subscriber: TJones.
Restricted Application added a project: Discovery. Dec 15 2015, 5:52 PM
Restricted Application added subscribers: StudiesWorld, Aklapper.
Deskana triaged this task as Normal priority.
Deskana added a subscriber: Deskana.
ksmith moved this task from On Sprint Board to Search on the Discovery board. Feb 16 2016, 11:24 PM
Restricted Application added a project: Discovery-Search. May 11 2016, 10:40 PM
debt closed this task as Declined. Aug 4 2016, 7:01 PM
debt added a subscriber: debt.

From a conversation with @TJones:

This was another method I came up with to improve the quality of training data. I think we have decent models for the “big” languages now, and wikitext models are easier to generate, and the extra few percentage points of accuracy aren’t worth the work for no-effort wikipedias. And the basic technique is available and I use it already. It doesn’t need to be formalized like this.