Improve Language Identification Training Data via Application of Language Models to the Training Data
Closed, Declined · Public

Description

Improve training data via application of language models to the training data. For example, use all available high-precision language models except French on the French training data. Group the results by language and sort by score (for TextCat, smaller is better). Manually review the results and mark for deletion those that are not French. This should be much faster (though less exhaustive) than T121544, because most of the best-scoring queries will in fact not be French. Review can stop, say, when the incidence of non-French queries drops below half. This should remove the most distinctively English (or German, or whatever) queries from the French training set. Repeat on other languages, retrain all the language models, and if there is useful improvement, repeat the whole process. (Depends on T121539 for a reasonable evaluation set, using the current best language identification module.)
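The scoring-and-review-queue step could be scripted along the lines of the rough Python sketch below. It assumes TextCat-style ranked character n-gram profiles as the language models and uses the standard out-of-place distance (smaller is better); the file formats, constants, and function names are illustrative assumptions, not the actual Discovery tooling.

```python
"""Sketch: build a manual-review queue of suspect queries in a training set.

Assumptions (not from the task itself): models are TextCat-style ranked
character n-gram profiles, one n-gram per line, most frequent first; all
paths and names below are hypothetical.
"""
from collections import Counter
from pathlib import Path

MAX_NGRAM = 5            # TextCat's usual n-gram lengths are 1..5
PROFILE_SIZE = 400       # keep the top-ranked n-grams
OUT_OF_PLACE_MAX = 400   # penalty for n-grams missing from a model profile


def ngrams(text, max_n=MAX_NGRAM):
    """Character n-grams of length 1..max_n, with underscore padding."""
    padded = "_" + text.replace(" ", "_") + "_"
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


def profile(text, size=PROFILE_SIZE):
    """Ranked n-gram profile of a text: n-gram -> rank (0 = most frequent)."""
    ranked = [g for g, _ in Counter(ngrams(text)).most_common(size)]
    return {g: rank for rank, g in enumerate(ranked)}


def load_model(path):
    """Load a ranked profile file (one n-gram per line, highest rank first).
    Trailing counts on each line, if present, are ignored."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    grams = [line.split()[0] for line in lines if line.strip()]
    return {g: rank for rank, g in enumerate(grams)}


def out_of_place(query_profile, model_profile):
    """TextCat out-of-place distance: sum of rank differences; smaller is better."""
    return sum(
        abs(rank - model_profile.get(g, OUT_OF_PLACE_MAX))
        for g, rank in query_profile.items()
    )


def review_queue(queries, models, target_lang):
    """Score every query in the target-language training data against every
    model *except* the target language; group by best-scoring language and
    sort by score so the most confidently "foreign" queries come first."""
    rows = []
    for q in queries:
        qp = profile(q)
        scores = {
            lang: out_of_place(qp, mp)
            for lang, mp in models.items()
            if lang != target_lang
        }
        best_lang = min(scores, key=scores.get)
        rows.append((best_lang, scores[best_lang], q))
    rows.sort(key=lambda r: (r[0], r[1]))
    return rows


if __name__ == "__main__":
    # Hypothetical usage: French training queries, one per line, plus a
    # directory of *.lm model files named by language code.
    models = {p.stem: load_model(p) for p in Path("models").glob("*.lm")}
    queries = Path("training/fr.txt").read_text(encoding="utf-8").splitlines()
    for lang, score, q in review_queue(queries, models, "fr"):
        print(f"{lang}\t{score}\t{q}")
```

A reviewer would then work down each language's block marking the queries that are not actually French, stopping once fewer than half of the remaining candidates turn out to be foreign, as described above.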

Need list of target languages to work on (based on T121539).

Estimate: 2-3 days to work out and mostly automate a process on the first language tested.

Very rough preliminary estimate: 4-8 hours per language after that to filter queries, and to build and evaluate models. It might be better to work on at least 3-4 important, low-performing languages at once, since improvements to any one can improve results for all (e.g., Romanian stops stealing English queries).

Event Timeline

TJones raised the priority of this task to Needs Triage.
TJones updated the task description.
TJones added a project: CirrusSearch.
TJones subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.
Deskana moved this task from Needs triage to On Sprint Board on the Discovery-ARCHIVED board.
Deskana subscribed.
debt subscribed.

From a conversation with @TJones:

This was another method I came up with to improve the quality of the training data. I think we have decent models for the "big" languages now, wikitext models are easier to generate, and the extra few percentage points of accuracy aren't worth the work for no-effort Wikipedias. The basic technique is available and I use it already; it doesn't need to be formalized like this.