
Experiment with Equalizing Training Set Sizes for Language Identification
Closed, Declined · Public


Experiment with equalizing training set sizes, since very small training sets may make for less accurate language models. That is, extract a lot more data for the wikis with smaller training sets so their language models are more fine-grained. These languages ended up with fewer than 20K queries to build their language models on: Armenian, Bosnian, Cantonese, Hindi, Latin, Latvian, Macedonian, Malayalam, Mongolian, Serbo-Croatian, Swahili, Tamil, Telugu, Urdu. Some need it more than others; languages with distinctive character sets already do well. (This could be tested with data from T121539 or T121541 using the current best language identification module.)

T121545 is probably easier and may create decent models for some languages. This and T121545 could obviate the need for T121544 for some languages.

Estimate: 4 hours per language to extract and clean data, and build & test models.

Related Objects

Event Timeline

TJones raised the priority of this task to Needs Triage.
TJones updated the task description.
TJones added a project: CirrusSearch.
TJones subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.
Deskana moved this task from Needs triage to Search on the Discovery-ARCHIVED board.
Deskana subscribed.
debt lowered the priority of this task from Medium to Lowest. · Aug 4 2016, 6:48 PM
debt moved this task from Up Next to search-icebox on the Discovery-Search board.
debt subscribed.

Based on a conversation with @TJones, moving this to the 'later' column to reconsider it in a quarter or two after we've done other work... and here's why:

Larger training sets make for higher-resolution frequency tables, which could give better detection accuracy. The training sets don't need to be fully equalized, but the smaller ones could be significantly enlarged.
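
To make the "resolution" point concrete, here's a rough sketch of how a character-trigram frequency table grows with more training queries. This is not the actual language identification module used here; the function names and the 3,000-trigram cutoff are illustrative:

```python
from collections import Counter

def char_trigrams(text):
    """Overlapping character trigrams, with a space pad on each side."""
    padded = f" {text.strip().lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_model(queries, top_n=3000):
    """Map each of the top_n most frequent trigrams to its frequency rank."""
    counts = Counter()
    for q in queries:
        counts.update(char_trigrams(q))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_n))}

# A model built from only a handful of queries captures few distinct
# trigrams, so its frequency ranks are coarse; more training data fills
# in more of the table (up to the top_n cutoff).
tiny = build_model(["the cat"])
bigger = build_model(["the cat", "a dog ran past", "birds fly south"])
print(len(tiny), len(bigger))
```

With under 20K queries, many real trigrams of the language never make it into the table at all, or get unstable ranks from a few chance occurrences.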

There are a few options:

  • We could run one or more tests to see whether increased model size improves accuracy, to estimate the impact it would have. We'd need to find a good test case where a smaller model doesn't perform well, but that shouldn't be too hard.
  • We could compare the accuracy of query-based models built on small training sets against wikitext models and see how much value the small models add. If they don't give a big advantage, we could drop them and use the wikitext models.
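
Either comparison could be run with a simple rank-order ("out-of-place") classifier in the style of Cavnar & Trenkle, sketched below. This is not the production module; the training strings, the `identify` helper, and the 3,000-rank cutoff are all illustrative:

```python
from collections import Counter

def trigram_ranks(text, top_n=3000):
    """Rank the top_n most frequent character trigrams of a text."""
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_n))}

def out_of_place(model, text, top_n=3000):
    """Sum of rank differences; trigrams unseen in the model get the max penalty."""
    return sum(abs(model.get(gram, top_n) - rank)
               for gram, rank in trigram_ranks(text, top_n).items())

def identify(text, models):
    """Pick the language whose model is closest to the text."""
    return min(models, key=lambda lang: out_of_place(models[lang], text))

# Toy models; a real experiment would train one model per language at
# several training-set sizes and compare held-out accuracy across sizes.
models = {
    "en": trigram_ranks("the quick brown fox jumps over the lazy dog and the end"),
    "de": trigram_ranks("der schnelle braune fuchs springt über den faulen hund und"),
}
print(identify("the dog and the fox", models))
```

The same harness works for the second option: swap in wikitext-trained models for the small query-trained ones and compare accuracy on the same held-out queries.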

Declining this ticket, as what we've done in T121541 seems to be working well enough. We can reopen this ticket if there is desire to do more for general language ID.