
Experiment with Equalizing Training Set Sizes for Language Identification
Closed, Declined · Public


Experiment with equalizing training set sizes, since very small training sets may make for less accurate language models. That is, extract a lot more data for the wikis with smaller training sets so their language models are more fine-grained. These languages ended up with fewer than 20K queries to build their language models on: Armenian, Bosnian, Cantonese, Hindi, Latin, Latvian, Macedonian, Malayalam, Mongolian, Serbo-Croatian, Swahili, Tamil, Telugu, Urdu. Some need it more than others; languages with distinctive character sets already do well. (This could be tested with data from T121539 or T121541 using the current best language identification module.)

T121545 is probably easier and may create decent models for some languages. This and T121545 could obviate the need for T121544 for some languages.

Estimate: 4 hours per language to extract and clean data, and build & test models.

Related Objects

Event Timeline

TJones raised the priority of this task to Needs Triage.
TJones updated the task description.
TJones added a project: CirrusSearch.
TJones subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.
Deskana moved this task from Needs triage to Search on the Discovery-ARCHIVED board.
Deskana subscribed.
debt lowered the priority of this task from Medium to Lowest. · Aug 4 2016, 6:48 PM
debt moved this task from Up Next to search-icebox on the Discovery-Search board.
debt subscribed.

Based on a conversation with @TJones, moving this to the 'later' column to reconsider it in a quarter or two after we've done other work... and here's why:

Larger training sets make for higher-resolution frequency tables, which could give better detection accuracy. The training sets don't need to be fully equalized, but the smaller ones could be significantly enlarged.
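
To make the "resolution" point concrete, here's a rough sketch of how a character-trigram frequency table grows with more training queries. This is not the actual language identification module used here; the function names and the 3,000-trigram cutoff are illustrative:

```python
from collections import Counter

def char_trigrams(text):
    """Overlapping character trigrams, with a space pad on each side."""
    padded = f" {text.strip().lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_model(queries, top_n=3000):
    """Map each of the top_n most frequent trigrams to its frequency rank."""
    counts = Counter()
    for q in queries:
        counts.update(char_trigrams(q))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_n))}

# A model built from only a handful of queries captures few distinct
# trigrams, so its frequency ranks are coarse; more training data fills
# in more of the table (up to the top_n cutoff).
tiny = build_model(["the cat"])
bigger = build_model(["the cat", "a dog ran past", "birds fly south"])
print(len(tiny), len(bigger))
```

With under 20K queries, many real trigrams of the language never make it into the table at all, or get unstable ranks from a few chance occurrences.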

There are a few options:

  • We could run one or more tests to see whether increased model size improves accuracy, to estimate the impact it would have. We'd need to find a good test case where a smaller model doesn't perform well, but that shouldn't be too hard.
  • We could compare the accuracy of query-based models built on small training sets against wikitext models and see how much value the small models add. If they don't give a big advantage, we could drop them and use the wikitext models.
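
Either comparison could be run with a simple rank-order ("out-of-place") classifier in the style of Cavnar & Trenkle, sketched below. This is not the production module; the training strings, the `identify` helper, and the 3,000-rank cutoff are all illustrative:

```python
from collections import Counter

def trigram_ranks(text, top_n=3000):
    """Rank the top_n most frequent character trigrams of a text."""
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_n))}

def out_of_place(model, text, top_n=3000):
    """Sum of rank differences; trigrams unseen in the model get the max penalty."""
    return sum(abs(model.get(gram, top_n) - rank)
               for gram, rank in trigram_ranks(text, top_n).items())

def identify(text, models):
    """Pick the language whose model is closest to the text."""
    return min(models, key=lambda lang: out_of_place(models[lang], text))

# Toy models; a real experiment would train one model per language at
# several training-set sizes and compare held-out accuracy across sizes.
models = {
    "en": trigram_ranks("the quick brown fox jumps over the lazy dog and the end"),
    "de": trigram_ranks("der schnelle braune fuchs springt über den faulen hund und"),
}
print(identify("the dog and the fox", models))
```

The same harness works for the second option: swap in wikitext-trained models for the small query-trained ones and compare accuracy on the same held-out queries.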

Declining this ticket, as what we've done in T121541 seems to be working well enough. We can reopen this ticket if there is desire to do more for general language ID.