Improve Language Identification Training Data via Application of Language Models to the Training Data
Closed, Declined · Public

Description

Improve training data via application of language models to the training data. For example, use all available high-precision language models except French on the French training data. Group the results by language and sort by score (for TextCat, smaller is better). Manually review the results and mark for deletion those that are not French. This should be much faster (though less exhaustive) than T121544, because most of the best-scoring queries will in fact not be French. Review can stop, say, when the incidence of non-French queries drops below half. This should remove the most distinctively English (or German, or whatever) queries from the French training set. Repeat on other languages, retrain all the language models, and if there is useful improvement, repeat the whole process. (Depends on T121539 for a reasonable evaluation set, using the current best language identification module.)
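The scoring-and-review-queue step could be scripted along the lines of the rough Python sketch below. It assumes TextCat-style ranked character n-gram profiles as the language models and uses the standard out-of-place distance (smaller is better); the file formats, constants, and function names are illustrative assumptions, not the actual Discovery tooling.

```python
"""Sketch: build a manual-review queue of suspect queries in a training set.

Assumptions (not from the task itself): models are TextCat-style ranked
character n-gram profiles, one n-gram per line, most frequent first; all
paths and names below are hypothetical.
"""
from collections import Counter
from pathlib import Path

MAX_NGRAM = 5            # TextCat's usual n-gram lengths are 1..5
PROFILE_SIZE = 400       # keep the top-ranked n-grams
OUT_OF_PLACE_MAX = 400   # penalty for n-grams missing from a model profile


def ngrams(text, max_n=MAX_NGRAM):
    """Character n-grams of length 1..max_n, with underscore padding."""
    padded = "_" + text.replace(" ", "_") + "_"
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


def profile(text, size=PROFILE_SIZE):
    """Ranked n-gram profile of a text: n-gram -> rank (0 = most frequent)."""
    ranked = [g for g, _ in Counter(ngrams(text)).most_common(size)]
    return {g: rank for rank, g in enumerate(ranked)}


def load_model(path):
    """Load a ranked profile file (one n-gram per line, highest rank first).
    Trailing counts on each line, if present, are ignored."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    grams = [line.split()[0] for line in lines if line.strip()]
    return {g: rank for rank, g in enumerate(grams)}


def out_of_place(query_profile, model_profile):
    """TextCat out-of-place distance: sum of rank differences; smaller is better."""
    return sum(
        abs(rank - model_profile.get(g, OUT_OF_PLACE_MAX))
        for g, rank in query_profile.items()
    )


def review_queue(queries, models, target_lang):
    """Score every query in the target-language training data against every
    model *except* the target language; group by best-scoring language and
    sort by score so the most confidently "foreign" queries come first."""
    rows = []
    for q in queries:
        qp = profile(q)
        scores = {
            lang: out_of_place(qp, mp)
            for lang, mp in models.items()
            if lang != target_lang
        }
        best_lang = min(scores, key=scores.get)
        rows.append((best_lang, scores[best_lang], q))
    rows.sort(key=lambda r: (r[0], r[1]))
    return rows


if __name__ == "__main__":
    # Hypothetical usage: French training queries, one per line, plus a
    # directory of *.lm model files named by language code.
    models = {p.stem: load_model(p) for p in Path("models").glob("*.lm")}
    queries = Path("training/fr.txt").read_text(encoding="utf-8").splitlines()
    for lang, score, q in review_queue(queries, models, "fr"):
        print(f"{lang}\t{score}\t{q}")
```

A reviewer would then work down each language's block marking the queries that are not actually French, stopping once fewer than half of the remaining candidates turn out to be foreign, as described above.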

Need list of target languages to work on (based on T121539).

Estimate: 2-3 days to work out and mostly automate a process on the first language tested.

Very rough preliminary estimate: 4-8 hours per language after that to filter queries, and to build and evaluate models. It might be better to work on at least 3-4 important, low-performing languages at once, since improvements to any one can improve results for all (e.g., Romanian stops stealing English queries).

Event Timeline

TJones raised the priority of this task to Needs Triage.
TJones updated the task description.
TJones added a project: CirrusSearch.
TJones subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.
Deskana moved this task from Needs triage to On Sprint Board on the Discovery-ARCHIVED board.
Deskana subscribed.
debt subscribed.

From a conversation with @TJones:

This was another method I came up with to improve the quality of the training data. I think we have decent models for the "big" languages now, wikitext models are easier to generate, and the extra few percentage points of accuracy aren't worth the work for no-effort Wikipedias. The basic technique is available and I use it already; it doesn't need to be formalized like this.