
Create Manually "Curated" Training Sets for Top N Languages for Language Identification
Closed, Declined · Public

Description

At least some of the training sets for languages used with TextCat in T118287 are pretty crappy because they have lots of other languages mixed in. Create larger manually "curated" training sets (~20K entries) for languages with really crappy training data (e.g., Igbo) that's contaminated with English and other junk. (This could depend on and be gated by the results of T121545, T121546, and T121547; it could be tested by re-running the data in T121539 or T121541 through the current best language identification module.)

T121545, T121546, and T121547 are potentially less expensive (though less exhaustive and less accurate) methods, and probably should be tried first.

Note that the set of people who can review this data is limited because it potentially contains PII. (Unfortunately!!)

From T118287, the next 20 languages by volume after English are Italian (though known to have many duplicates due to cross-wiki searches), German, Spanish, French, Russian, Japanese, Portuguese, Indonesian, Arabic, Chinese, Dutch, Polish, Czech, Turkish, Farsi, Korean, Swedish, Vietnamese, Ukrainian, and Hebrew. (Sorting by "filtered queries" from T118287 drops Hebrew for Finnish and gives a slightly different order, except for Italian, which drops to 9th.)

Estimate: 2-4 days per language, for someone familiar with the language.
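
For illustration only (not part of the original estimate), a cheap pre-filtering pass could cut down the manual review work by flagging likely-English queries before a reviewer sees them. The stopword list and file names below are hypothetical, and a real pass would more likely use an existing language ID model as the filter:

```python
# Sketch of a pre-filter to bootstrap manual curation: route queries
# containing common English stopwords to a "flagged" file so a reviewer
# confirms them instead of hunting for them. Heuristic and file names
# are hypothetical.
ENGLISH_STOPWORDS = {"the", "of", "and", "in", "to", "is", "what", "how"}

def looks_english(query):
    """Crude heuristic: the query contains a common English stopword."""
    return any(tok in ENGLISH_STOPWORDS for tok in query.lower().split())

with open("igbo_queries_raw.txt", encoding="utf-8") as raw, \
     open("igbo_queries_keep.txt", "w", encoding="utf-8") as keep, \
     open("igbo_queries_flagged.txt", "w", encoding="utf-8") as flagged:
    for line in raw:
        (flagged if looks_english(line) else keep).write(line)
```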


Event Timeline

TJones set the priority of this task to Needs Triage.
TJones updated the task description.
TJones added a project: CirrusSearch.
TJones subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.
TJones updated the task description.
TJones set Security to None.
Deskana raised the priority of this task from Low to Medium. · Jan 27 2016, 11:09 PM
Deskana moved this task from Needs triage to Search on the Discovery-ARCHIVED board.
Deskana subscribed.

I'm not sure whether we should pursue this task right now. When I created this big maze of related language ID tasks, I was brainstorming about what was possible and how things relate to each other. Not every linked task is necessary, but there may be a preferred order of operations among them.

This is a lot of work per language: half a week to a week if you are familiar with the language, maybe double that if not, and I was only guessing that 20K examples would be enough to create a decent model.

Now that I have more experience with the different groups of languages people search in on different wikis, I think maybe this isn't worth doing for the really ugly query corpora (e.g., those where almost half the queries are in English but the wiki isn't an English one).

Query-based models do get a few more percentage points of accuracy than wiki-text based models (e.g., no one will type pq for porque in an article on eswiki or ptwiki, but users do it all the time in queries, so models accounting for that will perform better). The biggest improvement, though, is from restricting the language set applied to a corpus (i.e., the language evaluations I'm working on for the bigger wikis).
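
To make the "restricting the language set" point concrete, here is a minimal sketch of a TextCat-style classifier (rank-order "out-of-place" distance over character n-grams); the profile size and function names are illustrative, not the production TextCat code:

```python
from collections import Counter

PROFILE_SIZE = 400  # top-ranked n-grams kept per model

def ngrams(text, n_max=5):
    """Count character n-grams (n = 1..n_max), padding word boundaries."""
    grams = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                grams[padded[i:i + n]] += 1
    return grams

def profile(text):
    """Map each of the most frequent n-grams to its rank."""
    ranked = ngrams(text).most_common(PROFILE_SIZE)
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language
    model get the maximum penalty."""
    return sum(abs(rank - lang_profile.get(gram, PROFILE_SIZE))
               for gram, rank in doc_profile.items())

def identify(query, models, allowed=None):
    """Pick the closest language model, optionally from a restricted set."""
    candidates = {lang: prof for lang, prof in models.items()
                  if allowed is None or lang in allowed}
    doc = profile(query)
    return min(candidates, key=lambda lang: out_of_place(doc, candidates[lang]))
```

Here `models` would map language codes to `profile(training_text)` built from (ideally query-based) training data, and passing, say, `allowed={"es", "pt", "en"}` for eswiki keeps implausible languages from winning on short, ambiguous queries.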

Maybe we should review all these tasks again and think about what's worth doing and what's not, where the biggest bang for the buck is, whether to focus effort on bigger wikis with more users or on smaller wikis in the long tail, and whether to explore less effortful or more automated options.

debt lowered the priority of this task from Medium to Low. · Aug 2 2016, 6:42 PM
debt edited projects, added Discovery-Search; removed Discovery-Search (Current work).
debt subscribed.

Moving this to the backlog board; we're doing a lot of this work in the individual tickets listed in the description.

From a conversation with @TJones:

Query data makes better models for identifying queries than “normal” text data. Sometimes it’s easy to gather, other times not so much (usually because it’s hard to strip out most of the wrong-language queries). The idea here was to put more work into creating a query-based corpus. I don’t think it’s worth it for Wikipedias down the long tail. They don’t have the same level of usage, and it’s more work than I originally thought.