Create Balanced Language Identification Evaluation Set for Top N Wikis by Query Volume
Closed, Resolved, Public

Assigned To

Authored By

	TJones
	Dec 15 2015, 5:44 PM

Description

We can do a better, more balanced assessment of the new language models (T118287) to decide which ones are really not good (e.g., probably Igbo) and which ones are just not appropriate for enwiki (e.g., hopefully French and German).

The obvious approach is to create a "fair" evaluation test set with equal numbers of examples for each language (say, 100 random queries for each language, manually reviewed to make sure they are in the proper language), and then evaluate performance on that set.

Randomly sample ~1000 queries from a given wiki, randomize their order, and delete non-target language queries (junk, names, DOI, obvious bots, other languages) until there are 100 good queries. Repeat for N languages, sorted by query volume.
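
The sampling step above could be sketched roughly as follows. This is a minimal illustration, not the script actually used for this task: the function name `build_eval_sample` and the `is_target_language` callback are hypothetical, and the callback stands in for the manual review described above.

```python
import random

def build_eval_sample(queries, is_target_language, sample_size=1000, keep=100, seed=0):
    """Sample queries, randomize their order, and keep the first `keep` good ones.

    `is_target_language` is a hypothetical stand-in for the human reviewer who
    rejects junk, names, DOIs, obvious bots, and other-language queries.
    """
    rng = random.Random(seed)
    # Draw the initial random sample without replacement.
    pool = rng.sample(queries, min(sample_size, len(queries)))
    # Randomize the order so review is not biased by the original ordering.
    rng.shuffle(pool)
    kept = []
    for query in pool:
        if is_target_language(query):
            kept.append(query)
            if len(kept) == keep:
                break  # stop reviewing once there are enough good queries
    return kept
```

If fewer than `keep` queries survive review, the function returns what it has, which would signal that a larger initial sample is needed for that wiki.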

From T118287, the next 20 languages by volume after English are Italian (though known to have many duplicates due to cross-wiki searches), German, Spanish, French, Russian, Japanese, Portuguese, Indonesian, Arabic, Chinese, Dutch, Polish, Czech, Turkish, Farsi, Korean, Swedish, Vietnamese, Ukrainian, and Hebrew. (Sorting by "filtered queries" from T118287 drops Hebrew for Finnish and gives a slightly different order; the exception is Italian, which drops to 9th.)

Estimate is 1 hour for fresh data extraction and setup, plus 1-2 hours per language for most languages.

Details

	Subject	Repo	Branch	Lines +/-
	Add newly validated query-based language models to TextCat	wikimedia/textcat	master	+110 K -0


Related Objects

Status	Assigned	Task
Open	None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
Resolved	EBernhardson	T121543 Do an A/B Tests on Other Wikis with TextCat for Language Identification
Resolved	Smalyshev	T121538 Convert TextCat to PHP Library for Language Identification in Cirrus Search
Resolved	TJones	T123537 Generate wikitext-based and query-based language models for TextCat
Resolved	TJones	T123651 Decide which set of separators we have to use for TextCat ngrams
Resolved	• dpatrick	T123558 Security review for TextCat library
Resolved	EBernhardson	T137163 Part Deux: TextCat A/B test for Language Identification - specification
Declined	None	T121544 Create Manually "Curated" Training Sets for Top N Languages for Language Identification
Declined	None	T121546 Experiment with Equalizing Training Set Sizes for Language Identification
Resolved	TJones	T121545 Wikipedia-Text–Based Language Models for Language Identification
Declined	None	T121547 Improve Language Identification Training Data via Application of Language Models to the Training Data
Resolved	debt	T121541 Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis
Resolved	TJones	T121539 Create Balanced Language Identification Evaluation Set for Top N Wikis by Query Volume
Resolved	TJones	T132466 Lang ID Eval Sets for Italian, German, Spanish, and French
Resolved	TJones	T134431 Re-Optimize Italian, German, Spanish, and French TextCat Languages by Recall
Resolved	TJones	T138315 Lang ID Eval Sets for English, Russian, Japanese, Portuguese
Resolved	TJones	T142413 Deploy recommended languages for Russian, Japanese, and Portuguese
Resolved	debt	T143355 request translations for 'showing results from'
Resolved	Anikethfoss	T145926 [[MediaWiki:Search-interwiki-results-acewiki/fi]] typo: "Acehnese" instead of "Achinese"
Resolved	TJones	T142140 Lang ID Eval Set for Dutch
Resolved	debt	T143354 ask for translations for 'showing results from' (Polish, Dutch, Arabic and Chinese)

Event Timeline

TJones created this task. Dec 15 2015, 5:44 PM

TJones raised the priority of this task from to High.

TJones updated the task description.

TJones added projects: Discovery-ARCHIVED, CirrusSearch.

TJones subscribed.

Restricted Application added subscribers: revi, Josve05a, Aklapper. Dec 15 2015, 5:44 PM

TJones mentioned this in T121541: Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis. Dec 15 2015, 5:47 PM

TJones added a parent task: T121541: Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis.

TJones mentioned this in T121544: Create Manually "Curated" Training Sets for Top N Languages for Language Identification. Dec 15 2015, 5:53 PM

TJones added a parent task: T121544: Create Manually "Curated" Training Sets for Top N Languages for Language Identification.

TJones mentioned this in T121545: Wikipedia-Text–Based Language Models for Language Identification.

TJones added a parent task: T121545: Wikipedia-Text–Based Language Models for Language Identification.

TJones mentioned this in T121546: Experiment with Equalizing Training Set Sizes for Language Identification.

TJones added a parent task: T121546: Experiment with Equalizing Training Set Sizes for Language Identification.

TJones mentioned this in T121547: Improve Language Identification Training Data via Application of Language Models to the Training Data. Dec 15 2015, 5:55 PM

TJones added a parent task: T121547: Improve Language Identification Training Data via Application of Language Models to the Training Data.

TJones added a parent task: T118278: [EPIC] Improve Language Identification for use in Cirrus Search.

• Deskana added a project: Discovery-Search (Current work). Dec 22 2015, 6:22 PM

• Deskana set Security to None.

• Deskana moved this task from Inbox to Multilingual and cross-project on the CirrusSearch board. Dec 31 2015, 12:28 AM

TJones mentioned this in T118278: [EPIC] Improve Language Identification for use in Cirrus Search. Jan 6 2016, 11:37 PM

• Deskana moved this task from Needs triage to On Sprint Board on the Discovery-ARCHIVED board. Jan 14 2016, 5:41 PM

TJones claimed this task. Jan 26 2016, 12:03 AM

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

I've completed the creation of a 21-language balanced (i.e., 200 each) corpus of relatively clean queries for use in evaluating language identification models. The 21 languages were chosen based on query volume across wikis in those languages. I've also evaluated our current version of TextCat against this corpus, using both the known 21 languages and all 59 languages I have models for.
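
For illustration, per-language scoring on a balanced corpus like this could look roughly like the sketch below. It is not the evaluation code used here; the function name `per_language_scores` and the input format (pairs of gold and predicted language codes) are assumptions made for the example.

```python
from collections import Counter

def per_language_scores(pairs):
    """Compute (precision, recall) per language from (gold, predicted) pairs.

    The pair format is an assumption for this sketch: `gold` is the manually
    verified language of a query and `predicted` is the identifier's output.
    """
    gold_counts = Counter()
    pred_counts = Counter()
    correct = Counter()
    for gold, predicted in pairs:
        gold_counts[gold] += 1
        pred_counts[predicted] += 1
        if gold == predicted:
            correct[gold] += 1
    scores = {}
    for lang, total in gold_counts.items():
        recall = correct[lang] / total
        # A language that is never predicted gets precision 0.0 by convention.
        precision = correct[lang] / pred_counts[lang] if pred_counts[lang] else 0.0
        scores[lang] = (precision, recall)
    return scores
```

Because every language has the same number of examples, the per-language numbers are directly comparable and a high-volume language cannot dominate the overall average.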

The 21 languages have pretty good models, because they had lots of query volume to be built on. The full set of 59 is a bit more dodgy, esp. Igbo, which is known to have a lot of English in the training data.

Indonesian is the most unexpectedly poor performing of the bunch (most other poor performance is across language or script families and so is expected).

The best model size among those tested (500 to 10K) was the full 10,000! However, performance at the 3,000 ngram model size (what we've been using for A/B tests) was only a few percentage points worse.
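
To make "model size" concrete: assuming TextCat-style models are frequency-ranked lists of character n-grams (as in the Cavnar and Trenkle approach TextCat is based on), a smaller model is simply a shorter prefix of the ranked list. The sketch below is a simplified illustration; `ngram_profile` and `out_of_place` are hypothetical names, not the wikimedia/textcat API, and details such as the word separators (see T123651) are glossed over.

```python
from collections import Counter

def ngram_profile(text, max_n=5, size=3000):
    """Return the `size` most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (an assumption of this sketch)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:size]

def out_of_place(query_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(model_profile)}
    penalty = len(model_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(query_profile)
    )
```

Under these assumptions, comparing model sizes amounts to re-running the evaluation with different `size` values and picking the language whose model gives the lowest `out_of_place` score for each query.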

Full write up with lots more details here:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Balanced_Language_Identification_Evaluation_Set_for_Queries

I'll commit models for the rest of these 21 languages after verifying that they won't mess up our A/B tests.

Change 275051 had a related patch set uploaded (by Tjones):
Add newly validated query-based language models to TextCat

https://gerrit.wikimedia.org/r/275051

gerritbot added a project: Patch-For-Review. Mar 4 2016, 9:02 PM

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Mar 4 2016, 9:04 PM

TJones mentioned this in T121542: Write and deploy an A/B Test on enwiki using TextCat for Language Identification. Mar 4 2016, 9:08 PM

Change 275051 merged by jenkins-bot:
Add newly validated query-based language models to TextCat

https://gerrit.wikimedia.org/r/275051

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Mar 7 2016, 4:35 PM

• Deskana closed this task as Resolved. May 11 2016, 10:40 PM

• Deskana lowered the priority of this task from High to Medium.

Restricted Application added a project: Discovery-Search. May 11 2016, 10:40 PM

• ksmith removed a project: Discovery-Search. Aug 25 2016, 8:39 PM
