Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis
Closed, Resolved · Public

Description

If we want to deploy language detection to maximum effect on wikis besides enwiki, we need to know which languages are most often used there (in poorly-performing queries), and limit language detection to "valuable" languages for a given wiki. E.g., on enwiki there aren't that many French queries, and many more queries are incorrectly identified as French than correctly identified, making French detection a net loss there. Obviously, we'd need French on frwiki. We can generally work this out to within a few percent with a sample of 500-1,000 queries.
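The net-loss test above can be sketched as a simple tally over a manually tagged sample. The tiny "detector" and data below are hypothetical stand-ins for illustration, not the actual tooling:

```python
from collections import Counter

def net_gain_by_language(tagged_queries, detect):
    """For each language, compare correct detections (true positives)
    against incorrect ones (false positives). A language whose detector
    produces more false positives than true positives is a net loss."""
    tp, fp = Counter(), Counter()
    for query, true_lang in tagged_queries:
        guess = detect(query)
        if guess is None:
            continue  # detector abstained
        if guess == true_lang:
            tp[guess] += 1
        else:
            fp[guess] += 1
    return {lang: tp[lang] - fp[lang] for lang in tp.keys() | fp.keys()}

# Hypothetical tagged sample and toy detector, for illustration only.
sample = [("le monde entier", "fr"), ("incredible stories", "en"),
          ("visible spectrum", "en"), ("la vie en rose", "fr")]
toy_detect = lambda q: "fr" if ("le" in q.split() or q.endswith("ible")) else "en"

gains = net_gain_by_language(sample, toy_detect)
# Enable a language's detector on this wiki only if it is a clear net win.
enabled = [lang for lang, gain in gains.items() if gain > 0]
```

With a real 500-1,000-query sample, a language like French on enwiki would show a negative gain and be left out of the mix.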

Work on the top N languages and determine the best mix of languages to use for each. Each evaluation set would be a set of 500+ poorly-performing queries from the given wiki, manually tagged by language. Tagging takes half a day to a day if you are familiar with the wiki's main language, and up to 2 days if not; evaluation on a given set of language models takes a couple of hours at most. (This depends on T121539, to make sure we aren't wasting time on a main language that does not perform well.)

Based on the search metrics dashboard the top 12 languages by volume* are English, German, Spanish, Portuguese, Russian, French, Italian, Japanese, Polish, Arabic, Chinese, and Dutch—so I'm re-aligning the remaining work to match this list.

[* For now, N = 12 and that accounts for just over 90% of search volume.]

The estimate is roughly two days per wiki to generate an evaluation set, evaluate it against our current best language identification tools, and select the right mix of languages for that tool set.

Done:

  • Italian, German, Spanish, and French (T132466)
  • English (the older enwiki corpus we've been using is very different and should be re-done so it is more comparable) (T138315)
  • Russian, Japanese, Portuguese (also T138315)
  • Dutch (T142140)

To Do (mostly in sets of 4, which works out to about 2 weeks in calendar time):

  • Polish, Arabic, Chinese

Related Objects

Status    | Assigned
----------|--------------
Open      | None
Resolved  | EBernhardson
Resolved  | Smalyshev
Resolved  | TJones
Resolved  | TJones
Resolved  | dpatrick
Resolved  | EBernhardson
Declined  | None
Declined  | None
Resolved  | TJones
Declined  | None
Resolved  | debt
Resolved  | TJones
Resolved  | TJones
Resolved  | TJones
Resolved  | TJones
Resolved  | TJones
Resolved  | debt
Open      | Anikethfoss
Resolved  | TJones
Resolved  | debt
TJones created this task. Dec 15 2015, 5:47 PM
TJones updated the task description.
TJones raised the priority of this task to Needs Triage.
TJones added a project: CirrusSearch.
TJones added a subscriber: TJones.
Restricted Application added a project: Discovery. Dec 15 2015, 5:47 PM
Restricted Application added subscribers: StudiesWorld, revi, Josve05a, Aklapper.
Deskana triaged this task as Normal priority.
Deskana added a subscriber: Deskana.
ksmith moved this task from On Sprint Board to Search on the Discovery board. Feb 16 2016, 11:24 PM
Restricted Application added a project: Discovery-Search. Apr 12 2016, 3:49 PM
TJones updated the task description. Apr 13 2016, 6:56 PM
Restricted Application added a subscriber: Base. Apr 13 2016, 6:56 PM
TJones updated the task description. Apr 26 2016, 6:40 PM
TJones updated the task description. Jun 21 2016, 3:25 PM
TJones updated the task description. Jun 21 2016, 3:33 PM

Doing all of these at once is a huge task, so I've peeled off another set of 4 + redoing English with the new process. (T138315)

The original English corpus was not limited to fulltext queries (and other less bot-like sources); it included lots of bot traffic, such as DOI queries and the mobile apps' UNIX-timestamp queries, and it was limited to zero-results queries (rather than the new standard of fewer than 3 results).

debt added a subscriber: debt. Jul 27 2016, 7:44 PM

Let's chat about this ticket later on this week or next after T138315 is done.

debt added a project: Epic. Aug 4 2016, 6:23 PM
debt added a comment. Aug 4 2016, 7:04 PM

From a conversation with @TJones:

We’ve been doing this in chunks (usually 4 at a time). We need to decide the high-effort cut-off (i.e., determine N) and do the rest of those (and we should probably re-order the top ones based on the dashboard).

We will pursue this one after determining the “high-effort” value of N (a higher priority).

TJones updated the task description. Aug 4 2016, 8:52 PM
TJones added a comment. Aug 4 2016, 9:04 PM

I've updated the description based on our evaluation of what N should be. The top 12 cover 90% of query volume. We'll work on the top-25 (>96%), but not with the full lang eval set here.

If such deployments just require translation, shouldn't we also enable on Asturian, Galician, Hebrew, Macedonian, Ukrainian, and Vietnamese, as their ext-wikimediainterwikisearchresults translations are 100% done?

Unfortunately, it's not that easy! There's a labor-intensive manual process behind selecting the languages to be enabled on each wiki, which includes extracting data, manually tagging it by language (that's the hard part, and what this ticket covers), and optimizing the set of languages to be considered on a given wiki based on that data and the performance of the language detection. My write-ups for the last batch (T138315) are here.

The language identification is far from perfect—and it's made more difficult by the fact that many queries are just a couple of words—so we can't just turn on everything and hope for the best. So, we ignore languages that don't show up in a sample of 1000+ poorly-performing queries (or for easier-to-detect languages with a distinctive script, in a sample of 10,000 queries).

We also disable detectors for languages that are present but for which performance is poor, which is related to sensitivity, specificity, and the relative ratios of the languages that get confused. For example, on English Wikipedia we get a very small number of queries in French. However, the French detector gets a lot of false positives on short English queries with words of French origin (e.g., English words ending in -able or -ible are often identified as French if there aren't other words in the query). I'm hoping to improve accuracy in T140289, and if I do, I'll be able to quickly re-evaluate the language sets that have been deployed and enable more languages—but still not all of them.
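The sensitivity/specificity bookkeeping described above can be sketched roughly like this; the threshold, data, and function names are illustrative assumptions, not the actual pipeline:

```python
def detector_report(pairs, min_precision=0.8):
    """Given (true_lang, predicted_lang) pairs from a manually tagged
    eval set, compute per-language recall (sensitivity) and precision,
    and flag detectors whose precision falls below a chosen threshold."""
    langs = {t for t, _ in pairs} | {p for _, p in pairs}
    report = {}
    for lang in langs:
        tp = sum(1 for t, p in pairs if t == lang and p == lang)
        fp = sum(1 for t, p in pairs if t != lang and p == lang)
        fn = sum(1 for t, p in pairs if t == lang and p != lang)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[lang] = {"precision": precision, "recall": recall,
                        "enable": precision >= min_precision}
    return report

# Illustrative enwiki-like mix: French is rare and picks up many false
# positives from short English queries, so its precision is low and the
# French detector would be disabled on this wiki.
pairs = [("en", "en")] * 90 + [("en", "fr")] * 6 + [("fr", "fr")] * 4
report = detector_report(pairs)
```

Here French catches every real French query (recall 1.0) but only 40% of its detections are correct, so enabling it would do more harm than good on this wiki.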

TJones claimed this task.
TJones moved this task from Needs triage to This Quarter on the Discovery-Search board.
TJones moved this task from This Quarter to Up Next on the Discovery-Search board.
TJones changed the status of subtask T142140: Lang ID Eval Set for Dutch from Open to Stalled. Nov 15 2016, 6:34 PM
TJones updated the task description. Feb 8 2017, 4:17 PM
TJones updated the task description.
TJones removed TJones as the assignee of this task. Feb 8 2017, 4:27 PM
debt closed this task as Resolved. Apr 5 2017, 2:51 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.
debt claimed this task.

Closing this ticket out - we've done as much as we can for now. We'd still like to do Polish, Arabic, and Chinese eventually.