Lang ID Eval Sets for English, Russian, Japanese, Portuguese
Closed, Resolved · Public

Assigned To
	TJones

Authored By
	TJones
	Jun 21 2016, 3:38 PM

Description

We can't work on all of the wikis at once, so this task continues down the list. See parent task T121541.

Dropping Indonesian because we're now working from a new volume-based list taken from the search metrics dashboard.

Related Objects

Status	Assigned	Task
Open	None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
Resolved	EBernhardson	T121543 Do an A/B Tests on Other Wikis with TextCat for Language Identification
Resolved	Smalyshev	T121538 Convert TextCat to PHP Library for Language Identification in Cirrus Search
Resolved	TJones	T123537 Generate wikitext-based and query-based language models for TextCat
Resolved	TJones	T123651 Decide which set of separators we have to use for TextCat ngrams
Resolved	dpatrick	T123558 Security review for TextCat library
Resolved	EBernhardson	T137163 Part Deux: TextCat A/B test for Language Identification - specification
Declined	None	T121544 Create Manually "Curated" Training Sets for Top N Languages for Language Identification
Declined	None	T121546 Experiment with Equalizing Training Set Sizes for Language Identification
Resolved	TJones	T121545 Wikipedia-Text–Based Language Models for Language Identification
Declined	None	T121547 Improve Language Identification Training Data via Application of Language Models to the Training Data
Resolved	debt	T121541 Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis
Resolved	TJones	T121539 Create Balanced Language Identification Evaluation Set for Top N Wikis by Query Volume
Resolved	TJones	T138315 Lang ID Eval Sets for English, Russian, Japanese, Portuguese
Resolved	TJones	T142413 Deploy recommended languages for Russian, Japanese, and Portuguese
Resolved	debt	T143355 request translations for 'showing results from'
Resolved	Anikethfoss	T145926 [[MediaWiki:Search-interwiki-results-acewiki/fi]] typo: "Acehnese" instead of "Achinese"

Event Timeline

TJones created this task. Jun 21 2016, 3:38 PM

Restricted Application added a project: Discovery-Search. Jun 21 2016, 3:38 PM

Restricted Application added a subscriber: Zppix.

TJones mentioned this in T121541: Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis. Jun 21 2016, 3:38 PM

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. Jun 21 2016, 3:42 PM

EBernhardson removed a project: Discovery-Search. Jun 21 2016, 10:11 PM

English is done, and it came out similar to the previous ZRR-based corpus (which also included API calls and had no anti-bot precautions).

Portuguese is done. Portuguese typos often look a lot like Spanish typos! Nonetheless, ptwiki's poor-performing queries are mostly in Portuguese (>90%), so accuracy is very high (>95%).

Russian is done. About 77% of poor-performing ruwiki queries are in Russian, with a sizable share in English (>10%), some in Ukrainian (<5%), and a moderately long tail of other languages. Overall accuracy is good (>90%), despite not having models for a fair number of languages in the long tail.

Japanese is done. It's mostly Japanese (big surprise!), with a dollop of English and a bit of Chinese. Unfortunately, the Chinese model produces too many false positives on Japanese queries, so we have to disable it. (Maybe that TextCat Confidence thing would help.)
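
For anyone wanting to reproduce this kind of check, here is a minimal sketch of how overall accuracy and per-language false positives could be tallied from a hand-tagged evaluation set. It is not the script used for this task: the function names and the toy data are hypothetical, and it assumes the eval set is available as (gold language, predicted language) pairs.

```python
from collections import Counter


def accuracy(pairs):
    """Fraction of queries whose predicted language matches the gold tag."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for gold, predicted in pairs if gold == predicted)
    return correct / len(pairs)


def false_positives_by_language(pairs):
    """Count, per predicted language, how often that prediction was wrong."""
    counts = Counter()
    for gold, predicted in pairs:
        if predicted != gold:
            counts[predicted] += 1
    return counts


# Hypothetical toy data, NOT taken from the real eval sets:
# (gold language, predicted language) for six queries.
sample = [
    ("ja", "ja"),
    ("ja", "zh"),  # Japanese query misidentified as Chinese
    ("ja", "zh"),  # Japanese query misidentified as Chinese
    ("en", "en"),
    ("zh", "zh"),
    ("ja", "ja"),
]

print(accuracy(sample))
print(false_positives_by_language(sample).most_common())
```

A language whose false-positive count outweighs its correct identifications on a given wiki is a candidate for being disabled there, which is the reasoning applied to Chinese on Japanese queries above.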

TJones renamed this task from Lang ID Eval Sets for English, Russian, Japanese, Portuguese, Indonesian to Lang ID Eval Sets for English, Russian, Japanese, Portuguese. Aug 4 2016, 8:53 PM

TJones updated the task description.

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board.

TJones created subtask T142413: Deploy recommended languages for Russian, Japanese, and Portuguese. Aug 8 2016, 5:38 PM

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Aug 18 2016, 9:19 PM

Liuxinyu970226 subscribed. Aug 19 2016, 12:19 AM

debt closed this task as Resolved. Aug 22 2016, 9:43 PM

debt closed subtask T142413: Deploy recommended languages for Russian, Japanese, and Portuguese as Resolved.

Liuxinyu970226 unsubscribed. Aug 22 2016, 10:44 PM