
Lang ID Eval Sets for English, Russian, Japanese, Portuguese
Closed, Resolved · Public

Description

Can't work on all of them at once, so continue down the list. See parent task T121541.

Dropping Indonesian because we're working from a new volume-based list from the search metrics dashboard.

Related Objects

Status      Assigned
Open        None
Resolved    EBernhardson
Resolved    Smalyshev
Resolved    TJones
Resolved    TJones
Resolved    dpatrick
Resolved    EBernhardson
Declined    None
Declined    None
Resolved    TJones
Declined    None
Resolved    debt
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    debt
Resolved    Anikethfoss

Event Timeline

Restricted Application added a subscriber: Zppix.

English is done, and it came out similar to the previous ZRR-based corpus (which also included API calls and had no anti-bot precautions).

Portuguese is done. Portuguese typos often look a lot like Spanish typos! Nonetheless, ptwiki's low-performing queries are mostly in Portuguese (>90%), so accuracy is very high (>95%).

Russian is done. About 77% of poor-performing ruwiki queries are in Russian, with a sizable amount in English (>10%) and Ukrainian (<5%), and a moderately long tail of other languages. Overall accuracy is good (>90%), despite not having models for a fair number of languages in the long tail.

Japanese is done. It's mostly Japanese (big surprise!), with a dollop of English, and a bit of Chinese. Unfortunately, the Chinese model produces too many false positives on Japanese queries, so we have to disable it. (Maybe that TextCat Confidence thing would help.)
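For anyone wanting to reproduce the "too many false positives, so disable it" reasoning, here is a minimal sketch of how per-language precision and overall accuracy could be computed from an eval set of (gold, predicted) language pairs. The function names, the toy data, and the assumption that queries fall back to Japanese once the Chinese model is disabled are all hypothetical and for illustration only; they are not the actual evaluation scripts or the numbers from this task.

```python
from collections import Counter


def accuracy(pairs):
    """Fraction of (gold, predicted) pairs where the prediction is correct."""
    pairs = list(pairs)
    return sum(gold == pred for gold, pred in pairs) / len(pairs)


def per_language_stats(pairs):
    """Per-language precision, recall, and false-positive counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in pairs:
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted this language, but it was wrong
            fn[gold] += 1  # missed the true language
    stats = {}
    for lang in set(tp) | set(fp) | set(fn):
        p_den = tp[lang] + fp[lang]
        r_den = tp[lang] + fn[lang]
        stats[lang] = {
            "precision": tp[lang] / p_den if p_den else 0.0,
            "recall": tp[lang] / r_den if r_den else 0.0,
            "false_positives": fp[lang],
        }
    return stats


# Toy data, invented for illustration (NOT the real jawiki eval set):
# 9 Japanese queries, 3 of them misidentified as Chinese, plus 1 Chinese query.
toy = [("ja", "ja")] * 6 + [("ja", "zh")] * 3 + [("zh", "zh")] * 1

# Assumption: with the zh model disabled, those queries are identified as ja.
toy_no_zh = [(gold, "ja" if pred == "zh" else pred) for gold, pred in toy]

with_zh = per_language_stats(toy)
print("zh precision with zh enabled:", with_zh["zh"]["precision"])
print("accuracy with zh enabled:   ", accuracy(toy))
print("accuracy with zh disabled:  ", accuracy(toy_no_zh))
```

In the toy data, losing the one genuine Chinese query costs less than the Japanese queries recovered, which is the kind of trade-off that motivates disabling a model outright.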

TJones renamed this task from Lang ID Eval Sets for English, Russian, Japanese, Portuguese, Indonesian to Lang ID Eval Sets for English, Russian, Japanese, Portuguese. · Aug 4 2016, 8:53 PM
TJones updated the task description.