Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis
Closed, Resolved · Public

Description

If we want to deploy language detection to maximum effect on wikis besides enwiki, we need to know which languages are most often used there (in poorly-performing queries), and limit language detection to "valuable" languages for a given wiki. E.g., on enwiki there aren't that many French queries, and many more queries are incorrectly identified as French than correctly identified, making French detection a net loss there. Obviously, we'd need French on frwiki. We can generally work this out to within a few percent with a sample of 500-1,000 queries.
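The net-loss test above can be sketched as a simple tally over a manually tagged sample. The tiny "detector" and data below are hypothetical stand-ins for illustration, not the actual tooling:

```python
from collections import Counter

def net_gain_by_language(tagged_queries, detect):
    """For each language, compare correct detections (true positives)
    against incorrect ones (false positives). A language whose detector
    produces more false positives than true positives is a net loss."""
    tp, fp = Counter(), Counter()
    for query, true_lang in tagged_queries:
        guess = detect(query)
        if guess is None:
            continue  # detector abstained
        if guess == true_lang:
            tp[guess] += 1
        else:
            fp[guess] += 1
    return {lang: tp[lang] - fp[lang] for lang in tp.keys() | fp.keys()}

# Hypothetical tagged sample and toy detector, for illustration only.
sample = [("le monde entier", "fr"), ("incredible stories", "en"),
          ("visible spectrum", "en"), ("la vie en rose", "fr")]
toy_detect = lambda q: "fr" if ("le" in q.split() or q.endswith("ible")) else "en"

gains = net_gain_by_language(sample, toy_detect)
# Enable a language's detector on this wiki only if it is a clear net win.
enabled = [lang for lang, gain in gains.items() if gain > 0]
```

With a real 500-1,000-query sample, a language like French on enwiki would show a negative gain and be left out of the mix.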

Work on the top N languages and determine the best mix of languages to use for each. Each evaluation set would be a set of 500+ poorly-performing queries from the given wiki, manually tagged by language. Tagging takes half a day to a day if you are familiar with the wiki's main language, and up to 2 days if not; evaluation on a given set of language models takes a couple of hours at most. (This depends on T121539, to make sure we aren't wasting time on a main language that does not perform well.)

Based on the search metrics dashboard the top 12 languages by volume* are English, German, Spanish, Portuguese, Russian, French, Italian, Japanese, Polish, Arabic, Chinese, and Dutch—so I'm re-aligning the remaining work to match this list.

[* For now, N = 12 and that accounts for just over 90% of search volume.]

The estimate is roughly two days per wiki to generate an evaluation set, evaluate it against our current best language identification tools, and select the right mix of languages for that tool set.

Done:

  • Italian, German, Spanish, and French (T132466)
  • English (the older enwiki corpus we've been using is very different and should be re-done so it is more comparable) (T138315)
  • Russian, Japanese, Portuguese (also T138315)
  • Dutch (T142140)

To Do (mostly in sets of 4, which works out to about 2 weeks in calendar time):

  • Polish, Arabic, Chinese

Related Objects

Status    | Assigned
----------|--------------
Open      | None
Resolved  | EBernhardson
Resolved  | Smalyshev
Resolved  | TJones
Resolved  | TJones
Resolved  | dpatrick
Resolved  | EBernhardson
Declined  | None
Declined  | None
Resolved  | TJones
Declined  | None
Resolved  | debt
Resolved  | TJones
Resolved  | TJones
Resolved  | TJones
Resolved  | TJones
Resolved  | TJones
Resolved  | debt
Open      | Anikethfoss
Resolved  | TJones
Resolved  | debt
TJones created this task. Dec 15 2015, 5:47 PM
TJones updated the task description.
TJones raised the priority of this task to Needs Triage.
TJones added a project: CirrusSearch.
TJones added a subscriber: TJones.
Restricted Application added a project: Discovery. Dec 15 2015, 5:47 PM
Restricted Application added subscribers: StudiesWorld, revi, Josve05a, Aklapper.
Deskana triaged this task as Normal priority.
Deskana added a subscriber: Deskana.
ksmith moved this task from On Sprint Board to Search on the Discovery board. Feb 16 2016, 11:24 PM
Restricted Application added a project: Discovery-Search. Apr 12 2016, 3:49 PM
TJones updated the task description. Apr 13 2016, 6:56 PM
Restricted Application added a subscriber: Base. Apr 13 2016, 6:56 PM
TJones updated the task description. Apr 26 2016, 6:40 PM
TJones updated the task description. Jun 21 2016, 3:25 PM
TJones updated the task description. Jun 21 2016, 3:33 PM

Doing all of these at once is a huge task, so I've peeled off another set of 4 + redoing English with the new process. (T138315)

The original English corpus was not limited to fulltext queries (and other less bot-like sources); it included lots of bot traffic, such as DOI queries and the mobile apps' UNIX-timestamp queries, and it was limited to zero-results queries (rather than the new standard of fewer than 3 results).

debt added a subscriber: debt. Jul 27 2016, 7:44 PM

Let's chat about this ticket later on this week or next after T138315 is done.

debt added a project: Epic. Aug 4 2016, 6:23 PM
debt added a comment. Aug 4 2016, 7:04 PM

From a conversation with @TJones:

We’ve been doing this in chunks (usually 4 at a time). We need to decide the high-effort cut-off (i.e., determine N) and do the rest of those (and we should probably re-order the top ones based on the dashboard).

We will pursue this one after determining the “high-effort” value of N (a higher priority).

TJones updated the task description. Aug 4 2016, 8:52 PM
TJones added a comment. Aug 4 2016, 9:04 PM

I've updated the description based on our evaluation of what N should be. The top 12 cover 90% of query volume. We'll work on the top-25 (>96%), but not with the full lang eval set here.

If such deployments just require translation, shouldn't we also enable on Asturian, Galician, Hebrew, Macedonian, Ukrainian, and Vietnamese, as their ext-wikimediainterwikisearchresults translations are 100% done?

Unfortunately, it's not that easy! There's a labor-intensive manual process behind selecting the languages to be enabled on each wiki, which includes extracting data, manually tagging it by language (that's the hard part, and what this ticket covers), and optimizing the set of languages to be considered on a given wiki based on that data and the performance of the language detection. My write-ups for the last batch (T138315) are here.

The language identification is far from perfect—and it's made more difficult by the fact that many queries are just a couple of words—so we can't just turn on everything and hope for the best. So, we ignore languages that don't show up in a sample of 1000+ poorly-performing queries (or for easier-to-detect languages with a distinctive script, in a sample of 10,000 queries).

We also disable detectors for languages that are present but for which performance is poor, which is related to sensitivity, specificity, and the relative ratios of the languages that get confused. For example, on English Wikipedia we get a very small number of queries in French. However, the French detector gets a lot of false positives on short English queries with words of French origin (e.g., English words ending in -able or -ible are often identified as French if there aren't other words in the query). I'm hoping to improve accuracy in T140289, and if I do, I'll be able to quickly re-evaluate the language sets that have been deployed and enable more languages—but still not all of them.
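The sensitivity/specificity bookkeeping described above can be sketched roughly like this; the threshold, data, and function names are illustrative assumptions, not the actual pipeline:

```python
def detector_report(pairs, min_precision=0.8):
    """Given (true_lang, predicted_lang) pairs from a manually tagged
    eval set, compute per-language recall (sensitivity) and precision,
    and flag detectors whose precision falls below a chosen threshold."""
    langs = {t for t, _ in pairs} | {p for _, p in pairs}
    report = {}
    for lang in langs:
        tp = sum(1 for t, p in pairs if t == lang and p == lang)
        fp = sum(1 for t, p in pairs if t != lang and p == lang)
        fn = sum(1 for t, p in pairs if t == lang and p != lang)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[lang] = {"precision": precision, "recall": recall,
                        "enable": precision >= min_precision}
    return report

# Illustrative enwiki-like mix: French is rare and picks up many false
# positives from short English queries, so its precision is low and the
# French detector would be disabled on this wiki.
pairs = [("en", "en")] * 90 + [("en", "fr")] * 6 + [("fr", "fr")] * 4
report = detector_report(pairs)
```

Here French catches every real French query (recall 1.0) but only 40% of its detections are correct, so enabling it would do more harm than good on this wiki.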

TJones claimed this task.
TJones moved this task from Needs triage to This Quarter on the Discovery-Search board.
TJones moved this task from This Quarter to Up Next on the Discovery-Search board.
TJones changed the status of subtask T142140: Lang ID Eval Set for Dutch from Open to Stalled. Nov 15 2016, 6:34 PM
TJones updated the task description. Feb 8 2017, 4:17 PM
TJones updated the task description.
TJones removed TJones as the assignee of this task. Feb 8 2017, 4:27 PM
debt closed this task as Resolved. Apr 5 2017, 2:51 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.
debt claimed this task.

Closing this ticket out - we've done as much as we can for now. We'd still like to do Polish, Arabic, and Chinese eventually.