Create Balanced Language Identification Evaluation Set for Top N Wikis by Query Volume
Closed, Resolved (Public)

Description

We can do a better, more balanced assessment of the new language models (T118287) to decide which ones are genuinely poor (e.g., probably Igbo) and which ones are merely not appropriate for enwiki (e.g., hopefully French and German).

The obvious approach is to create a "fair" evaluation test set with equal numbers of examples for each language (say, 100 random queries per language, manually reviewed to make sure they are in the proper language), and to evaluate performance on that set.

Randomly sample ~1000 queries from a given wiki, randomize their order, and delete non-target-language queries (junk, names, DOIs, obvious bots, other languages) until there are 100 good queries. Repeat for N languages, sorted by query volume.
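
A minimal sketch of that sampling-and-vetting loop (the input file name and the `looks_like_target_language` stand-in for the manual review are hypothetical):

```python
import random

SAMPLE_SIZE = 1000  # initial random sample from the wiki's query logs
TARGET_GOOD = 100   # vetted queries to keep per language

def looks_like_target_language(query: str) -> bool:
    """Stand-in for the manual review step: in practice a human deletes
    junk, names, DOIs, obvious bot traffic, and queries in other languages."""
    return bool(query)  # placeholder decision; the real filter is manual

# Hypothetical input file: one query per line, extracted for a given wiki.
with open("frwiki_queries.txt", encoding="utf-8") as f:
    queries = [line.strip() for line in f if line.strip()]

sample = random.sample(queries, min(SAMPLE_SIZE, len(queries)))
random.shuffle(sample)  # randomize review order

good = []
for query in sample:
    if looks_like_target_language(query):
        good.append(query)
    if len(good) >= TARGET_GOOD:
        break
```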

From T118287, the next 20 languages by volume after English are Italian (though known to have many duplicates due to cross-wiki searches), German, Spanish, French, Russian, Japanese, Portuguese, Indonesian, Arabic, Chinese, Dutch, Polish, Czech, Turkish, Farsi, Korean, Swedish, Vietnamese, Ukrainian, and Hebrew. (Sorting by "filtered queries" from T118287 swaps Hebrew for Finnish and gives a slightly different order; the biggest change is Italian, which drops to 9th.)

The estimate is 1 hour for fresh data extraction and setup, plus 1–2 hours per language for most languages.


Event Timeline

TJones raised the priority of this task to High.
TJones updated the task description.
TJones subscribed.

I've completed the creation of a 21-language balanced corpus (200 queries each) of relatively clean queries for use in evaluating language identification models. The 21 languages were chosen based on query volume across wikis in those languages. I've also evaluated our current version of TextCat against this corpus, both restricted to the known 21 languages and using all 59 languages I have models for.
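
For illustration, scoring against such a balanced corpus can look like the sketch below, assuming the corpus is a dict mapping language code to its vetted queries and `classify` is a hypothetical wrapper around the identifier (not TextCat's actual interface); macro-averaging works here because every language contributes the same number of queries:

```python
def evaluate(corpus, classify, allowed_langs):
    """Score an identifier against a balanced corpus.

    corpus: dict of language code -> list of vetted queries
    classify: hypothetical wrapper, (text, allowed_langs) -> language code
    allowed_langs: candidate set (e.g., the known 21, or all 59 with models)
    """
    per_lang = {}
    for lang, queries in corpus.items():
        hits = sum(1 for q in queries if classify(q, allowed_langs) == lang)
        per_lang[lang] = hits / len(queries)
    # Macro-average: fair because each language has the same query count.
    overall = sum(per_lang.values()) / len(per_lang)
    return overall, per_lang
```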

The 21 languages have pretty good models, because there was plenty of query volume to build them on. The full set of 59 is a bit more dodgy, especially Igbo, which is known to have a lot of English in its training data.

Indonesian is the most unexpectedly poor performer of the bunch (most other poor performance is across language or script families, and so is expected).

The best model size among those tested (500 to 10K) was the full 10,000! However, performance at the 3,000-ngram model size (what we've been using for A/B tests) was only a few percentage points worse.
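
For context on what "model size" means: TextCat follows the Cavnar & Trenkle rank-order approach, where a language model is a ranked list of the most frequent character ngrams and the model size is how many of those ngrams are kept. A rough sketch of that scheme (illustrative, not the actual TextCat code):

```python
from collections import Counter

def ngrams(text, n_max=5):
    """All character 1..n_max-grams of each word, with boundary padding."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts

def build_model(training_text, size=3000):
    """Keep the `size` most frequent ngrams; `size` is the model size
    discussed above (500 to 10,000 in the experiments)."""
    ranked = [g for g, _ in ngrams(training_text).most_common(size)]
    return {g: rank for rank, g in enumerate(ranked)}

def out_of_place(text, model):
    """Sum of rank differences; ngrams missing from the model get the
    maximum penalty."""
    penalty = len(model)
    doc_ranked = [g for g, _ in ngrams(text).most_common(len(model))]
    return sum(abs(rank - model.get(g, penalty))
               for rank, g in enumerate(doc_ranked))

def classify(text, models):
    """Pick the language whose model is closest to the query."""
    return min(models, key=lambda lang: out_of_place(text, models[lang]))
```

The only difference between the 3,000- and 10,000-ngram variants compared above is the `size` cutoff when the ranked list is built.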

Full write-up with lots more details here:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Balanced_Language_Identification_Evaluation_Set_for_Queries

I'll commit models for the rest of these 21 languages after verifying that they won't mess up our A/B tests.

Change 275051 had a related patch set uploaded (by Tjones):
Add newly validated query-based language models to TextCat

https://gerrit.wikimedia.org/r/275051

Change 275051 merged by jenkins-bot:
Add newly validated query-based language models to TextCat

https://gerrit.wikimedia.org/r/275051

Deskana lowered the priority of this task from High to Medium.