
Create Manually "Curated" Training Sets for Top N Languages for Language Identification
Closed, Declined · Public

Description

At least some of the training sets for languages used with TextCat in T118287 are pretty crappy because they have lots of other languages mixed in. Create larger manually "curated" training sets (~20K entries) for languages with really crappy training data (e.g., Igbo) that's contaminated with English and other junk. (This could depend on and be gated by the results of T121545, T121546, and T121547; it could be tested by re-running the data in T121539 or T121541 through the current best language identification module.)

T121545, T121546, and T121547 are potentially less expensive (though less exhaustive and less accurate) methods, and probably should be tried first.

Note that the set of people who can review this data is limited because it potentially contains PII. (Unfortunately!!)

From T118287, the next 20 languages by volume after English are Italian (though known to have many duplicates due to cross-wiki searches), German, Spanish, French, Russian, Japanese, Portuguese, Indonesian, Arabic, Chinese, Dutch, Polish, Czech, Turkish, Farsi, Korean, Swedish, Vietnamese, Ukrainian, and Hebrew. (Sorting by "filtered queries" from T118287 drops Hebrew for Finnish and gives a slightly different order, except for Italian, which drops to 9th.)

Estimate: 2-4 days per language, for someone familiar with the language.
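
For illustration only (not part of the original estimate), a cheap pre-filtering pass could cut down the manual review work by flagging likely-English queries before a reviewer sees them. The stopword list and file names below are hypothetical, and a real pass would more likely use an existing language ID model as the filter:

```python
# Sketch of a pre-filter to bootstrap manual curation: route queries
# containing common English stopwords to a "flagged" file so a reviewer
# confirms them instead of hunting for them. Heuristic and file names
# are hypothetical.
ENGLISH_STOPWORDS = {"the", "of", "and", "in", "to", "is", "what", "how"}

def looks_english(query):
    """Crude heuristic: the query contains a common English stopword."""
    return any(tok in ENGLISH_STOPWORDS for tok in query.lower().split())

with open("igbo_queries_raw.txt", encoding="utf-8") as raw, \
     open("igbo_queries_keep.txt", "w", encoding="utf-8") as keep, \
     open("igbo_queries_flagged.txt", "w", encoding="utf-8") as flagged:
    for line in raw:
        (flagged if looks_english(line) else keep).write(line)
```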


Event Timeline

TJones set the priority of this task to Needs Triage.
TJones updated the task description.
TJones added a project: CirrusSearch.
TJones subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.
TJones updated the task description.
TJones set Security to None.
Deskana raised the priority of this task from Low to Medium. · Jan 27 2016, 11:09 PM
Deskana moved this task from Needs triage to Search on the Discovery-ARCHIVED board.
Deskana subscribed.

I'm not sure whether we should pursue this task right now. When I created this big maze of related language ID tasks, I was brainstorming about what was possible and how things relate to each other. Not every linked task is necessary, but there may be a preferred order of operations among them.

This is a lot of work per language: half a week to a week if you are familiar with the language, maybe double that if not, and I was only guessing that 20K examples would be enough to create a decent model.

Now that I have more experience with the different groups of languages people search in on different wikis, I think maybe this isn't worth doing for the really ugly query corpora (e.g., those where almost half the queries are in English but the wiki isn't an English one).

Query-based models do get a few more percentage points of accuracy than wiki-text based models (e.g., no one will type pq for porque in an article on eswiki or ptwiki, but users do it all the time in queries, so models accounting for that will perform better). The biggest improvement, though, is from restricting the language set applied to a corpus (i.e., the language evaluations I'm working on for the bigger wikis).
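
To make the "restricting the language set" point concrete, here is a minimal sketch of a TextCat-style classifier (rank-order "out-of-place" distance over character n-grams); the profile size and function names are illustrative, not the production TextCat code:

```python
from collections import Counter

PROFILE_SIZE = 400  # top-ranked n-grams kept per model

def ngrams(text, n_max=5):
    """Count character n-grams (n = 1..n_max), padding word boundaries."""
    grams = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                grams[padded[i:i + n]] += 1
    return grams

def profile(text):
    """Map each of the most frequent n-grams to its rank."""
    ranked = ngrams(text).most_common(PROFILE_SIZE)
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language
    model get the maximum penalty."""
    return sum(abs(rank - lang_profile.get(gram, PROFILE_SIZE))
               for gram, rank in doc_profile.items())

def identify(query, models, allowed=None):
    """Pick the closest language model, optionally from a restricted set."""
    candidates = {lang: prof for lang, prof in models.items()
                  if allowed is None or lang in allowed}
    doc = profile(query)
    return min(candidates, key=lambda lang: out_of_place(doc, candidates[lang]))
```

Here `models` would map language codes to `profile(training_text)` built from (ideally query-based) training data, and passing, say, `allowed={"es", "pt", "en"}` for eswiki keeps implausible languages from winning on short, ambiguous queries.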

Maybe we should review all these tasks again and think about what's worth doing and what's not, where the biggest bang for the buck is, whether to focus effort on bigger wikis with more users or on smaller wikis in the long tail, and whether to explore less effortful or more automated options.

debt lowered the priority of this task from Medium to Low. · Aug 2 2016, 6:42 PM
debt edited projects, added Discovery-Search; removed Discovery-Search (Current work).
debt subscribed.

Moving this to the backlog board; we're doing a lot of this work in the individual tickets listed in the description.

From a conversation with @TJones:

Query data makes better models for identifying queries than “normal” text data. Sometimes it’s easy to gather, other times not so much (usually because it’s hard to strip out most of the wrong-language queries). The idea here was to put more work into creating a query-based corpus. I don’t think it’s worth it for Wikipedias down the long tail. They don’t have the same level of usage, and it’s more work than I originally thought.