Wikipedia-Text–Based Language Models for Language Identification
Closed, Resolved · Public

Description

Though T118287 shows that query-based language models work better than wiki-text–based language models, getting good query data can be hard. So we should see whether Wikipedia-based language models do better for languages whose current training data is crappy. (This could obviate the need for T121544 or T121546 in some cases, and could be tested by re-running the data in T121539 or T121541 with the current best language identification module.)

Need list of target languages to work on (based on T121539).

Estimate: 2-4 hours per language to create a corpus, do minimal cleanup, and build & test model(s).
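
The per-language steps above (create a corpus, do minimal cleanup, build and test a model) can be sketched roughly as follows. This is only an illustration under stated assumptions: the task does not say what kind of model is built, so the sketch assumes rank-ordered character n-gram profiles in the style of TextCat, and every function name, regex, and parameter value here (`clean_wikitext`, `build_profile`, `out_of_place`, `identify`, `max_n`, `top_k`) is hypothetical rather than taken from the actual language identification module.

```python
import re
from collections import Counter


def clean_wikitext(text):
    """Very light cleanup (assumed, not the task's actual rules)."""
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)  # drop non-nested {{templates}}
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML-like tags such as <ref>
    # Keep only the visible label of [[target|label]] or [[target]] links.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def build_profile(text, max_n=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    padded = "_" + text.replace(" ", "_") + "_"  # mark word boundaries
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, models):
    """Return the language whose profile is closest to the text's profile."""
    doc = build_profile(clean_wikitext(text))
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))


# Toy "corpora" standing in for text pulled from each Wikipedia.
corpora = {
    "en": "The {{lang|en}} [[cat|Cat]] sat on the mat and looked at the dog.",
    "de": "Die [[Katze]] sitzt auf der Matte und schaut den Hund an.",
}
models = {lang: build_profile(clean_wikitext(raw)) for lang, raw in corpora.items()}
```

With models built this way, `identify("some query text", models)` returns one of the keys of `models`. A real run would train on far more text per language and evaluate against a tagged query sample such as the data in T121539.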

Event Timeline

TJones raised the priority of this task to Needs Triage.
TJones updated the task description.
TJones added a project: CirrusSearch.
TJones added a subscriber: TJones.
Restricted Application added subscribers: StudiesWorld, Aklapper.
Deskana triaged this task as Medium priority. (Jan 27 2016, 11:09 PM)
Deskana moved this task from Needs triage to Search on the Discovery-ARCHIVED board.
Deskana added a subscriber: Deskana.

I've generated these models based on lightly cleaned up wiki-text and submitted them for review.

The models haven't been vetted, but T121539 isn't going to cover all 70 of these languages anyway.

From a conversation with @TJones:

This is a good place to spend time. Wikitext models are generally easy to create by pulling articles from the relevant Wikipedia for training data. The bang for your buck is probably high enough to do this for all of the medium-effort section of the long tail (many are already done).

As we are focusing on the top 25 languages by volume from the dashboard, we have either query-based models or wiki-text–based models for all 25. (Only Finnish, Hungarian, Norwegian, and Tagalog don't have query-based models.)

BTW, the top 25 are: English, German, Spanish, Portuguese, Russian, French, Italian, Japanese, Polish, Arabic, Chinese, Dutch, Turkish, Swedish, Persian, Czech, Vietnamese, Indonesian, Korean, Finnish, Hebrew, Tagalog, Thai, Norwegian, and Hungarian.

If we need to create new wiki-text–based models while working on the current batch of language Wikipedias (say, because there's a new language that's used a lot on one of these wikis), I can spin up a new subtask to build that wiki-text model.

So, I say mark this one done.

Moving to done column on sprint board, based on @TJones's comment above!