[EPIC] Improve Language Identification for use in Cirrus Search
Open, Low, Public

Description

Includes running additional A/B tests of the language-switching functionality with different libraries for detecting the query's language, and evaluating whether those libraries perform better. Also covers testing other libraries (including retraining on query data or other data) in "the lab" or in production.

Event Timeline

Deskana raised the priority of this task from to Medium.
Deskana updated the task description.
Deskana subscribed.
TJones renamed this task from "EPIC: Run additional A/B tests for the language switching functionality, with different libraries to detect the query's language, and evaluate if the other libraries are better" to "EPIC: Improve Language Identification for use in Cirrus Search". Dec 15 2015, 5:34 PM
TJones updated the task description.
TJones subscribed.

I need an Epic for tasks spun off of T118287, and this one was really close, so I've hijacked it and made it slightly more general.

TextCat and Language Detection

Back before the holidays (12/23/2015), Stas and Trey had a conversation on IRC about TextCat and Lang ID. There was lots of good stuff in the conversation, so the main points are summarized here, to record for posterity, and to open them up to further conversation if anyone has any additional ideas.

For reference, the main Phab ticket for language ID stuff is T118278: EPIC: Improve Language Identification for use in Cirrus Search[1]

Building Language Models: It seems like we should try to create language models to cover at least the same set of languages as the original TextCat. The original models were in various encodings, but we’d create (and have created) models in Unicode. In general, we saw better performance doing language detection on queries using models built on queries.[2] If we want to support general language identification, we could also build models based on text from Wikipedia (which we need to do for some languages anyway because the query data is so poor).[3] It’s a relatively straightforward task, compared to getting sufficiently high quality query data.[4]
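
For anyone unfamiliar with what such a model contains: TextCat follows the character n-gram rank-order approach (Cavnar and Trenkle), in which a language model is a frequency-ranked list of n-grams and a sample is compared to it with an "out-of-place" distance. A minimal Python sketch of that idea is below. The function names, the 1-to-5 n-gram range, the 400-entry cutoff, and the toy training strings are all illustrative assumptions, not the settings or data of the Perl or PHP TextCat.

```python
from collections import Counter


def build_profile(texts, max_n=5, top_k=400):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            padded = f"_{word}_"  # underscores mark word boundaries
            for n in range(1, max_n + 1):
                for i in range(len(padded) - n + 1):
                    counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(sample_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    max_penalty = len(model_profile)
    total = 0
    for rank, gram in enumerate(sample_profile):
        if gram in model_rank:
            total += abs(rank - model_rank[gram])
        else:
            total += max_penalty
    return total


# Toy training data (assumption); real models need far more text per language.
models = {
    "en": build_profile(["where is the nearest train station"]),
    "fr": build_profile(["où est la gare la plus proche"]),
}
query = build_profile(["train station"])
# Languages ordered from closest to farthest model.
ranking = sorted(models, key=lambda lang: out_of_place(query, models[lang]))
```

Because Python strings are Unicode, the same code works for any script. Swapping the training text between query logs and Wikipedia text changes only the input to `build_profile`.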

Using Language Models: We get the biggest improvement in language detection accuracy (~20% increase in F0.5) from restricting the list of candidate languages based on their individual performance and the distribution of languages we encounter in real life, rather than using all available languages.[2][7] We need our new TextCat to support the ability to specify which models to use.[5] It makes sense to create models based on both query data (if we have it) and general text (from Wikipedia) and make them available, probably through Stas’s PHP version of TextCat on GitHub.[6] Trey will also be putting the Perl version and language models up on GitHub after a bit more cleanup.
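
As a reminder of what is being measured and what "specifying which models to use" could mean in practice, here is a small Python sketch. F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The helper names are hypothetical, and how the linked notes aggregate scores across languages is not restated here, so treat this as the per-class formula only.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from true positive, false positive, and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def restrict_models(models, allowed):
    """Keep only the models for the languages in `allowed`, in that order."""
    missing = [lang for lang in allowed if lang not in models]
    if missing:
        raise KeyError(f"no model for: {missing}")
    return {lang: models[lang] for lang in allowed}


# Placeholder model objects (assumption); only the keys matter for this sketch.
all_models = {"en": "en-model", "fr": "fr-model", "de": "de-model"}
candidates = restrict_models(all_models, ["fr", "en"])
```

With precision 0.8 and recall 0.5, `f_beta(8, 2, 8)` scores higher at beta = 0.5 than at beta = 1, which is the intended bias toward precision.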

Choosing Language Models: In order to choose which models to use on a particular wiki, we need to sample queries and manually identify the languages represented, and then experimentally determine the best set of language models to use.[8] We will do this for the wikis with the highest query volume, and see how far down the list we have time to work on. For any wikis we don’t get to, we can try using a generic set of languages, or just not do language detection for now, or make general capabilities available as an opt-in feature—though we need to think more carefully about how to handle smaller wikis, especially after we have more experience using TextCat on larger wikis.

In addition to evaluation sets for particular wikis, we have a task[9] to create a “balanced” set of queries in known languages for top wikis (by query volume) for general evaluation of language models, which can help us determine a generic set of more-or-less reliable languages. (These are smaller sets that let us gauge general performance, but not enough for training language models.)
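
The per-wiki selection step described above (label a sample of queries, then find the language set that scores best on it) can be sketched as a brute-force search in Python. Everything here is an assumption for illustration: the function names, the toy detector, the three sample queries, and the use of plain accuracy as the score where F0.5 would be the likely choice. Exhaustive search is only practical for a handful of candidate languages.

```python
from itertools import combinations


def best_language_set(labeled_queries, candidates, detect, score, max_size=None):
    """Try every non-empty subset of candidates; return (best_score, subset)."""
    max_size = max_size or len(candidates)
    gold = [lang for _, lang in labeled_queries]
    best = (float("-inf"), ())
    for size in range(1, max_size + 1):
        for subset in combinations(sorted(candidates), size):
            predicted = [detect(query, subset) for query, _ in labeled_queries]
            current = score(gold, predicted)
            if current > best[0]:  # strict ">" keeps the smallest winning set
                best = (current, subset)
    return best


def toy_detect(query, languages):
    """Toy stand-in for a real detector: Cyrillic letters mean "ru", else "en"."""
    if "ru" in languages and any("\u0400" <= ch <= "\u04ff" for ch in query):
        return "ru"
    return "en" if "en" in languages else languages[0]


def accuracy(gold, predicted):
    """Fraction of queries whose predicted language matches the label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hand-labeled sample (toy data, assumption).
sample = [("hello", "en"), ("привет", "ru"), ("world", "en")]
best = best_language_set(sample, {"en", "fr", "ru"}, toy_detect, accuracy)
```

On this toy sample, adding "fr" never helps, so the search settles on the two languages actually present.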

Updating Language Model Choices: Trey’s estimate/intuition (which could use some validation) is that the per-wiki language lists would need updating at most once a quarter, though with appropriate metrics we could detect the need for an update from a sudden drop or a sustained gradual decline in performance. We may need to think this through a bit more carefully, since different update patterns imply different places/ways to store the list of relevant language models. Stas says that quarterly updates are close enough to static to put language lists into some file in the Cirrus source, pretty much like we do with indexing profiles, etc. Alternatively, if updates are more frequent and per-wiki, we could store the list of languages to use in mediawiki-config.
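
To make the two storage options concrete, here is a small Python sketch of a lookup that uses a list shipped with the code and lets a per-wiki config override it. The real thing would live in the Cirrus source or mediawiki-config rather than in Python. The names, the wiki key, and the language lists are placeholders, not real settings.

```python
# Placeholder values only; these are not real per-wiki settings.
DEFAULT_LANGUAGES = ()  # empty tuple = no language detection on this wiki
LANGUAGES_BY_WIKI = {
    "examplewiki": ("en", "es", "zh"),
}


def languages_for(wiki, overrides=None):
    """Per-wiki override first, then the list shipped with the code, then the default."""
    if overrides and wiki in overrides:
        return tuple(overrides[wiki])
    return LANGUAGES_BY_WIKI.get(wiki, DEFAULT_LANGUAGES)


shipped = languages_for("examplewiki")
overridden = languages_for("examplewiki", overrides={"examplewiki": ["en"]})
```

If quarterly updates prove sufficient, the `overrides` argument could simply go unused.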

[1] T118278
[2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat#Best_Options
[3] T121545
[4] T121547, etc. See [1] for more.
[5] T121538
[6] https://github.com/smalyshev/textcat
[7] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Evaluation#ElasticSearch_Plugin.E2.80.94Limiting_Languages_.26_Retraining
[8] T121541
[9] T121539

TJones renamed this task from "EPIC: Improve Language Identification for use in Cirrus Search" to "[EPIC] Improve Language Identification for use in Cirrus Search". Aug 27 2020, 8:16 PM
TJones edited projects, added Discovery-Search; removed Discovery-ARCHIVED.
TJones moved this task from needs triage to [epic] on the Discovery-Search board.
MPhamWMF lowered the priority of this task from Medium to Low. Mar 9 2022, 8:58 PM