[EPIC] Improve Language Identification for use in Cirrus Search
Open, Low, Public

Assigned To

None

Authored By

	• Deskana
	Nov 10 2015, 4:53 PM

Description

Includes things like running additional A/B tests for the language switching functionality with different libraries for detecting the query's language, to evaluate whether the other libraries are better. Also includes testing other libraries (including retraining on query data or other data) in "the lab" or in production.

Related Objects

Status	Assigned	Task
Open	None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
Resolved	TJones	T118287 Run test with different library for detection language through the relevance lab, to decide how promising it is to invest further
Resolved	EBernhardson	T121542 Write and deploy an A/B Test on enwiki using TextCat for Language Identification
Resolved	dcausse	T121540 Investigate Updating Cybozu / ES Plugin for Language Identification
Resolved	EBernhardson	T124844 Add textcat to mediawiki vendor libs
Resolved	mpopov	T132706 Validate click events in TestSearchSatisfaction2
Resolved	EBernhardson	T121543 Do an A/B Tests on Other Wikis with TextCat for Language Identification
Resolved	Smalyshev	T121538 Convert TextCat to PHP Library for Language Identification in Cirrus Search
Resolved	TJones	T123537 Generate wikitext-based and query-based language models for TextCat
Resolved	TJones	T123651 Decide which set of separators we have to use for TextCat ngrams
Resolved	• dpatrick	T123558 Security review for TextCat library
Resolved	EBernhardson	T137163 Part Deux: TextCat A/B test for Language Identification - specification
Declined	None	T121544 Create Manually "Curated" Training Sets for Top N Languages for Language Identification
Declined	None	T121546 Experiment with Equalizing Training Set Sizes for Language Identification
Resolved	TJones	T121545 Wikipedia-Text–Based Language Models for Language Identification
Declined	None	T121547 Improve Language Identification Training Data via Application of Language Models to the Training Data
Resolved	debt	T121541 Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis
Resolved	TJones	T121539 Create Balanced Language Identification Evaluation Set for Top N Wikis by Query Volume
Resolved	TJones	T132466 Lang ID Eval Sets for Italian, German, Spanish, and French
Resolved	TJones	T134431 Re-Optimize Italian, German, Spanish, and French TextCat Languages by Recall
Resolved	TJones	T138315 Lang ID Eval Sets for English, Russian, Japanese, Portuguese
Resolved	TJones	T142413 Deploy recommended languages for Russian, Japanese, and Portuguese
Resolved	debt	T143355 request translations for 'showing results from'
Resolved	Anikethfoss	T145926 [[MediaWiki:Search-interwiki-results-acewiki/fi]] typo: "Acehnese" instead of "Achinese"
Resolved	TJones	T142140 Lang ID Eval Set for Dutch
Resolved	debt	T143354 ask for translations for 'showing results from' (Polish, Dutch, Arabic and Chinese)
Resolved	Smalyshev	T127338 Specify which languages TextCat should use
Resolved	TJones	T134427 TextCat Demo Page in Labs
Declined	None	T134430 Automate Process for Selecting Languages for a Given Wiki for Use With TextCat
Resolved	debt	T136034 [EPIC] Estimate the "wasted" computational cost of recall- vs precision-focused configs
Declined	None	T136637 Language detection demo for potential UI elements
Declined	None	T136639 Develop UI mockups for cross-project searching using TextCat (language detection)
Resolved	TJones	T140289 Investigate Improvements and Confidence Measures for TextCat Language Detection
Resolved	TJones	T149314 update Trey’s lang ID evaluation tools
Resolved	TJones	T149316 allow TextCat to use multiple language model directories
Resolved	TJones	T149318 Add support for limiting min input length for TextCat
Resolved	TJones	T149320 Implement Ability to Compare TextCat Scores to Max Cost and Analyze Effect on Accuracy
Resolved	TJones	T149321 Optimize TextCat maximum returned languages and results ratio
Resolved	TJones	T149322 Bucketing & Bonuses for TextCat
Resolved	TJones	T149323 Qualitative confidence score for TextCat
Resolved	TJones	T149324 TextCat Improvement Deployment
Resolved	TJones	T155672 Deploy 10K models for TextCat (PHP & Perl)
Resolved	TJones	T151230 Consider Additional Unknown n-gram Penalty
Resolved	TJones	T153105 Refactor TextCat for ambiguity detection and add additional params
Declined	None	T155670 Investigate Ratio of First to Second Result Scores as a Confidence Measure
Open	None	T140292 A/B Test TextCat settings on non-WP projects
Declined	None	T142584 Investigate creating mocks for UI component of this A/B test
Open	None	T140300 Provide language identification to the long-tail of wikis
Open	None	T138958 Detect "wrong keyboard" queries for Russian/American keyboards on EN/RU Wikipedias
Resolved	TJones	T213931 Update TextCat with wrong-keyboard models
Declined	TJones	T213935 Revert changes to TextCat that add dependency on autoload.php
Resolved	Smalyshev	T213936 Deploy new version of TextCat
Resolved	TJones	T216083 Update required version of TextCat in CirrusSearch
Open	None	T146702 Allow for individual wiki's to disable cross-wiki search results
Open	None	T155104 Detect "wrong keyboard" queries for Hebrew/American keyboards on EN/HE Wikipedias
Open	None	T219912 Loosen limit on DYM suggestions blocking cross-language results from < 3 to < 5
Open	None	T219911 Retrain Chinese query-based language ID models
Open	None	T219915 Enable more of the unambiguous/less ambiguous scripts for language identification

Event Timeline

• Deskana created this task. Nov 10 2015, 4:53 PM

• Deskana raised the priority of this task from to Medium.

• Deskana updated the task description.

• Deskana added projects: Discovery-ARCHIVED, Discovery-Search (Current work), Epic.

• Deskana subscribed.

Restricted Application added a subscriber: Aklapper. Nov 10 2015, 4:53 PM

• Deskana moved this task from Needs triage to Product Epics on the Discovery-ARCHIVED board. Nov 10 2015, 4:53 PM

• ksmith removed a project: Discovery-Search (Current work). Nov 10 2015, 5:54 PM

• ksmith set Security to None.

I need an Epic for tasks spun off of T118287, and this one was really close, so I've hijacked it and made it slightly more general.

• Deskana closed subtask T118287: Run test with different library for detection language through the relevance lab, to decide how promising it is to invest further as Resolved. Dec 31 2015, 5:15 AM

• Deskana closed subtask T121540: Investigate Updating Cybozu / ES Plugin for Language Identification as Resolved.

TextCat and Language Detection

Back before the holidays (12/23/2015), Stas and Trey had a conversation on IRC about TextCat and Lang ID. There was lots of good stuff in the conversation, so the main points are summarized here, to record for posterity, and to open them up to further conversation if anyone has any additional ideas.

For reference, the main Phab ticket for language ID stuff is T118278: EPIC: Improve Language Identification for use in Cirrus Search[1]

Building Language Models: It seems like we should try to create language models to cover at least the same set of languages as the original TextCat. The original models were in various encodings, but we’d create (and have created) models in Unicode. In general, we saw better performance doing language detection on queries using models built on queries.[2] If we want to support general language identification, we could also build models based on text from Wikipedia (which we need to do for some languages anyway because the query data is so poor).[3] It’s a relatively straightforward task, compared to getting sufficiently high quality query data.[4]
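
As a rough illustration of what "building a language model" means here: TextCat-style identifiers are generally described as ranked lists of frequent character n-grams, compared with an "out-of-place" rank distance. The sketch below follows that general idea only. The function names, the `_` word padding, and the `top_k` cutoff are assumptions made for illustration, not the actual TextCat code or its parameters.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=400):
    """Return the top_k most frequent character n-grams, most frequent first.

    max_n and top_k are illustrative defaults, not values from the notes.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries (an assumed convention)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(sample_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model cost the maximum."""
    rank = {gram: r for r, gram in enumerate(model_profile)}
    max_penalty = len(model_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(sample_profile)
    )


def detect(text, models):
    """Return the language whose model is closest to the text's profile."""
    sample = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(sample, models[lang]))
```

Under this sketch, a query-based model and a Wikipedia-text-based model differ only in the text passed to `ngram_profile`, for example `models = {"en": ngram_profile(english_queries)}`, where `english_queries` is a hypothetical training string.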

Using Language Models: We get the biggest improvement in language detection accuracy (~20% increase in F0.5) from restricting the list of candidate languages based on their individual performance and the distribution of languages we encounter in real life, rather than using all available languages.[2][7] We need our new TextCat to support the ability to specify which models to use.[5] It makes sense to create models based on both query data (if we have it) and general text (from Wikipedia) and make them available, probably through Stas’s PHP version of TextCat on GitHub.[6] Trey will also be putting the Perl version and language models up on GitHub after a bit more cleanup.
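
For readers unfamiliar with the measure: F0.5 is presumably the standard F-measure with β = 0.5, which weights precision more heavily than recall. Writing P for precision and R for recall (symbols chosen here, not taken from the notes), the usual definition is:

```latex
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R},
\qquad
F_{0.5} = \frac{1.25\,P\,R}{0.25\,P + R}
```

As a made-up worked example, P = 0.9 and R = 0.6 give F0.5 = 0.675 / 0.825 ≈ 0.818, compared with F1 = 0.72 for the same values, so a configuration that rarely guesses wrong is rewarded more than one that guesses often.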

Choosing Language Models: In order to choose which models to use on a particular wiki, we need to sample queries and manually identify the languages represented, and then experimentally determine the best set of language models to use.[8] We will do this for the wikis with the highest query volume, and see how far down the list we have time to work on. For any wikis we don’t get to, we can try using a generic set of languages, or just not do language detection for now, or make general capabilities available as an opt-in feature—though we need to think more carefully about how to handle smaller wikis, especially after we have more experience using TextCat on larger wikis.
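
One way to picture "experimentally determine the best set" is a brute-force search over candidate language sets, scored against the manually tagged sample. This is a sketch under assumptions, not the procedure actually used: the scoring by F0.5 is borrowed from the measure mentioned above, `detect` is any function that returns a language code or `None`, and the toy detector and sample at the end are invented purely so the example runs.

```python
from itertools import combinations


def f_beta(precision, recall, beta=0.5):
    """Standard F-measure; beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score(detect, sample, candidates, beta=0.5):
    """Score a candidate set on (query, true_language) pairs."""
    guesses = [(detect(query, candidates), truth) for query, truth in sample]
    made = sum(1 for guess, _ in guesses if guess is not None)
    correct = sum(1 for guess, truth in guesses if guess is not None and guess == truth)
    relevant = sum(1 for _, truth in guesses if truth is not None)
    precision = correct / made if made else 0.0
    recall = correct / relevant if relevant else 0.0
    return f_beta(precision, recall, beta)


def best_language_set(detect, sample, available, beta=0.5):
    """Try every non-empty subset; exponential, so only for short lists."""
    best = (0.0, ())
    for size in range(1, len(available) + 1):
        for subset in combinations(sorted(available), size):
            current = score(detect, sample, subset, beta)
            if current > best[0]:
                best = (current, subset)
    return best


# Toy stand-in for a real detector (invented data): each query lists
# languages in the order the imaginary detector would prefer them.
TOY_PREFERENCES = {
    "the cat": ["en"],
    "el gato": ["es"],
    "casa": ["pt", "es"],  # misread as Portuguese whenever "pt" is allowed
}


def toy_detect(query, candidates):
    for lang in TOY_PREFERENCES.get(query, []):
        if lang in candidates:
            return lang
    return None


TOY_SAMPLE = [("the cat", "en"), ("el gato", "es"), ("casa", "es")]
```

In the toy data, leaving out the Portuguese model removes the only error, which is the same effect the notes describe: fewer, better-chosen candidate languages can beat using everything available.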

In addition to evaluation sets for particular wikis, we have a task[9] to create a “balanced” set of queries in known languages for top wikis (by query volume) for general evaluation of language models, which can help us determine a generic set of more-or-less reliable languages. (These are smaller sets that let us gauge general performance, but not enough for training language models.)

Updating Language Model Choices: Trey’s estimate/intuition (which could use some validation) is that the per-wiki language lists would need updating at most once a quarter, though it’s possible that with appropriate metrics we could tell that an update is needed from a sudden or sustained gradual decrease in performance. We may need to think this through a bit more carefully, since different update patterns imply different places/ways to store the list of relevant language models. Stas says that quarterly updates are close enough to static to put language lists into some file in the Cirrus source, pretty much like we do with indexing profiles, etc. Alternatively, if updates are more frequent and per-wiki, we could store the list of languages to use in mediawiki-config.

[1] T118278
[2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat#Best_Options
[3] T121545
[4] T121547, etc. See [1] for more.
[5] T121538
[6] https://github.com/smalyshev/textcat
[7] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Evaluation#ElasticSearch_Plugin.E2.80.94Limiting_Languages_.26_Retraining
[8] T121541
[9] T121539

Smalyshev closed subtask T121538: Convert TextCat to PHP Library for Language Identification in Cirrus Search as Resolved. Jan 28 2016, 6:03 PM

TJones created subtask T127338: Specify which languages TextCat should use. Feb 18 2016, 5:27 PM

TJones added a project: CirrusSearch.

TJones removed a project: CirrusSearch. Feb 18 2016, 5:30 PM

Smalyshev closed subtask T127338: Specify which languages TextCat should use as Resolved. Feb 25 2016, 9:36 PM

• Deskana changed the status of subtask T121542: Write and deploy an A/B Test on enwiki using TextCat for Language Identification from Open to Stalled. Apr 12 2016, 10:16 PM

TJones created subtask T134427: TextCat Demo Page in Labs. May 4 2016, 7:38 PM

TJones created subtask T134430: Automate Process for Selecting Languages for a Given Wiki for Use With TextCat. May 4 2016, 8:04 PM

• Deskana closed subtask T121539: Create Balanced Language Identification Evaluation Set for Top N Wikis by Query Volume as Resolved. May 11 2016, 10:40 PM

TJones created subtask T136034: [EPIC] Estimate the "wasted" computational cost of recall- vs precision-focused configs. May 23 2016, 7:37 PM

debt subscribed. May 26 2016, 2:45 PM

TJones mentioned this in T136637: Language detection demo for potential UI elements. May 31 2016, 7:35 PM

TJones created subtask T136637: Language detection demo for potential UI elements.

debt closed subtask T121542: Write and deploy an A/B Test on enwiki using TextCat for Language Identification as Resolved. Jun 8 2016, 12:37 AM

TJones added a subtask: T140289: Investigate Improvements and Confidence Measures for TextCat Language Detection. Jul 13 2016, 7:36 PM

TJones added a subtask: T140292: A/B Test TextCat settings on non-WP projects. Jul 13 2016, 7:59 PM

TJones added a subtask: T140300: Provide language identification to the long-tail of wikis. Jul 13 2016, 8:27 PM

debt closed subtask T134430: Automate Process for Selecting Languages for a Given Wiki for Use With TextCat as Declined. Jul 18 2016, 10:26 PM

debt added a subtask: T139310: [UI design] Search Results page: re-design page display of search results across languages. Jul 18 2016, 11:55 PM

debt closed subtask T134427: TextCat Demo Page in Labs as Resolved. Jul 21 2016, 3:44 PM

debt closed subtask T121543: Do an A/B Tests on Other Wikis with TextCat for Language Identification as Resolved. Jul 21 2016, 4:02 PM

Liuxinyu970226 subscribed. Jul 22 2016, 12:42 AM

• ksmith mentioned this in T26767: Multilingual search on project portals (e.g. www.wikipedia.org). Jul 22 2016, 7:28 PM

Cpiral mentioned this in T125944: Allow to search pages in a specific language, e.g. without translations. Jul 26 2016, 9:34 PM

debt closed subtask T136034: [EPIC] Estimate the "wasted" computational cost of recall- vs precision-focused configs as Resolved. Jul 27 2016, 7:43 PM

TJones added a subtask: T138958: Detect "wrong keyboard" queries for Russian/American keyboards on EN/RU Wikipedias. Aug 3 2016, 8:23 PM

debt closed subtask T121547: Improve Language Identification Training Data via Application of Language Models to the Training Data as Declined. Aug 4 2016, 7:01 PM

debt closed subtask T121544: Create Manually "Curated" Training Sets for Top N Languages for Language Identification as Declined. Aug 4 2016, 7:06 PM

debt closed subtask T136637: Language detection demo for potential UI elements as Declined. Aug 5 2016, 7:30 PM

debt closed subtask T121545: Wikipedia-Text–Based Language Models for Language Identification as Resolved. Aug 8 2016, 4:40 PM

debt mentioned this in T129627: Automatically switch to user's query language if user types characters associated with only one language. Sep 23 2016, 6:34 PM

CKoerner_WMF created subtask T146702: Allow for individual wiki's to disable cross-wiki search results. Sep 26 2016, 9:05 PM

debt closed subtask T139310: [UI design] Search Results page: re-design page display of search results across languages as Resolved. Jan 9 2017, 6:17 PM

TJones created subtask T155104: Detect "wrong keyboard" queries for Hebrew/American keyboards on EN/HE Wikipedias. Jan 11 2017, 6:23 PM

• Deskana removed a subtask: T139310: [UI design] Search Results page: re-design page display of search results across languages. Feb 13 2017, 5:52 PM

Nemo_bis mentioned this in T3837: Add the ability to simultaneously search all languages on the on-wiki search page. Mar 13 2017, 1:51 PM

• MZMcBride subscribed. Mar 13 2017, 1:54 PM

debt closed subtask T121541: Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis as Resolved. Apr 5 2017, 2:51 PM

debt closed subtask T121546: Experiment with Equalizing Training Set Sizes for Language Identification as Declined. Apr 5 2017, 2:53 PM

TJones closed subtask T140289: Investigate Improvements and Confidence Measures for TextCat Language Detection as Resolved. Jan 30 2019, 10:50 PM

TJones added a subtask: T219912: Loosen limit on DYM suggestions blocking cross-language results from < 3 to < 5. Apr 2 2019, 6:11 PM

TJones added a subtask: T219911: Retrain Chinese query-based language ID models.

TJones renamed this task from EPIC: Improve Language Identification for use in Cirrus Search to [EPIC] Improve Language Identification for use in Cirrus Search. Aug 27 2020, 8:16 PM

TJones edited projects, added Discovery-Search; removed Discovery-ARCHIVED.

TJones moved this task from needs triage to [epic] on the Discovery-Search board.

MPhamWMF lowered the priority of this task from Medium to Low. Mar 9 2022, 8:58 PM
