Investigate Updating Cybozu / ES Plugin for Language Identification
Closed, ResolvedPublic

Authored By

	TJones
	Dec 15 2015, 5:46 PM

Description

Use the training data created in T118287 for training models for the ES Plugin / Cybozu. Perhaps its difficulties with queries are partly due to the use of general language models. This could also include looking at the internals and seeing if there is any benefit to changing the model size or other internal configuration, including optionally disabling "unhelpful" models (e.g., Romanian when working on enwiki).

Estimate is difficult to make, but it could be time-boxed to 1 or 2 weeks.

Related Objects

Status	Assigned	Task
Resolved	EBernhardson	T137158 Compile and then resolve issues with TextCat A/B test data
Declined	mpopov	T134320 Analyse results of TextCat A/B test
Resolved	EBernhardson	T130321 Disable Schema:Search, since it's outdated and redundant
Resolved	mpopov	T129564 Switch Desktop data collection for dashboards to use TestSearchSatisfaction2 instead of Search schema
Resolved	EBernhardson	T134319 Turn off TextCat A/B test on the English Wikipedia on or after May 23
Resolved	debt	T134318 Verify data pipeline for TextCat A/B test on English Wikipedia
Open	None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
Resolved	EBernhardson	T121542 Write and deploy an A/B Test on enwiki using TextCat for Language Identification
Resolved	dcausse	T121540 Investigate Updating Cybozu / ES Plugin for Language Identification
Resolved	EBernhardson	T124844 Add textcat to mediawiki vendor libs
Resolved	mpopov	T132706 Validate click events in TestSearchSatisfaction2
Resolved	EBernhardson	T121543 Do an A/B Tests on Other Wikis with TextCat for Language Identification
Resolved	debt	T121541 Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis
Resolved	TJones	T121539 Create Balanced Language Identification Evaluation Set for Top N Wikis by Query Volume
Resolved	TJones	T132466 Lang ID Eval Sets for Italian, German, Spanish, and French
Resolved	TJones	T134431 Re-Optimize Italian, German, Spanish, and French TextCat Languages by Recall
Resolved	TJones	T138315 Lang ID Eval Sets for English, Russian, Japanese, Portuguese
Resolved	TJones	T142413 Deploy recommended languages for Russian, Japanese, and Portuguese
Resolved	debt	T143355 request translations for 'showing results from'
Resolved	Anikethfoss	T145926 [[MediaWiki:Search-interwiki-results-acewiki/fi]] typo: "Acehnese" instead of "Achinese"
Resolved	TJones	T142140 Lang ID Eval Set for Dutch
Resolved	debt	T143354 ask for translations for 'showing results from' (Polish, Dutch, Arabic and Chinese)
Resolved	Smalyshev	T121538 Convert TextCat to PHP Library for Language Identification in Cirrus Search
Resolved	TJones	T123537 Generate wikitext-based and query-based language models for TextCat
Resolved	TJones	T123651 Decide which set of separators we have to use for TextCat ngrams
Resolved	• dpatrick	T123558 Security review for TextCat library
Resolved	EBernhardson	T137163 Part Deux: TextCat A/B test for Language Identification - specification

Event Timeline

TJones created this task. Dec 15 2015, 5:46 PM

TJones raised the priority of this task from to Needs Triage.

TJones updated the task description.

TJones added a project: CirrusSearch.

TJones subscribed.

Restricted Application added a project: Discovery-ARCHIVED. Dec 15 2015, 5:46 PM

Restricted Application added subscribers: StudiesWorld, Aklapper.

TJones mentioned this in T121542: Write and deploy an A/B Test on enwiki using TextCat for Language Identification. Dec 15 2015, 5:48 PM

TJones added a parent task: T121542: Write and deploy an A/B Test on enwiki using TextCat for Language Identification.

TJones added a parent task: T118278: [EPIC] Improve Language Identification for use in Cirrus Search. Dec 15 2015, 5:56 PM

• Deskana added a project: Discovery-Search (Current work). Dec 22 2015, 6:22 PM

• Deskana set Security to None.

Summoning @dcausse — this was originally your idea. Do you have time/interest to work on it? Do you want to call dibs?

Sure,

I think we could do it in two passes:

Reduce the list of detected languages. This is very easy: I just have to package another profile with the reduced list of languages. But from reading your notes, training data seems to be important.
Train Cybozu on the same training set. Where can I find the training set?

In the end, I'd say that Cybozu must outperform TextCat by a large margin in order to remain a contender.
The TextCat PHP implementation offers many advantages:

It is deployed within MediaWiki, so there is no extra call to another service. With Cybozu, we would have to plug it in as a suggest query that can be inlined with the main query.
Cybozu is poorly integrated into Elasticsearch: it supports only one model at a time, and updating the model requires a rolling restart.

I'll upload a new version of the plugin with a limited language list so you can give it a try.
As soon as you give me the training set I'll upload a new one.

dcausse claimed this task. Dec 23 2015, 12:40 PM

dcausse moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

I realized that we can already limit the language list with a config parameter in /etc/elasticsearch/elasticsearch.yml:

profile: "/langdetect/short-text/"
languages: en,es,zh-cn,zh-tw,pt,ar,ru,fa,ko,bn,bg,hi,el,ta,th

I'm not sure about zh-cn and zh-tw; I think that in Cirrus we consider both to be Chinese.

@TJones: Do you have a running elastic instance where you could set these settings and run an evaluation?
I can set up the hypothesis testing cluster with it if that's easier.
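
To illustrate the effect of a limited language list, here is a minimal Python sketch. It is not how the plugin implements the `languages` setting; the function name `restrict_languages` and the example scores are hypothetical. It only shows why dropping an "unhelpful" candidate (such as Romanian on enwiki) can change the top-ranked language.

```python
def restrict_languages(scores, allowed):
    """Keep only allowed languages and renormalize the remaining scores.

    `scores` maps a language code to a detector score (hypothetical values);
    `allowed` is the limited language list.
    """
    kept = {lang: p for lang, p in scores.items() if lang in allowed}
    total = sum(kept.values())
    if total == 0:
        return {}  # nothing left to choose from
    return {lang: p / total for lang, p in kept.items()}


# Hypothetical detector output for a short query.
scores = {"en": 0.4, "ro": 0.5, "es": 0.1}
restricted = restrict_languages(scores, {"en", "es"})
print(max(restricted, key=restricted.get))  # prints "en"
```

With the full list, the hypothetical top candidate is `ro`; with the limited list it becomes `en`.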

@dcausse, when I was evaluating the plugin originally, I truncated the language code at the dash, so that zh-cn and zh-tw both count as zh. As I understand it, zhwiki handles both traditional and simplified Chinese.

Looking at my config, I should be able to test locally with a limited language set.
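
The truncation described above can be sketched in a few lines of Python. The function name `base_language` is hypothetical and this is not the actual evaluation script; it only shows the idea of cutting the code at the first dash so that zh-cn and zh-tw both count as zh.

```python
def base_language(code: str) -> str:
    """Return the part of a language code before the first dash, lowercased."""
    return code.strip().lower().split("-", 1)[0]


detected = ["zh-cn", "zh-tw", "en", "pt"]
print([base_language(c) for c in detected])  # prints ['zh', 'zh', 'en', 'pt']
```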

Trey ran 2 other tests:

profile	F0.5
short-text	55
short-text with limited language list	75
input.filtered with limited language list	81.8

While we've seen good improvements, TextCat (with the same training data) is still better, with an F0.5 of 83.1.
Building the profiles and running the test was relatively easy, so we can re-test it if needed.

Based on this result, I'd suggest investing more time in TextCat and better training data.
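
For readers unfamiliar with the F0.5 metric used in the table above, here is a minimal Python sketch of the standard F-beta formula, which with beta = 0.5 weights precision more heavily than recall. The function name `f_beta` and the precision/recall inputs are hypothetical; they are not figures from this evaluation.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when both are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical inputs: high precision, lower recall.
print(round(f_beta(0.8, 0.5), 3))  # prints 0.714
```

With the same hypothetical inputs, the balanced F1 (`beta=1`) would be lower than F0.5, because F0.5 rewards the higher precision.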

dcausse moved this task from not in use - please delete to Needs Reporting on the Discovery-Search (Current work) board. Dec 23 2015, 5:04 PM

More details:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Evaluation#ElasticSearch_Plugin.E2.80.94Limiting_Languages_.26_Retraining

Liuxinyu970226 subscribed. Dec 26 2015, 11:46 PM

• Deskana moved this task from Inbox to Multilingual and cross-project on the CirrusSearch board. Dec 31 2015, 12:28 AM

• Deskana closed this task as Resolved. Dec 31 2015, 5:15 AM

• Deskana triaged this task as Medium priority.

• Deskana moved this task from Needs triage to On Sprint Board on the Discovery-ARCHIVED board.

• Deskana subscribed.

Liuxinyu970226 unsubscribed. Dec 31 2015, 9:26 AM

• Deskana moved this task from Needs Reporting to Resolved on the Discovery-Search (Current work) board. Jan 28 2016, 6:09 PM
