
Investigate Updating Cybozu / ES Plugin for Language Identification
Closed, ResolvedPublic

Description

Use the training data created in T118287 for training models for the ES Plugin / Cybozu. Perhaps its difficulties with queries are partly due to the use of general language models. This could also include looking at the internals and seeing if there is any benefit to changing the model size or other internal configuration, including optionally disabling "unhelpful" models (e.g., Romanian when working on enwiki).

Estimate is difficult to make, but it could be time-boxed to 1 or 2 weeks.


Event Timeline

TJones raised the priority of this task from to Needs Triage.
TJones updated the task description.
TJones added a project: CirrusSearch.
TJones subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

Summoning @dcausse — this was originally your idea. Do you have time/interest to work on it? Do you want to call dibs?

Sure,

I think we could do it in two passes:

  1. Reducing the list of detected languages: this is very easy, I just have to package another profile with the reduced list of languages. But reading your notes, training data seems to be important.
  2. Training Cybozu with the same training set. Where can I find the training set?

In the end I'd say that Cybozu must outperform TextCat by a large margin in order to remain a contender, because the TextCat PHP implementation offers several advantages:

  • It is deployed within MediaWiki, so there is no extra call to another service. With Cybozu, we would have to plug it in as a suggest query that can be inlined with the main query.
  • Cybozu is poorly integrated into Elasticsearch: it supports only one model at a time, and updating the model requires a rolling restart.

I'll upload a new version of the plugin with a limited language list so you can give it a try.
As soon as you give me the training set I'll upload a new one.

I realized that we can already limit the language list with config parameters in /etc/elasticsearch/elasticsearch.yml:

profile: "/langdetect/short-text/"
languages: en,es,zh-cn,zh-tw,pt,ar,ru,fa,ko,bn,bg,hi,el,ta,th

I'm not sure about zh-cn and zh-tw, I think that in cirrus we consider both to be Chinese.
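
To illustrate what limiting the language list does to a detector's output, here is a minimal Python sketch. It is not the plugin's code: the function name `restrict_languages`, the input format (a mapping from language code to probability), and the renormalization step are all assumptions made for the example.

```python
# Hypothetical illustration only; the plugin's internals may differ.
ALLOWED = {"en", "es", "zh-cn", "zh-tw", "pt", "ar", "ru", "fa",
           "ko", "bn", "bg", "hi", "el", "ta", "th"}


def restrict_languages(scores, allowed=ALLOWED):
    """Drop languages outside `allowed` and renormalize what remains.

    `scores` maps a language code to a probability-like score.
    Returns an empty dict if no allowed language has a positive score.
    """
    kept = {lang: p for lang, p in scores.items() if lang in allowed}
    total = sum(kept.values())
    if total <= 0:
        return {}
    return {lang: p / total for lang, p in kept.items()}


def best_language(scores, allowed=ALLOWED):
    """Return the highest-scoring allowed language, or None."""
    kept = restrict_languages(scores, allowed)
    return max(kept, key=kept.get) if kept else None
```

With made-up scores such as `{"ro": 0.5, "en": 0.3, "es": 0.2}`, `best_language` would return `"en"`, because Romanian is not in the allowed set.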

@TJones: Do you have a running Elasticsearch instance where you could apply these settings and run an evaluation?
I can set up the hypothesis testing cluster with it if that's easier.

@dcausse, when I was evaluating the plugin originally, I truncated the language code at the dash, so that zh-cn and zh-tw both count as zh. As I understand it, zhwiki handles both traditional and simplified Chinese.
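
The truncation at the dash described above can be sketched in a few lines of Python. The helper name `base_language` is hypothetical, and lowercasing the result is an added assumption, not something stated in this thread.

```python
def base_language(code):
    """Return the part of a language code before the first dash.

    Example: "zh-cn" and "zh-tw" both become "zh"; "en" stays "en".
    Lowercasing is an assumption made for consistent comparison.
    """
    return code.split("-", 1)[0].lower()


def same_language(detected, expected):
    """Compare two codes after truncating both at the dash."""
    return base_language(detected) == base_language(expected)
```

Under this rule, a query labeled `zh` counts as correctly identified whether the detector returns `zh-cn` or `zh-tw`.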

Looking at my config, I should be able to test locally with a limited language set.

Trey ran 2 other tests:

profile                                      F0.5
short-text                                   55
short-text with limited language list        75
input.filtered with limited language list    81.8

While we've seen good improvements, TextCat (with the same training data) is still better, with an F0.5 of 83.1.
Building the profiles and running the test was relatively easy, so we can re-test it if needed.
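
For reference, F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below uses the standard formula; the function name `f_beta` is my own, and the thread does not report the precision and recall values behind the scores above, so any inputs used with it here would be illustrative only.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R).

    With beta = 0.5, precision counts more than recall.
    Returns 0.0 when the denominator is zero.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom
```

When precision and recall are equal, the score equals that shared value for any beta; when they differ, F0.5 sits closer to precision than the balanced F1 does.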

Based on this result, I'd suggest investing more time in TextCat and better training data.

Deskana triaged this task as Medium priority.
Deskana moved this task from Needs triage to On Sprint Board on the Discovery-ARCHIVED board.
Deskana subscribed.