
Create an ordered list of languages we want to find new analysers for
Closed, ResolvedPublic

Description

This quarter we're researching new language analysers (see relevant mailing list post). We have some ideas for what languages need new analysers (e.g. Polish, Chinese, Hebrew, etc.) but we'd like to take a more structured look at the problem to see where we can best focus our efforts. For the sake of expediency, we'll still be starting with Polish (see T154516).

Sadly, this is hard to involve volunteers in: native speakers are not especially hard for us to find, but most users are unfamiliar with things like tokenisation and n-grams, so explaining exactly what we want from them is fairly hard.

Event Timeline

Deskana moved this task from needs triage to Current work on the Discovery-Search board.

It seems like the potential easy wins are the Elastic Core Plugins, which include Polish (Stempel) and Ukrainian (Morfologik), languages that have been mentioned before as needing improvement. They also include Japanese (Kuromoji) and Chinese (SmartCN), though I have a vague recollection that those may not perform very well.

That page also lists a Hebrew Analysis plugin, and we have the other one, mentioned in the parent Epic (T154511).
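For reference, a rough sketch of how these plugin analyzers could be wired into an index mapping. The analyzer names are the ones the core plugins register; the `index_body` helper, the `text` field name, and the typeless mapping layout are assumptions for illustration and would need adjusting to our actual Elasticsearch version and index configuration.

```python
# Analyzer names registered by the Elastic core analysis plugins.
PLUGIN_ANALYZERS = {
    "pl": "polish",     # analysis-stempel
    "uk": "ukrainian",  # analysis-ukrainian (Morfologik)
    "ja": "kuromoji",   # analysis-kuromoji
    "zh": "smartcn",    # analysis-smartcn
}


def index_body(lang: str, field: str = "text") -> dict:
    """Build a minimal index-creation body for one language.

    Hypothetical helper: falls back to the built-in "standard" analyzer
    when no plugin analyzer is listed for the language.
    """
    analyzer = PLUGIN_ANALYZERS.get(lang, "standard")
    return {
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": analyzer},
            }
        }
    }
```

The corresponding plugin has to be installed on every node before an index using its analyzer can be created.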

So my first draft of a list would be:

  • Polish: Elastic says it "provides high quality stemming for Polish", and it's probably easy.
  • Chinese: we really need this, and we know of SmartCN and others to consider.
  • Ukrainian: Elastic has one, though it only "provides stemming for Ukrainian" (no "high quality" claim); we're currently using Russian, which is better than nothing, but not at all great.
  • Hebrew: recently requested / suggested, and Elastic suggests HebMorph as well.
  • Japanese: we're using CJK analysis in production, which is just bigrams. Maybe Kuromoji is better?
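To make the "just bigrams" point concrete, here is a simplified sketch of bigram tokenisation. It is not the actual Lucene implementation; `cjk_bigrams` is a hypothetical helper that assumes its input is a single unbroken run of CJK characters.

```python
def cjk_bigrams(text: str) -> list[str]:
    """Split a run of CJK characters into overlapping two-character tokens.

    Simplified illustration: no script detection, no handling of mixed
    Latin/CJK text. A single isolated character is returned as-is.
    """
    chars = list(text)
    if len(chars) < 2:
        return chars
    return [a + b for a, b in zip(chars, chars[1:])]


print(cjk_bigrams("東京都"))  # ['東京', '京都']
```

Every adjacent pair becomes a token regardless of where the real word boundaries are, which is why a morphological analyzer might do better.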

None of these are too far off the beaten path since Elastic recommends all of them; we'll gain some expertise and learn how to do this process better, especially working with the community for review and evaluation, while hopefully not having too many technical hurdles to deal with.

Also, I don't think we'll finish all 5 by the end of the quarter, but let's see how Polish goes.

debt subscribed.

In sprint planning, we discussed the list and it looks good - @TJones will start with Polish.

@dcausse noted that SmartCN couldn't handle both traditional and simplified Chinese (so we'll have to see if that's still the case), and that there is another Polish analyzer using Morfologik (same framework used for Elastic's suggested Ukrainian) but it isn't mature enough to use.

I think this task is done, since it's just to establish the list, right?

Thanks for the list.
I consider this task done; more precise questions will have to be answered in research tasks like T154516.