Two ways to start:
- Start with languages we don't support well and really want to make big improvements on (e.g. spaceless languages)
- Start by testing analysers that we know to be very mature (e.g. there's a Polish analyser that @dcausse knows about and likes)
Things to consider:
- How much better the analyser is than what we've got
- Maintainability of the code of the analyser
- [add more!]
Languages/analysers to consider (from T155549):
- Polish: [[ https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis.html | Elastic says ]] theirs "provides high quality stemming for Polish", and it's probably easy. (T154516 / T154517)
- Chinese: we really need this, and we know of [[ https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis.html | SmartCN ]] and others to consider. (T158202 / T158203)
- Ukrainian: [[ https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis.html | Elastic has one ]], though it only "provides stemming for Ukrainian" (no "high quality" claim). We're currently using the Russian analyser for Ukrainian, which is better than nothing but not at all great. (T160105 / T160106)
- Hebrew: recently requested / suggested, and Elastic suggests [[ https://github.com/synhershko/HebMorph/wiki | HebMorph ]] as well.
- Japanese: we're using CJK analysis in production, which is just bigrams. Maybe [[ https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis.html | Elastic's Kuromoji ]] is better?
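
For anyone unfamiliar with what "just bigrams" means for Japanese, here is a rough Python sketch. It is a simplified illustration and not the production analysis chain: the function name `cjk_bigrams` is made up for this example, and real CJK analysis also deals with script boundaries, normalisation, and mixed Latin text.

```python
def cjk_bigrams(text: str) -> list[str]:
    """Split a run of CJK text into overlapping two-character tokens.

    Hypothetical helper for illustration only; whitespace is dropped and
    a lone character is emitted as a single-character token.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    # Slide a window of width 2 across the characters.
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


print(cjk_bigrams("東京都庁"))  # ['東京', '京都', '都庁']
```

Note the middle token: 京都 (Kyoto) shows up inside text about 東京都 (Tokyo Metropolis), so a search for Kyoto can match Tokyo pages. A morphological analyser is meant to avoid that kind of false match by segmenting on word boundaries, which is why Kuromoji is worth testing.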