
[Research spike, 4 hours] Research Chinese language analyzers
Closed, Resolved (Public)

Description

Review Chinese Analyzers previously found and look for others. Then, we'll test the analyzers to see if they really are better.

Event Timeline

Below are my notes on analyzers for Chinese.

My plan is to compare STConvert to ZhConversion.php for conversion of Traditional to Simplified characters, and then to compare SmartCN, IK, and MMSEG against each other after Simplified conversion using STConvert (as-is, or updated based on the ZhConversion.php comparison). All are available with ES5-compatible plugin wrappers. More details in T158203.

Chinese
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html (v5.1.2)
https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-smartcn.html (v6.0.0.a)
Elastic-supported SmartCN, Simplified Chinese only

https://github.com/medcl/elasticsearch-analysis-ik (~1 week)
https://code.google.com/archive/p/ik-analyzer/ (5 years?)
IK, linked by Elasticsearch, updated to ES 5.1.2
supports customized dictionary (i.e., specific problems can probably be fixed)
Confirmed with Elasticsearch plugin developer (Medcl) that this is simplified only

https://github.com/medcl/elasticsearch-analysis-mmseg (~1 week)
https://github.com/chenlb/mmseg4j-solr (2 months)
MMSEG, linked by Elasticsearch, updated to ES 5.1.2
Confirmed with Elasticsearch plugin developer (Medcl) that this is simplified only

https://www.sitepoint.com/efficient-chinese-search-elasticsearch/ (2014)
“Efficient Chinese Search with Elasticsearch”
mentions that CJK (bigrams) isn’t terrible (should we look into it again?)
suggests converting Traditional to Simplified for searching.
mentions paoding (below)
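
For reference, the bigram approach mentioned in that article can be illustrated with a minimal sketch. This is a simplification for discussion, not the actual Lucene/Elasticsearch CJK analyzer (which also handles non-CJK text, script boundaries, and so on); the function name `cjk_bigrams` is hypothetical.

```python
from typing import List


def cjk_bigrams(text: str) -> List[str]:
    """Return overlapping character bigrams for a run of CJK text.

    Simplified illustration only: whitespace is dropped and every
    remaining character is treated as one unit.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return []
    if len(chars) == 1:
        # A lone character has no neighbor, so emit it as a unigram.
        return [chars[0]]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


if __name__ == "__main__":
    print(cjk_bigrams("中文搜索"))
```

Because every adjacent pair is indexed, no dictionary is needed, at the cost of indexing some pairs that are not real words.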

https://code.google.com/archive/p/paoding/ (7 years?)
https://github.com/damienalexandre/elasticsearch-analysis-paoding (3 years)
old, not maintained, but reputed to have very good dictionaries (mentioned in blog above)
simplified only

https://github.com/medcl/elasticsearch-analysis-stconvert (7 days)
convert traditional to simplified or vice versa
is an elasticsearch plugin, compatible with ES5
Data file (for comparison to ZhConversion.php):
https://github.com/medcl/elasticsearch-analysis-stconvert/blob/master/src/main/resources/t2s.properties
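
A rough sketch of how the planned chain (STConvert first, then a segmenter such as SmartCN) might be expressed as index settings, written as a Python dict for readability. The names `stconvert`, `convert_type`, and `smartcn_tokenizer` are as I recall them from the plugins' documentation and should be checked against the installed plugin versions; `t2s_convert` and `zh_t2s_smartcn` are made-up labels.

```python
import json

# Assumption: the STConvert plugin exposes a char filter of type "stconvert"
# with a "convert_type" option, and SmartCN exposes "smartcn_tokenizer".
# Verify both against the plugin versions actually installed.
index_settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                # Hypothetical name for the Traditional-to-Simplified step.
                "t2s_convert": {"type": "stconvert", "convert_type": "t2s"}
            },
            "analyzer": {
                # Hypothetical analyzer name: convert first, then segment.
                "zh_t2s_smartcn": {
                    "type": "custom",
                    "char_filter": ["t2s_convert"],
                    "tokenizer": "smartcn_tokenizer",
                }
            },
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(index_settings, ensure_ascii=False, indent=2))
```

Swapping the tokenizer entry would be the natural way to test IK or MMSEG behind the same conversion step.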

http://sighan.cs.uchicago.edu/bakeoff2005/ (12 years!)
"The complete training, testing, and gold-standard data sets, as well as the scoring script, are available"
Even though this is old, it seems like a good framework for evaluation. It also includes some scoring results (from 2005) as a baseline for performance (median looks to be 92%-95% recall and precision).
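
As a sketch of what word-level recall and precision mean for segmentation, the following compares a predicted segmentation to a gold one by character spans. It is not the bakeoff's scoring script, just an illustration of the metric; the function names and the example sentence are made up.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        if end > start:  # skip empty tokens
            spans.add((start, end))
        start = end
    return spans


def score(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a predicted segmentation."""
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must segment the same text")
    gold_spans = to_spans(gold)
    pred_spans = to_spans(predicted)
    correct = len(gold_spans & pred_spans)  # words with identical boundaries
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["我", "喜欢", "北京"]
    predicted = ["我", "喜", "欢", "北京"]
    print(score(gold, predicted))
```

In this made-up example, two of the four predicted words match gold words exactly, so precision is 2/4 and recall is 2/3.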

https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
ICU Analysis plugin, "including better analysis of Asian languages"

https://github.com/wikimedia/mediawiki/blob/master/languages/data/ZhConversion.php (16 days ago)
MediaWiki traditional/simplified conversion module
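
The STConvert vs. ZhConversion.php comparison could be sketched roughly as below. This assumes the STConvert data file is plain `key=value` lines (backslash-u escapes are not handled) and that the ZhConversion.php table has already been extracted into a Python dict by some other means. The sample entries are made up for illustration and are not taken from either file.

```python
from typing import Dict, Tuple


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            mapping[key.strip()] = value.strip()
    return mapping


def compare_mappings(
    a: Dict[str, str], b: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, Tuple[str, str]]]:
    """Return (only in a, only in b, keys mapped differently)."""
    only_a = {k: a[k] for k in a.keys() - b.keys()}
    only_b = {k: b[k] for k in b.keys() - a.keys()}
    conflicts = {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}
    return only_a, only_b, conflicts


if __name__ == "__main__":
    # Made-up sample data, NOT the real contents of either file.
    stconvert = parse_properties("# sample\n學=学\n體=体\n乾=干\n")
    zhconversion = {"學": "学", "乾": "乾", "後": "后"}
    only_st, only_zh, conflicts = compare_mappings(stconvert, zhconversion)
    print("only in STConvert:", only_st)
    print("only in ZhConversion:", only_zh)
    print("conflicting:", conflicts)
```

The three result sets correspond to the cases that would need review: mappings missing from one side or the other, and characters the two sources convert differently.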

Traditional/Simplified Converters (useful links)
http://www.khngai.com/chinese/tools/convert.php

https://www.branah.com/traditional-to-simplified
https://www.branah.com/simplified-to-traditional

http://www.chinese-tools.com/tools/converter-tradsimp.html