Review Chinese Analyzers previously found and look for others. Then, we'll test the analyzers to see if they really are better.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Invalid | | None | T174065 [FY 2017-18 Objective] Improve support for searching in multiple languages
Open | | None | T154511 [Tracking] Research, test, and deploy new language analyzers
Resolved | | TJones | T158202 [Research spike, 4 hours] Research Chinese language analyzers
Resolved | | TJones | T158203 Test and analyze new Chinese language analyzers
Resolved | | TJones | T163829 Enable BM25 for Chinese wikis
Resolved | | None | T163832 Reindex Chinese wikis
Resolved | | TJones | T166722 Disable SmartCN for zh-hans
Event Timeline
Below are my notes on analyzers for Chinese.
My plan is to compare STConvert to ZhConversion.php for converting Traditional to Simplified characters, and then to compare SmartCN, IK, and MMSEG against each other after Simplified conversion using STConvert (either as-is or updated based on the ZhConversion.php comparison). All are available with ES5-compatible plugin wrappers. More details in T158203.
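The STConvert vs. ZhConversion.php comparison could be sketched roughly as below. This is only an illustration, not the method actually used in T158203. It assumes that t2s.properties holds one `key=value` pair per line and that the text handed to `parse_php_pairs` is a single Traditional-to-Simplified table from ZhConversion.php written as `'key' => 'value'` pairs. The function names and the sample mappings are hypothetical toy data.

```python
import re


def parse_properties(text):
    """Parse `key=value` lines, skipping blank lines and `#` comments.

    Assumption: the STConvert data file uses this simple layout.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        mapping[key.strip()] = value.strip()
    return mapping


# Matches PHP array entries such as:  '萬' => '万',
PHP_PAIR = re.compile(r"'([^']+)'\s*=>\s*'([^']*)'")


def parse_php_pairs(text):
    """Collect every 'key' => 'value' pair found in a PHP array body.

    Assumption: `text` contains only one conversion table.
    """
    return dict(PHP_PAIR.findall(text))


def compare_mappings(a, b):
    """Report keys unique to each mapping and keys whose values differ."""
    shared = a.keys() & b.keys()
    return {
        "only_a": sorted(a.keys() - b.keys()),
        "only_b": sorted(b.keys() - a.keys()),
        "conflicts": sorted(k for k in shared if a[k] != b[k]),
    }


if __name__ == "__main__":
    # Toy inputs for illustration only; not taken from either real file.
    stconvert = parse_properties("# sample\n萬=万\n乾=干\n與=与\n")
    mediawiki = parse_php_pairs("'萬' => '万',\n'乾' => '乾',\n'麼' => '么',")
    print(compare_mappings(stconvert, mediawiki))
```

Entries listed under `conflicts` or under only one side are the ones worth reviewing by hand before deciding whether to use STConvert as-is or to update it.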
Chinese
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html (v5.1.2)
https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-smartcn.html (v6.0.0.a)
Elastic-supported SmartCN, Simplified Chinese only
https://github.com/medcl/elasticsearch-analysis-ik (~1 week)
https://code.google.com/archive/p/ik-analyzer/ (5 years?)
IK, linked by Elasticsearch, updated to ES 5.1.2
supports customized dictionary (i.e., specific problems can probably be fixed)
Confirmed with Elasticsearch plugin developer (Medcl) that this is simplified only
https://github.com/medcl/elasticsearch-analysis-mmseg (~1 week)
https://github.com/chenlb/mmseg4j-solr (2 months)
MMSEG, linked by Elasticsearch, updated to ES 5.1.2
Confirmed with Elasticsearch plugin developer (Medcl) that this is simplified only
https://www.sitepoint.com/efficient-chinese-search-elasticsearch/ (2014)
“Efficient Chinese Search with Elasticsearch”
mentions that CJK (bigrams) isn’t terrible (should we look into it again?)
suggests converting Traditional to Simplified for searching.
mentions paoding (below)
https://code.google.com/archive/p/paoding/ (7 years?)
https://github.com/damienalexandre/elasticsearch-analysis-paoding (3 years)
old, not maintained, but reputed to have very good dictionaries (mentioned in blog above)
simplified only
https://github.com/medcl/elasticsearch-analysis-stconvert (7 days)
converts Traditional to Simplified or vice versa
is an Elasticsearch plugin, compatible with ES5
Data file (for comparison to ZhConversion.php):
https://github.com/medcl/elasticsearch-analysis-stconvert/blob/master/src/main/resources/t2s.properties
http://sighan.cs.uchicago.edu/bakeoff2005/ (12 years!)
"The complete training, testing, and gold-standard data sets, as well as the scoring script, are available"
Even though this is old, it seems like a good framework for evaluation. It also includes some scoring results (from 2005) as a baseline for performance (median looks to be 92%-95% for both recall and precision).
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
ICU Analysis plugin, "including better analysis of Asian languages"
https://github.com/wikimedia/mediawiki/blob/master/languages/data/ZhConversion.php (16 days ago)
MediaWiki traditional/simplified conversion module
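For the evaluation idea mentioned with the SIGHAN bakeoff data above, word-level precision and recall for a segmenter can be sketched as follows. This is a minimal illustration, not the bakeoff's own scoring script. It assumes both segmentations cover exactly the same underlying text, and the function names are hypothetical.

```python
def spans(tokens):
    """Turn a token list into a set of (start, end) character offsets."""
    result = set()
    pos = 0
    for tok in tokens:
        result.add((pos, pos + len(tok)))
        pos += len(tok)
    return result


def score(gold, predicted):
    """Return (precision, recall, f1) for a predicted segmentation.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold-standard segmentation.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("segmentations must cover the same text")
    g, p = spans(gold), spans(predicted)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["我", "喜欢", "北京"]
    predicted = ["我", "喜", "欢", "北京"]  # over-segments 喜欢
    print(score(gold, predicted))
```

In the usage example, two of the four predicted words match the gold standard, so precision is 2/4 and recall is 2/3. Running each analyzer's output through the same function would make SmartCN, IK, and MMSEG directly comparable.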
Traditional/Simplified Converters (useful links)
http://www.khngai.com/chinese/tools/convert.php
https://www.branah.com/traditional-to-simplified
https://www.branah.com/simplified-to-traditional