
Test and analyze new Chinese language analyzers
Closed, Resolved · Public

Description

After the research in T158202 has found some analyzers for Chinese that are potentially better, we will test them, and analyze to see if they are better or not. If they are, we will file a task to deploy one of them.

Related tasks:

  • deploy the plugins (T160948)
  • enable BM25 (which now works better with the new analysis in place—T163829)
  • reindex (T163832)

Event Timeline

TJones renamed this task from "Test and analyze new Polish language analyzers" to "Test and analyze new Chinese language analyzers". Feb 15 2017, 3:53 PM
TJones created this task.

Based on the research from T158202, my analysis & test plan is as follows:

All of the Chinese segmentation candidates I've found to date expect simplified Chinese characters as input. Chinese Wikipedia supports both traditional and simplified characters, and converts them at display time according to user preferences. There is an Elasticsearch plugin (STConvert) that converts traditional to simplified or vice versa. I suggest trying to set up an analysis chain using that, and segmenting and indexing everything as simplified.

There are a number of segmenters to consider: SmartCN, IK, and MMSEG are all available with up-to-date Elasticsearch plugin wrappers.
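As a rough sketch of what such a chain could look like, here are Elasticsearch index settings (wrapped in a small Python script) that put a T2S char filter in front of one of the candidate tokenizers (SmartCN here). The `smartcn_tokenizer` name comes from the SmartCN plugin; the `stconvert` char filter name and its `convert_type: t2s` parameter are my reading of the STConvert docs, so treat those as assumptions until verified against the installed plugins.

```
# Sketch only: assumes the STConvert plugin registers a char filter named
# "stconvert" that accepts convert_type "t2s", and that the SmartCN plugin
# provides "smartcn_tokenizer"; verify the names against the installed plugins.
import requests

settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "t2s_convert": {
                    "type": "stconvert",
                    "convert_type": "t2s"   # Traditional -> Simplified
                }
            },
            "analyzer": {
                "chinese_text": {
                    "type": "custom",
                    "char_filter": ["t2s_convert"],
                    "tokenizer": "smartcn_tokenizer"   # or an IK / MMSEG tokenizer
                }
            }
        }
    }
}

# Create a scratch index with the candidate analyzer and see what it does
# to a Traditional-character query.
requests.put("http://localhost:9200/zh_test", json=settings)
resp = requests.post(
    "http://localhost:9200/zh_test/_analyze",
    json={"analyzer": "chinese_text", "text": "歐洲冠軍聯賽決賽"},
)
print([t["token"] for t in resp.json()["tokens"]])
```

If the conversion and segmentation behave as hoped, the Traditional input above should come out as the same Simplified word tokens as its Simplified counterpart 欧洲冠军联赛决赛.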

So, my evaluation plan is to first compare the output of STConvert with MediaWiki's ZhConversion.php to make sure they do not have wildly differing results, and maybe offer some cross-pollination to bring them more in line with each other if that seems profitable.

I'll try to set up the SIGHAN analysis framework to evaluate the performance of the segmenters on that test set. If there is no clear-cut best segmenter, I'll take some text from Chinese Wikipedia, apply STConvert, segment the text with each of the contenders, and collect the instances where they differ for manual review by a Chinese speaker. This should allow us to focus on the differences found in a larger and more relevant corpus.
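For the "collect the instances where they differ" step, something along these lines could run the same (already T2S-converted) sentences through each candidate analyzer via the `_analyze` API and keep only the sentences where the segmenters disagree. The analyzer names and the input file are placeholders for whatever the test index ends up using.

```
# Sketch: collect sentences where candidate segmenters disagree, for manual review.
# "smartcn_chain", "ik_chain", and "mmseg_chain" are hypothetical analyzer names
# configured on the test index; adjust to the real config.
import requests

INDEX_URL = "http://localhost:9200/zh_test"
ANALYZERS = ["smartcn_chain", "ik_chain", "mmseg_chain"]

def tokens(analyzer, text):
    resp = requests.post(f"{INDEX_URL}/_analyze",
                         json={"analyzer": analyzer, "text": text})
    return [t["token"] for t in resp.json()["tokens"]]

def disagreements(sentences):
    """Yield (sentence, {analyzer: tokens}) wherever any two analyzers differ."""
    for sentence in sentences:
        segmentations = {a: tokens(a, sentence) for a in ANALYZERS}
        if len({tuple(seg) for seg in segmentations.values()}) > 1:
            yield sentence, segmentations

# "zhwiki_sample_sentences.txt" is a placeholder for the converted sample corpus.
with open("zhwiki_sample_sentences.txt", encoding="utf-8") as f:
    for sentence, segs in disagreements(line.strip() for line in f if line.strip()):
        print(sentence)
        for name, seg in segs.items():
            print(f"  {name}: {' | '.join(seg)}")
```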

I'll also review the frameworks and see how amenable each is to patching to solve specific segmentation problems. Being much more easily patched might be more valuable than 0.02% better accuracy, for example.

We'll also test the highlighting for cross-character type (traditional/simplified/mixed) queries.
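The highlighting check itself is just a matter of running the same query in Traditional, Simplified, and mixed forms with highlighting turned on and eyeballing the snippets; a minimal version (field name `text` assumed) might look like this:

```
# Sketch: run a query with highlighting enabled to spot-check that matches are
# highlighted regardless of whether the query/text is Traditional or Simplified.
import requests

for query_text in ["歐洲冠軍聯賽決賽", "欧洲冠军联赛决赛", "欧洲冠军联赛決賽"]:
    body = {
        "query": {"match": {"text": query_text}},
        "highlight": {"fields": {"text": {}}},
    }
    resp = requests.post("http://localhost:9200/zh_test/_search", json=body)
    for hit in resp.json()["hits"]["hits"][:3]:
        print(query_text, hit.get("highlight", {}).get("text"))
```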

In parallel, I'll try to talk @dcausse into reviewing the code to see if anything sticks out as particularly unmaintainable.

An Example
For reference in the discussion outline below, here's an example.

Right now, searching for traditional 歐洲冠軍聯賽決賽 ("UEFA Champions League Final") returns 82 results. Searching for simplified 欧洲冠军联赛决赛 gives 115 results. Searching for 欧洲冠军联赛决赛 OR 歐洲冠軍聯賽決賽 gives 178 results—so they have some overlapping results.

Searching for the mixed T/S query (the last two characters, meaning "finals", are traditional, the rest is simplified) 欧洲冠军联赛決賽 gives 9 results. Adding it to the big OR (欧洲冠军联赛决赛 OR 歐洲冠軍聯賽決賽 OR 欧洲冠军联赛決賽) gives 184 results, so 6 of the 9 mixed results are not included in the original 178. This is just one example that I know of. There are obviously other mixes of traditional and simplified characters that are possible for this query.

Draft Proposal
Once we have all the necessary tools available, we have to figure out how best to deploy them.

The current draft proposal, after discussing what's possible with the Elasticsearch inner workings with David & Erik, is:

  • Convert everything to simplified characters for indexing and use a segmenter to break the text into words, in both the text and plain fields. Do the same for normal queries at search time.
  • Index the text as-is in the source plain field, and use a unigram segmenter (see the sketch below).
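Roughly, that comes down to two analyzers wired to the two kinds of fields. The analyzer names below are hypothetical, `t2s_convert` and `smartcn_tokenizer` are the assumed names from the sketch above, and the `standard` tokenizer is one easy way to get unigram behavior, since it already splits Han text into single characters.

```
# Sketch: different analysis for the word-segmented fields vs. the as-entered
# (source/plain) field. Names are hypothetical; filters as assumed above.
analysis = {
    "analyzer": {
        # text/plain fields: convert to Simplified, then segment into words
        "chinese_text": {
            "type": "custom",
            "char_filter": ["t2s_convert"],
            "tokenizer": "smartcn_tokenizer",
        },
        # source plain field: keep the text as entered; the standard tokenizer
        # emits single-character (unigram) tokens for Han text
        "chinese_source_plain": {
            "type": "custom",
            "tokenizer": "standard",
        },
    }
}
```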

The working assumption here is that whether a typical searcher searches for simplified 年欧洲冠军联赛决赛, traditional 年歐洲冠軍聯賽決賽, or mixed 年欧洲冠军联赛決賽, they are looking for the words in the query, regardless of whether the characters are traditional or simplified. When they use quotes, they want those words as a phrase.

For advanced searchers or editors who want to find specific characters (e.g., to distinguish the simplified, traditional, and mixed examples above), the insource: keyword would provide that ability.

We will of course verify that this is a decent plan with the community after we figure out what's actually possible with the tools we have available.

The full write-up is available in my Notes pages. The Summary, Recommendation, & Plan are copied below.

Summary

  • Chinese Wikis support Simplified and Traditional input and display. Text is stored as it is input, and converted at display time (by a PHP module called ZhConversion.php).
  • All Chinese-specific Elasticsearch tokenizers/segmenters/analyzers I can find only work on Simplified text. The SIGHAN 2005 Word Segmentation Bakeoff had participants that segmented Traditional and Simplified text, but mixed texts were not tested.
  • STConvert is an Elasticsearch plugin that does T2S conversion. It agrees with ZhConversion.php about 99.9% of the time on a sample of 1,000 Chinese Wiki articles.
    • It's good that they agree! We wouldn't want conversions for search/indexing and display to frequently disagree; that would be very confusing for users.
    • Using a Traditional to Simplified (T2S) converter greatly improves the performance of several potential Elasticsearch Chinese tokenizers/segmenters on Traditional text. (Tested on SIGHAN data.)
    • I uncovered two small bugs in the STConvert rules. I've filed a bug report and implemented a char_filter patch as a workaround.
  • SmartCN + STConvert is the best tokenizer combination (on the SIGHAN data). It performs a bit better than everything else on Traditional text and much better on Simplified text.
    • Our historically poor opinion of SmartCN may have been at least partly caused by the fact that it only really works on Simplified characters; and so it would perform poorly on mixed Traditional/Simplified text.
    • There are significantly fewer word types (~16% fewer) with SmartCN + STConvert compared to the current prod config, indicating that more multi-character words are found. About 28% of tokens have new words they are indexed with (i.e., mostly Traditional and Simplified forms being indexed together).
    • Search with SmartCN + STConvert works as you would hope: Traditional and Simplified versions of the same text find each other, highlighting works regardless of underlying text type, and none of the myriad quotes (" " — ' ' — “ ” — ‘ ’ — 「 」 — 『 』 — 「 」) in the text affect results. (Regular quotes in the query are "special syntax" and do the usual phrase searching.)
    • SmartCN + STConvert doesn't tokenize some non-CJK Unicode characters as well as one would like. Adding the icu_normalizer as a pre-filter fixes many problems, but not all. The remaining issues I still see are with some uncommon Unicode characters: IPA and slash characters: ½ ℀ ℅. Searching for most works as you would expect (except for numerical fractions).
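Putting the last two bullets together, the recommended chain would run icu_normalizer (from the ICU plugin) as a char filter ahead of the T2S conversion, then segment with SmartCN; using the same assumed names as in the sketches above:

```
# Sketch: recommended analyzer shape, with icu_normalizer applied before the
# T2S conversion and SmartCN segmentation (filter/tokenizer names as assumed above).
analyzer = {
    "chinese_text": {
        "type": "custom",
        "char_filter": ["icu_normalizer", "t2s_convert"],
        "tokenizer": "smartcn_tokenizer",
    }
}
```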

Recommendation

  • Deploy SmartCN + STConvert to production for Chinese wikis after an opportunity for community review (and after the ES5 upgrade is complete).

Plan

  • Wait for RelForge to be downgraded to ES 5.1.2; when it is declared stable ("stable for RelForge"), re-index Chinese Wikipedia there, let people use it and give feedback, and test that everything works in ES5 as expected.
  • If/when a Vagrant update to ES5 is available, test there as well (or instead).
  • Update the plugin-conditional analyzer configuration to require two plugin dependencies (i.e., SmartCN and STConvert)—currently it seems it can only be conditioned on one.
  • After ES5 is deployed and everything else checks out, deploy SmartCN and STConvert to production, enable the new analysis config, and re-index the Chinese projects.

This was waiting for the upgrade to ES5 on RelForge, which is now done. The next step is to deploy the analyzer on RelForge so that it can be tested by the Chinese-language community.

@mpopov, @chelsyx: Trey would appreciate you reading this analysis and giving some comments, at some point. :-)

Thanks for the analysis!
This looks very promising, I will install the 2 plugins on relforge.
While we are working on this, would it make sense to try the new QueryBuilder we deployed for BM25 again? I believe that a better tokenizer might allow us to get rid of QueryString by default (and its auto_convert_to_phrase thing).
Would it make sense to create 2 relforge profiles: one with the old query_string and one with the new builder?

@chelsyx is taking a look at this from an analyst perspective. :)

@TJones, Thank you for the analysis! I read through your write-up and learned a lot! (I learned how messy my language is :P)
Just want to confirm that STConvert and ZhConversion are close enough. We can't do anything about the disagreements listed in this table. For example, "著" can be either traditional or simplified depending on its meaning, and mismatches like "馀 vs 餘 vs 余" and "钟 vs 锺 vs 鍾 vs 鐘" are messiness introduced when Chinese was simplified; some of the rules are still changing today.

@chelsyx , thanks for looking it over. Language is always messy—that's what makes it fun!

STConvert and ZhConversion are much closer than I feared. I was worried there'd be something like only 80% overlap, which would be a disaster. The 99.9% overlap isn't perfect—1 in 1000 characters being converted differently could cause problems in search, but it's much better than the current situation, where unconverted characters make up about 1/8 of the text whether you read Traditional or Simplified (based on test conversion rates). So, 1 in 8 conversion errors vs 1 in 1000 conversion errors, plus much more reliable segmentation, seems worth it—especially since there aren't any other good alternatives.

My intuition—which I admit is even more tenuous than usual here—is that some of the rarer characters that have disagreements are going to be even rarer in queries, which would make it less than 1 in 1000 for search.

Complications like 著 sometimes being traditional and sometimes being simplified are indeed messy. Both converters try to handle some of that difficulty by mapping multi-character strings, which can at least capture some compounds and common contexts, but not everything.

We can probably improve the situation for the clearer errors over time. The STConvert author responds to issues, so we could post some suggested fixes for particular points of disagreement and move STConvert closer to ZhConversion. Similarly, there is a wiki page somewhere (I can't find it at the moment) to add suggestions for fixes to ZhConversion—moving it closer to STConvert. (Also, ZhConversion is under active development, and my analysis is already out of date because a newer version exists.)

If you are up for a review, I can put together a fuller, more carefully constructed list of disagreements; you can figure out what the right answers are; and then you or I can post them to the relevant places to encourage improvements in both.

Also, I may be able to fix some of the disagreements by hacking the character filter used by Elasticsearch and hard-coding a few conversions into indexing and searching, thus faking corrections to STConvert.
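One plausible way to do that is a standard mapping char filter slotted into the char_filter chain next to the STConvert step, overriding specific conversions. The mapping below is an illustrative placeholder, not an actual proposed fix:

```
# Sketch: a "mapping" char filter to hard-code a few specific conversions,
# placed in the char_filter chain alongside the STConvert step.
# The entry below is an illustrative placeholder, not a real proposed fix.
char_filter = {
    "stconvert_fixes": {
        "type": "mapping",
        "mappings": [
            "某甲 => 某乙",   # placeholder: force a specific string to the desired form
        ],
    }
}
# e.g. "char_filter": ["icu_normalizer", "stconvert_fixes", "t2s_convert"]
```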

The demo site (index only—so results and snippets, but no articles) is up now: http://zh-wp-smartcn-relforge.wmflabs.org/w/index.php?title=Special:搜索

As an example—on Chinese Wikipedia right now 飓风 ("hurricane", Simplified) gets 1128 results; 颶風 ("hurricane", Traditional) gets 1835 results.

On the demo index, both 飓风 and 颶風 get 2355 results, though the ranking is different, because we still prefer exact matches. In each case, an article with the exact characters of the search terms (Simplified or Traditional) is the first result.

In some cases, this may give worse results, as it seems to do with 颶風 (Trad). There seems to be a redirect from 颶風 (Trad) to 飓风 (Simpl), which does not rank first when searching on 颶風 (Trad) with the new analyzer. However, in this case, at least, the Go Feature does the right thing.

I'm working on getting a message out to the Village Pump so we can get some feedback on how actual speakers of Chinese feel about it.

And the post to the Chinese Technical Village Pump has been made. Thanks to @chelsyx for translation assistance!

[Edit to update link]

The new one works much better in that it also knows how to split words (search “中国大陆电台干扰”, expect “中国无线电干扰”). Great work!

This is kind of on hold while we try to set up another instance of Chinese Wikipedia in Labs/RelForge. Because of some interaction with the update to ES5, the current RelForge instance uses BM25, which generally works worse with spaceless languages—though it seems to work okay! We want to set up a more prod-like version using TF/IDF in labs so we can isolate the effects of the new language analyzer. We can also then compare the TF/IDF vs BM25 difference between the two configs.

Change 350280 had a related patch set uploaded (by Tjones):
[mediawiki/extensions/CirrusSearch@master] Enable Chinese Analysis if SmartCN and STConvert are Installed

https://gerrit.wikimedia.org/r/350280

After review of the feedback from native speakers and discussion with the team, the plan is to deploy the new Chinese analysis chain (this ticket), deploy the plugins (T160948), enable BM25 (which now works better with the new analysis in place—T163829), and reindex (T163832).

Change 350280 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Enable Chinese Analysis if SmartCN and STConvert are Installed

https://gerrit.wikimedia.org/r/350280

This has been merged but is not yet in production, as there are separate tasks for re-indexing, etc.