Once the research in T158202 has identified some potentially better analyzers for Chinese, we will test and analyze them to see whether they really are better. If they are, we will file a task to deploy one of them.
Related patches and tasks:
Subject | Repo | Branch | Lines +/-
---|---|---|---
Enable Chinese Analysis if SmartCN and STConvert are Installed | mediawiki/extensions/CirrusSearch | master | +77 -13
Status | Subtype | Assigned | Task
---|---|---|---
Invalid | | None | T174065 [FY 2017-18 Objective] Improve support for searching in multiple languages
Open | | None | T154511 [Tracking] Research, test, and deploy new language analyzers
Resolved | | TJones | T158202 [Research spike, 4 hours] Research Chinese language analyzers
Resolved | | TJones | T158203 Test and analyze new Chinese language analyzers
Resolved | | TJones | T163829 Enable BM25 for Chinese wikis
Resolved | | None | T163832 Reindex Chinese wikis
Resolved | | TJones | T166722 Disable SmartCN for zh-hans
Based on the research from T158202, my analysis & test plan is as follows:
All of the Chinese segmentation candidates I've found to date expect simplified Chinese characters as input. Chinese Wikipedia supports both traditional and simplified characters, and converts them at display time according to user preferences. There is an Elasticsearch plugin (STConvert) that converts traditional to simplified or vice versa. I suggest trying to set up an analysis chain using it, and segmenting and indexing everything as simplified.
There are a number of segmenters to consider: SmartCN, IK, and MMSEG are all available with up-to-date Elasticsearch plugin wrappers.
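As a concrete starting point, here is a minimal sketch of what such a chain could look like, assuming a local Elasticsearch 5.x instance with the STConvert and SmartCN plugins installed; the index and filter names here are made up for illustration.

```python
import requests

ES = "http://localhost:9200"  # assumed local instance with STConvert + SmartCN installed

# A scratch index whose analyzer converts Traditional -> Simplified with the
# STConvert char filter before segmenting with the SmartCN tokenizer.
settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "t2s_convert": {"type": "stconvert", "convert_type": "t2s"}
            },
            "analyzer": {
                "zh_t2s_smartcn": {
                    "type": "custom",
                    "char_filter": ["t2s_convert"],
                    "tokenizer": "smartcn_tokenizer",
                }
            },
        }
    }
}
requests.put(f"{ES}/zh_analysis_test", json=settings).raise_for_status()

# Traditional and Simplified input should produce the same token stream if
# conversion and segmentation both behave as hoped.
for text in ["歐洲冠軍聯賽決賽", "欧洲冠军联赛决赛"]:
    resp = requests.get(
        f"{ES}/zh_analysis_test/_analyze",
        json={"analyzer": "zh_t2s_smartcn", "text": text},
    )
    print(text, "->", [t["token"] for t in resp.json()["tokens"]])
```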
So, my evaluation plan is to first compare the output of STConvert with MediaWiki's ZhConversion.php to make sure they do not have wildly differing results, and maybe offer some cross-pollination to bring them more in line with each other if that seems worthwhile.
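A sketch of that comparison, assuming the traditional-to-simplified table from ZhConversion.php has been exported to a local TSV file (the file name and format are hypothetical), and using the STConvert char filter through the _analyze API:

```python
import requests

ES = "http://localhost:9200"  # assumed local instance with STConvert installed

def stconvert_t2s(text):
    """Run text through the STConvert T->S char filter, kept as a single token."""
    resp = requests.get(
        f"{ES}/_analyze",
        json={
            "tokenizer": "keyword",
            "char_filter": [{"type": "stconvert", "convert_type": "t2s"}],
            "text": text,
        },
    )
    return resp.json()["tokens"][0]["token"]

# zhconversion_t2s.tsv (hypothetical): one "traditional<TAB>simplified" pair
# per line, exported from the zh2Hans table in MediaWiki's ZhConversion.php.
disagreements = []
with open("zhconversion_t2s.tsv", encoding="utf-8") as f:
    for line in f:
        trad, simp = line.rstrip("\n").split("\t")
        converted = stconvert_t2s(trad)
        if converted != simp:
            disagreements.append((trad, simp, converted))

print(f"{len(disagreements)} disagreements")
for trad, simp, converted in disagreements[:20]:
    print(f"{trad}: ZhConversion -> {simp}, STConvert -> {converted}")
```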
I'll try to set up the SIGHAN analysis framework to evaluate the performance of the segmenters on that test set. If there is no clear-cut best segmenter, I'll take some text from Chinese Wikipedia, apply STConvert, segment the text with each of the contenders, and collect the instances where they differ for manual review by a Chinese speaker. This should allow us to focus on the differences found in a larger and more relevant corpus.
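For the "collect the differences" step, something along these lines could work. The analyzer names are the ones the plugins usually register (smartcn, ik_smart, mmseg_maxword) and may need adjusting for the installed versions; the input and output file names are placeholders.

```python
import requests

ES = "http://localhost:9200"
CANDIDATES = ["smartcn", "ik_smart", "mmseg_maxword"]  # adjust to the installed plugins

def segment(analyzer, text):
    """Token stream for text under a given analyzer, via the _analyze API."""
    resp = requests.get(f"{ES}/_analyze", json={"analyzer": analyzer, "text": text})
    return tuple(t["token"] for t in resp.json()["tokens"])

# sentences.txt: one Simplified sentence per line (already run through
# STConvert), sampled from Chinese Wikipedia.
with open("sentences.txt", encoding="utf-8") as infile, \
     open("disagreements.tsv", "w", encoding="utf-8") as outfile:
    for line in infile:
        sentence = line.strip()
        if not sentence:
            continue
        segmentations = {a: segment(a, sentence) for a in CANDIDATES}
        # Keep only the sentences where at least two segmenters disagree.
        if len(set(segmentations.values())) > 1:
            row = [sentence] + [" ".join(segmentations[a]) for a in CANDIDATES]
            outfile.write("\t".join(row) + "\n")
```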
I'll also review the frameworks and see how amenable each is to patching to solve specific segmentation problems. Being much more easily patched might be more valuable than 0.02% better accuracy, for example.
We'll also test the highlighting for cross-character type (traditional/simplified/mixed) queries.
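One low-tech way to eyeball highlighting, using the public MediaWiki search API (result snippets carry the highlight markup) and the Champions League queries from the example below:

```python
import requests

API = "https://zh.wikipedia.org/w/api.php"

def snippets(query, limit=3):
    """Highlighted search snippets for a query, via the MediaWiki search API."""
    resp = requests.get(API, params={
        "action": "query", "list": "search", "srsearch": query,
        "srprop": "snippet", "srlimit": limit, "format": "json",
    })
    return [hit["snippet"] for hit in resp.json()["query"]["search"]]

# Simplified, Traditional, and mixed forms of the same query; the snippets'
# <span class="searchmatch"> markup shows what actually got highlighted.
for query in ["欧洲冠军联赛决赛", "歐洲冠軍聯賽決賽", "欧洲冠军联赛決賽"]:
    print(query)
    for snip in snippets(query):
        print("  ", snip)
```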
In parallel, I'll try to talk @dcausse into reviewing the code to see if anything sticks out as particularly unmaintainable.
An Example
For reference in the discussion outline below, here's an example.
Right now, searching for traditional 歐洲冠軍聯賽決賽 ("UEFA Champions League Final") returns 82 results. Searching for simplified 欧洲冠军联赛决赛 gives 115 results. Searching for 欧洲冠军联赛决赛 OR 歐洲冠軍聯賽決賽 gives 178 results—so they have some overlapping results.
Searching for the mixed T/S query (the last two characters, meaning "finals", are traditional, the rest is simplified) 欧洲冠军联赛決賽 gives 9 results. Adding it to the big OR (欧洲冠军联赛决赛 OR 歐洲冠軍聯賽決賽 OR 欧洲冠军联赛決賽) gives 184 results, so 6 of the 9 mixed results are not included in the original 178. This is just one example that I know of. There are obviously other mixes of traditional and simplified characters that are possible for this query.
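For anyone who wants to reproduce these counts (they will drift over time as the wiki changes), the totals are easy to pull from the MediaWiki search API:

```python
import requests

API = "https://zh.wikipedia.org/w/api.php"

def total_hits(query):
    """Total full-text hit count for a query, via the MediaWiki search API."""
    resp = requests.get(API, params={
        "action": "query", "list": "search", "srsearch": query,
        "srinfo": "totalhits", "srlimit": 1, "format": "json",
    })
    return resp.json()["query"]["searchinfo"]["totalhits"]

simplified = "欧洲冠军联赛决赛"
traditional = "歐洲冠軍聯賽決賽"
mixed = "欧洲冠军联赛決賽"

for query in (simplified, traditional, mixed,
              f"{simplified} OR {traditional}",
              f"{simplified} OR {traditional} OR {mixed}"):
    print(total_hits(query), query)
```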
Draft Proposal
Once we have all the necessary tools available, we have to figure out how best to deploy them.
The current draft proposal, after discussing with David & Erik what's possible given Elasticsearch's inner workings, is:
The working assumption here is that whether a typical searcher searches for simplified 年欧洲冠军联赛决赛, traditional 年歐洲冠軍聯賽決賽, or mixed 年欧洲冠军联赛決賽, they are looking for the words in the query, regardless of whether the characters are traditional or simplified. When they use quotes, they want those words as a phrase.
For advanced searchers or editors who want to find specific characters (e.g., to distinguish the simplified, traditional, and mixed examples above), the insource: keyword would provide that ability.
We will of course verify that this is a decent plan with the community after we figure out what's actually possible with the tools we have available.
The full write-up is available on my Notes pages; the Summary, Recommendation, and Plan are copied below.
Summary
Recommendation
Plan
This was waiting for the upgrade to ES5 on RelForge, which is now done. The next step is to deploy the analyzer on RelForge so that it can be tested by the Chinese-language community.
Thanks for the analysis!
This looks very promising; I will install the two plugins on RelForge.
While we are working on this, would it make sense to try the new QueryBuilder we deployed for BM25 again? I believe that a better tokenizer might allow us to get rid of QueryString by default (and its auto_convert_to_phrase behavior).
Would it make sense to create two RelForge profiles: one with the old query_string and one with the new builder?
@TJones, thank you for the analysis! I read through your write-up and learned a lot! (I learned how messy my language is :P)
Just want to confirm that STConvert and ZhConversion are close enough. We can't do much about the disagreements listed in this table. For example, "著" can be either traditional or simplified depending on its meaning; and mismatches like "馀 vs 餘 vs 余" and "钟 vs 锺 vs 鍾 vs 鐘" are messiness introduced when Chinese characters were simplified, and some of those rules are still changing today.
@chelsyx , thanks for looking it over. Language is always messy—that's what makes it fun!
STConvert and ZhConversion are much closer than I feared. I was worried there'd be something like only 80% overlap, which would be a disaster. The 99.9% overlap isn't perfect—1 in 1000 characters being converted differently could cause problems in search, but it's much better than the current situation, where unconverted characters make up about 1/8 of the text whether you read Traditional or Simplified (based on test conversion rates). So, 1 in 8 conversion errors vs 1 in 1000 conversion errors, plus much more reliable segmentation, seems worth it—especially since there aren't any other good alternatives.
My intuition—which I admit is even more tenuous than usual here—is that some of the rarer characters that have disagreements are going to be even rarer in queries, which would make it less than 1 in 1000 for search.
Complications like 著 sometimes being traditional and sometimes being simplified are indeed messy. Both converters try to handle some of that difficulty by mapping multi-character strings, which can at least capture some compounds and common contexts, but not everything.
We can probably improve the situation for the clearer errors over time. The STConvert author responds to issues, so we could post some suggested fixes for particular points of disagreement and move STConvert closer to ZhConversion. Similarly, there is a wiki page somewhere (I can't find it at the moment) to add suggestions for fixes to ZhConversion—moving it closer to STConvert. (Also, ZhConversion is under active development, and my analysis is already out of date because a newer version exists.)
If you are up for a review, I can put together a fuller, more carefully constructed list of disagreements, you can figure out what the right answers are, and you or I can post them to the relevant places to encourage improvements in both.
Also, I may be able to fix some of the disagreements by hacking the character filter used by Elasticsearch, and hard-coding a few conversions into indexing and searching, thus faking corrections to STConvert.
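A sketch of what that hack could look like (the index and filter names, and the single override, are illustrative rather than vetted corrections): a `mapping` char filter placed ahead of STConvert hard-codes the chosen conversions, so both indexing and search see the same corrected text.

```python
import requests

# Index settings sketch: a "mapping" char filter runs before STConvert and
# hard-codes a few overrides for known points of disagreement.
settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "zh_overrides": {
                    "type": "mapping",
                    # Illustrative only; the real list would come out of the
                    # reviewed STConvert/ZhConversion disagreements.
                    "mappings": ["馀 => 余"],
                },
                "t2s_convert": {"type": "stconvert", "convert_type": "t2s"},
            },
            "analyzer": {
                "zh_text": {
                    "type": "custom",
                    # Overrides first, then the general T->S conversion,
                    # then SmartCN segmentation.
                    "char_filter": ["zh_overrides", "t2s_convert"],
                    "tokenizer": "smartcn_tokenizer",
                }
            },
        }
    }
}
requests.put("http://localhost:9200/zh_override_test", json=settings).raise_for_status()
```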
The demo site (index only—so results and snippets, but no articles) is up now: http://zh-wp-smartcn-relforge.wmflabs.org/w/index.php?title=Special:搜索
As an example—on Chinese Wikipedia right now 飓风 ("hurricane", Simplified) gets 1128 results; 颶風 ("hurricane", Traditional) gets 1835 results.
On the demo index, both 飓风 and 颶風 get 2355 results, though the ranking is different, because we still prefer exact matches. In each case, an article with the exact characters of the search terms (Simplified or Traditional) is the first result.
In some cases, this may give worse results, as it seems to do with 颶風 (Trad). There seems to be a redirect from 颶風 (Trad) to 飓风 (Simpl), which does not rank first when searching on 颶風 (Trad) with the new analyzer. However, in this case, at least, the Go Feature does the right thing.
I'm working on getting a message out to the Village Pump so we can get some feedback on how actual speakers of Chinese feel about it.
And the post to the Chinese Technical Village Pump has been made. Thanks to @chelsyx for translation assistance!
[Edit to update link]
Based on comments on the Village Pump, this seems to cover T77967, too.
T77967: Language converter can't work on the results of Special:Search
The new one works much better in that it also knows how to split words (search "中国大陆电台干扰", expect "中国无线电干扰"). Great work!
This is kind of on hold while we try to set up another instance of Chinese Wikipedia in Labs/RelForge. Because of some interaction with the update to ES5, the current RelForge instance uses BM25, which generally works worse with spaceless languages—though it seems to work okay! We want to set up a more prod-like version using TF/IDF in labs so we can isolate the effects of the new language analyzer. We can also then compare the TF/IDF vs BM25 difference between the two configs.
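For the raw Elasticsearch side of that comparison (CirrusSearch drives this through its own configuration, so this is only the underlying knob), the two RelForge indexes would differ in the index-level default similarity: BM25 is the ES5 default, while "classic" is Lucene's TF/IDF. The endpoint and index names below are placeholders.

```python
import requests

ES = "http://relforge.example.org:9200"  # placeholder RelForge endpoint

# Two otherwise-identical test indexes, differing only in default similarity.
# "classic" is Lucene's TF/IDF implementation; BM25 is the ES 5.x default.
for name, sim_type in [("zhwiki_tfidf_test", "classic"), ("zhwiki_bm25_test", "BM25")]:
    settings = {
        "settings": {
            "index": {
                "similarity": {"default": {"type": sim_type}}
            }
        }
    }
    requests.put(f"{ES}/{name}", json=settings).raise_for_status()
```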
Change 350280 had a related patch set uploaded (by Tjones):
[mediawiki/extensions/CirrusSearch@master] Enable Chinese Analysis if SmartCN and STConvert are Installed
Change 350280 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Enable Chinese Analysis if SmartCN and STConvert are Installed
This has been merged but is not yet in production, as there are separate tasks for re-indexing, etc.