Multiple search terms are not enforced properly for Chinese
Closed, ResolvedPublic

Description

Here the search string I give is "逢甲", so why is it as if I merely
typed "甲"?

$ w3m -dump "http://taizhongbus.jidanni.org/index.php?search=逢甲&fulltext=搜索"
Problem 1: raw $1:
有關搜索中公的更多詳情,參見$1。 (roughly: "For more details about searching 中公, see $1." — the $1 placeholder appears unsubstituted)

  1. 大甲-龜殼村-海墘 (344字節)

Problem 2: it also matches on only one character of my two-character query:

  1. 大甲-海尾子 (685字節)
  2. 大甲-外埔-土城 (421字節)
  1. 大甲-龜殼村-海墘 (344字節)
  2. 大甲-豐原 (884字節)

The website is online, for you to test.


Version: 1.16.x
Severity: normal

bzimport added a project: MediaWiki-Search. Via Conduit, Nov 21 2014, 9:30 PM
bzimport set Reference to bz8445.
Jidanni created this task. Via Legacy, Dec 31 2006, 2:41 PM
brion added a comment. Via Conduit, Dec 23 2008, 2:12 AM

Ok, it looks like the splitting of characters (done to compensate for the lack of word spacing in Chinese text) is happening after the boolean search query is constructed, leading to failure:

The input:
'逢甲'

is translated to a boolean query for a single required word:
'+逢甲'

which then gets split up by character, then encoded to compensate for encoding bugs:
'+ U8e980a2 U8e794b2'

The '+' gets detached from the characters, so it has no effect, and the search backend returns results that contain either character instead of requiring both.

As a workaround, you can quote the multi-character string, which ends up encoding correctly for a phrase search:
'+" U8e980a2 U8e794b2"'
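
The failure mode and the workaround above can be sketched in a few lines of Python (an illustration only, not MediaWiki's actual PHP code; `segment_cjk` is a stand-in for the per-character splitting done by the Chinese `stripForSearch`):

```python
import re

def segment_cjk(text):
    """Insert spaces between Han characters, mimicking MediaWiki's
    compensation for the lack of word spacing in Chinese text."""
    spaced = re.sub(r'([\u4e00-\u9fff])', r' \1 ', text)
    return re.sub(r'\s+', ' ', spaced).strip()

# Buggy order: the boolean operator is attached first, then segmentation
# strands it, so '+' no longer binds either character.
buggy = segment_cjk('+' + '逢甲')          # '+ 逢 甲'

# Workaround order: quote the segmented characters so they stay grouped
# as a single required phrase.
fixed = '+"' + segment_cjk('逢甲') + '"'   # '+"逢 甲"'

print(buggy)
print(fixed)
```

With the buggy ordering the backend sees two free-standing one-character terms; with the quoted form it sees one required phrase, which is why the workaround returns correct results.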

Jidanni added a comment. Via Conduit, May 20 2009, 1:36 PM

OK, comparing

http://radioscanningtw.jidanni.org/index.php?search=學甲&ns0=1&title=特殊:搜尋&fulltext=Search
http://radioscanningtw.jidanni.org/index.php?search='學甲'&ns0=1&title=特殊:搜尋&fulltext=Search
http://radioscanningtw.jidanni.org/index.php?search="學甲"&ns0=1&title=特殊:搜尋&fulltext=Search

it is clear only the final form gives correct results.

Could you fellows glue the '+' that has fallen off back on, there
behind the scenes?

Wouldn't that be better than Asian sites' users thinking Search is broken, or MediaWiki
needing to add instructions telling Asian users to double-"quote" "every" Asian "string"
they want to search?

Jidanni added a comment. Via Conduit, May 27 2009, 8:50 PM

Alas, I see WMF doesn't use SpecialSearch.php anymore, but
these extensions instead,

$ w3m -dump http://zh.wikipedia.org/wiki/Special:Version | grep Search
MWSearch MWSearch plugin Brion Vibber and
OpenSearchXml OpenSearch XML interface Brion Vibber

So the best I can do for now is put a message in
MediaWiki:Searchresulttext: "If searching Chinese, try your search
again with quote marks, 逢甲 -> "逢甲" . Sorry".

brion added a comment. Via Conduit, May 27 2009, 9:12 PM

SpecialSearch.php provides the front-end UI, and is indeed used on Wikimedia sites.

MWSearch provides an alternate back-end. PostgreSQL users also have a different search back-end. Unsurprisingly, different back-ends have different properties and do not all share the same bugs.

Jidanni added a comment. Via Conduit, Jun 9 2009, 6:02 AM

Created attachment 6211
CJK quoter

How about this patch? It seems to work and maybe not break anything else.
All I'm trying to do is type those quote marks that Brion mentioned
for the user behind the scenes, instead of asking them up front to
type them in, via some embarrassing message. Otherwise what is the
logic of distributing a broken search without the least warning to the
user?

But as Wikipedia uses a better search, repairing this worse search
will be an uphill battle, as without being forced to eat your own
medicine, you won't have any impetus to improve it.

So mediawiki should distribute the good stuff it uses itself instead.

Anyway, note that I only patched zh-hans. This will not help the other
CJK languages that already have their own
languages/classes/Language*.php. Fortunately zh-tw doesn't, so it will
get this fix.

As far as patch quality, well, as it seems nobody cares much about
this old search function, just chuck it in, better than nothing.

All I know is it works for me here on MySQL Linux etc.

Attached: CJKsearchFix.txt

brion added a comment. Via Conduit, Jun 23 2009, 11:24 PM

The patch as written can result in double-quoting, causing searches to fail if quotes were used in the original search term. With no quotes in the input it seems OK... it should be possible to tweak it to not add double quotes.

Jidanni added a comment. Via Conduit, Jun 23 2009, 11:45 PM

OK, tomorrow I will make the patch first scan to see if the user has put
any double quote marks in their input, and not tamper with their input if so.

Glad to know this is the right place to fix this bug, so I needn't look deeper
under the hood.

Other CJK languages are welcome to make similar fixes, I'll just
concentrate on Zh here.
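
The proposed revision — scan for user-supplied quotes first, and only add quotes when none are present — amounts to something like this hypothetical sketch (Python for illustration; the real patch is PHP, and `quote_cjk_term` is an invented name):

```python
def quote_cjk_term(term):
    """Wrap a search term in double quotes for a phrase search, unless
    the user already quoted something themselves, in which case leave
    their input untouched to avoid double-quoting."""
    if '"' in term:
        return term
    return '"%s"' % term

print(quote_cjk_term('逢甲'))      # gets quoted
print(quote_cjk_term('"逢甲"'))    # left alone
```

This addresses brion's objection above: already-quoted input passes through unchanged, so the patch can no longer produce the broken `""逢甲""` form.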

brion added a comment. Via Conduit, Jun 24 2009, 2:28 AM

Implementation committed in r52338:

Big fixup for Chinese word breaks and variant conversions in the MySQL search backend...

  • removed redundant variant terms for Chinese, which forces all search indexing to canonical zh-hans
  • added parens to properly group variants for languages such as Serbian which do need them at search time
  • added quotes to properly group multi-word terms coming out of stripForSearch, as for Chinese where we segment up the characters. This is based on Language::hasWordBreaks() check.
  • also cleaned up LanguageZh_hans::stripForSearch() to just do segmentation and pass on the Unicode stripping to the base Language implementation, avoiding scary code duplication. Segmentation was already pulled up to LanguageZh, but was being run again at the second level. :P
  • made a fix to Chinese word segmentation to handle the case where a Han character is followed by a Latin char or numeral; a space is now added after as well. Spaces are then normalized for prettiness.
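
The quoting and segmentation behavior described in the last three bullets can be approximated as follows (a sketch under stated assumptions, not the r52338 source; the `has_word_breaks` flag stands in for `Language::hasWordBreaks()`):

```python
import re

HAN = r'[\u4e00-\u9fff]+'

def segment(text):
    """Space-delimit Han runs (so a following Latin char or numeral is
    separated too), then split each run into single characters."""
    spaced = re.sub('(%s)' % HAN, r' \1 ', text)
    spaced = re.sub(HAN, lambda m: ' '.join(m.group(0)), spaced)
    return re.sub(r'\s+', ' ', spaced).strip()   # normalize spaces

def search_term(term, has_word_breaks=False):
    """Build a required boolean term; if segmentation produced multiple
    tokens for a language without word breaks, quote them as a phrase."""
    stripped = segment(term)
    if not has_word_breaks and ' ' in stripped:
        stripped = '"%s"' % stripped
    return '+' + stripped

print(search_term('逢甲'))     # segmented characters kept together as a phrase
print(search_term('甲3路'))    # Han/numeral boundary also gets spaces
```

The key change relative to the original bug is that the `+` is attached *after* segmentation and quoting, so it binds the whole phrase rather than dangling in front of loose characters.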
bzimport added a comment. Via Conduit, Jun 24 2009, 5:29 AM

hippytrail wrote:

"Other CJK languages are welcome to make similar fixes, I'll just
concentrate on Zh here."

Not all CJK languages omit interword spaces and not all languages which omit interword spaces are CJK:

  • Korean does use spaces between words. Quite possibly a full-width space character rather than ASCII 0x20.
  • Thai and Khmer (Cambodian) do not use spaces between words.
  • Note that both Unicode and HTML include means of indicating invisible word breaks for such languages. Then again a quick Google seems to indicate that the HTML "WBR" tag is neither official nor interpreted to have the same semantics by everybody.

Another approach would be to harvest Han compounds from sources such as EDICT, CEDICT, and the various Wiktionaries. Google does morphological analysis to determine which strings of Han characters are compounds that should be treated as words.

Andrew Dunbar (hippietrail)
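
The compound-harvesting idea sketched above would feed a dictionary-based segmenter rather than per-character splitting. A toy greedy longest-match version (the `COMPOUNDS` set is a hypothetical stand-in for data harvested from EDICT/CEDICT):

```python
COMPOUNDS = {'逢甲', '大甲', '豐原'}   # hypothetical harvested compounds

def segment_with_dict(text, max_len=4):
    """Greedy longest-match segmentation: prefer the longest dictionary
    compound starting at each position, falling back to one character."""
    out, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if length == 1 or chunk in COMPOUNDS:
                out.append(chunk)
                i += length
                break
    return out

print(segment_with_dict('大甲豐原'))
print(segment_with_dict('逢甲路'))
```

Unlike the per-character approach the fix in r52338 uses, this keeps known compounds as single index terms, at the cost of needing and maintaining a dictionary.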

Jidanni added a comment. Via Conduit, Jun 26 2009, 2:39 PM

Glad Chinese is finally fixed. No need for any more "try Google instead"
in MediaWiki:Searchresulttext!

"Another approach would be to harvest Han compounds from sources such as EDICT,"

Well my wikis' compounds are all police department and bus stop names:
http://jidanni.org/comp/wiki/article-category.html .
