Search results highlight partial word matches
Closed, Resolved (Public)

Assigned To: None
Priority: Low
Author: bzimport
Subscribers: wikibugs-l
Projects: MediaWiki-Search
Reference: bz278
Description

Author: morbus

Description:
I'm having "issues" with searching that I'm not exactly sure how to solve, and all of these are evident at the specified URL. In
essence, if I search for the word "four", I get absolutely no results. The SQL in question is roughly: SELECT * from searchindex
where MATCH (si_text) AGAINST ('+four' IN BOOLEAN MODE); (this is for MySQL 4, naturally). But, if I turn around and do a
decidedly MySQL 3.x query: SELECT * from searchindex where si_text LIKE '%four%'; I get back the two entries I expect. This
seems to tell me that the searchindex table is "Ok". To double-check, I dumped the table, deleted it, recreated it, and reimported
the data (thus recreated the indexes). Same result.

The real goal here is to show all matches for the word "EC" - I don't want "suspect" to be matched, but I want "-20 EC." and
similar entries (EC is a date measurement). To let MySQL search for these smaller words, I've already modified the my.cnf and set
it at 2 characters, and then rebuilt my index (REPAIR searchindex QUICK). But, somewhere in the wiki code (at the very least in
the display settings), searches are being done as strings, and not on word boundaries. Is there any way to force a word boundary? To
make matters worse, searching for "ec" at http://gamegrene.com/wiki/ "works" (because of my edit to my.cnf) but matches on
"suspect". However, searching for "ur", which should match on "procedure", doesn't return any results (but "procedure" does, as
opposed to "four").

ARggGh!


Version: 1.3.x
Severity: minor
URL: http://gamegrene.com/wiki/Special:Search?search=ec&fulltext=Search

bzimport added a project: MediaWiki-Search. (Via Conduit, Nov 21 2014, 6:48 PM)
bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz278.
bzimport created this task. (Via Legacy, Sep 3 2004, 1:18 AM)

brion added a comment. (Via Conduit, Sep 3 2004, 2:43 AM)

These are limitations of MySQL's full text search engine. You need to adjust MySQL's stopword list (which ignores "four") and
minimum word length (which ignores "EC"). Please see:
http://dev.mysql.com/doc/mysql/en/Fulltext_Fine-tuning.html

bzimport added a comment. (Via Conduit, Sep 3 2004, 11:55 AM)

morbus wrote:

As mentioned in the initial report, I already have revised MySQL's fulltext index: "To let MySQL search for these smaller words, I've already modified
the my.cnf and set it at 2 characters, and then rebuilt my index (REPAIR searchindex QUICK)." - otherwise, I wouldn't get any results at all for EC,
which I am (as per the original report). As for "four", that I didn't know, and I'll correct that shortly.

bzimport added a comment. (Via Conduit, Sep 3 2004, 1:30 PM)

morbus wrote:

Just to reiterate more clearly:

  • I've lowered the full-text minimum word length to 2 letters.
  • I've rebuilt the table indexes with no success.
  • I've deleted, recreated, and reimported the searchindex table.
  • I want to search on word boundaries such that "EC" does not match "suspect".
  • When searching for "EC" at Gamegrene, we get five pages that I know match.
  • However, I don't know what exactly is matched. If MySQL MATCH() does word
    boundaries, then the MW display does string searching (as it always shows
    "suspect").
  • "ur" as in "procedure" shows no matches; "procedure" does.

brion added a comment. (Via Conduit, Sep 3 2004, 4:03 PM)

http://gamegrene.com/wiki/Special:Search?search=ec&fulltext=Search only returns pages which contain "EC" by itself.

Can you clarify what exactly your problem is?

bzimport added a comment. (Via Conduit, Sep 3 2004, 4:30 PM)

morbus wrote:

From three of my machines (different IPs, logged in or not), and another
person's machine entirely, we're NOT seeing "EC" by itself (word boundary).
We're seeing EC as a string. For instance, one of the returned results shows the
below, which is matching on "effect", "secret", and "ineffective".

Avazian Box (2331 bytes)

1: ...d quickly. This advancement came with the side effect of immense
greed. Many highly advanced magnetic ...
3: ...g new magnetic propulsion technologies, formed a secret team
intending to thwart the ongoing conflict.
5: ...which rendered all weapons of Avazian origin ineffective, and the
absorption of the magnetic field wou...

bzimport added a comment. (Via Conduit, Sep 3 2004, 4:52 PM)

bugzilla_wikipedia_org.to.jamesd wrote:

"ec" is matched in the middle of a word. Other two character sequences are
typically not matched in the middle of a word. The desired behavior is to match
ec only when it is a whole word, not in the middle of words.

brion added a comment. (Via Conduit, Sep 3 2004, 4:59 PM)

Can you explain what you mean by "match"? As far as I can tell, the search is *ONLY* returning pages in which "EC"
appears as a distinct word when asked to search for "EC". Nothing else. No other pages are returned.

So, is this about the *searching*?

Or, is it about the *highlighting* of text extracts in the search results display?

Can you please clarify?

bzimport added a comment. (Via Conduit, Sep 3 2004, 5:29 PM)

morbus wrote:

Brion - exactly, that's what I don't know (from a previous entry): "When
searching for "EC" at Gamegrene, we get five pages that I know match. If MySQL
MATCH() does word boundaries, then the MW display does string searching (as it
always shows "suspect")."

If MySQL MATCH() does do word boundaries, then yeah, I guess I'm reporting a bug
in the display code (specifically, showHit() in SearchEngine.php).

Thanks for the patience.

brion added a comment. (Via Conduit, Sep 3 2004, 5:34 PM)

Morbus, for general information on the fulltext search engine see
http://dev.mysql.com/doc/mysql/en/Fulltext_Boolean.html

Matches are on full words unless you use the * operator (eg, search for "apple*" finds "applet" and "applesauce" but search
for "apple" does not).

Changed summary and sample URL to reflect the problem.

bzimport added a comment. (Via Conduit, Sep 3 2004, 8:25 PM)

morbus wrote:

This has not been heavily tested yet, but the following revision
in SearchEngine.php:showHit() seems to do what I want:

$pat1 = "/(.*)(\b" . implode( "|", $this->mSearchterms ) . "\b)(.*)/i";

The generated pattern then becomes /(.*)(\bEC\b)(.*)/i or, in the case of
multiple searches /(.*)(\b20|EC\b)(.*)/i. This code is currently live
at the provided URL, so you can test as needed.

bzimport added a comment. (Via Conduit, Sep 3 2004, 8:28 PM)

morbus wrote:

Sorry - the correct revision is:

$pat1 = "/(.*)(\b" . implode( "\b|\b", $this->mSearchterms ) . "\b)(.*)/i";

which creates a pattern like /(.*)(\b20\b|\bEC\b)(.*)/i.
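
For readers comparing the two patterns above, here is a small Python sketch of the difference. It is not MediaWiki code (the actual change is in PHP, in SearchEngine.php:showHit()); the helper name `build_highlight_pattern` and the sample sentence are made up for illustration, and `re.escape` is an extra precaution that the patch above does not include. It assumes Python's `\b` behaves like PCRE's for these plain ASCII terms.

```python
import re

def build_highlight_pattern(terms, per_term_boundaries=True):
    # Hypothetical helper, for illustration only.
    # per_term_boundaries=True mirrors the corrected join (r"\b|\b");
    # False mirrors the first attempt, where the boundaries only wrap
    # the outer ends of the whole alternation.
    escaped = [re.escape(t) for t in terms]  # assumption: not in the original patch
    joiner = r"\b|\b" if per_term_boundaries else "|"
    return re.compile(r"(\b" + joiner.join(escaped) + r"\b)", re.IGNORECASE)

# Made-up sample text containing whole-word and partial-word occurrences.
sample = "The spec for 2004 mentions -20 EC."

first_try = build_highlight_pattern(["20", "EC"], per_term_boundaries=False)
corrected = build_highlight_pattern(["20", "EC"], per_term_boundaries=True)

# First attempt: "20" is only anchored on the left and "EC" only on the right,
# so "2004" and "spec" are partially matched.
print(first_try.findall(sample))   # ['ec', '20', '20', 'EC']

# Corrected pattern: every term is anchored on both sides.
print(corrected.findall(sample))   # ['20', 'EC']
```

The first attempt goes wrong because alternation binds more loosely than `\b`, so each boundary attaches only to the term next to it. Joining with `\b|\b` gives every term its own pair of boundaries.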

brion added a comment. (Via Conduit, Oct 1 2007, 1:11 PM)

Fixed in r26269 for mainline, r26271 for lucenesearch extension.
