
Don't let MySQL's stopword list prevent indexing of those words, as we want to search them
Open, LowestPublic

Description

Author: timwi

Description:
BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=681366&group_id=34373&atid=411192
Originally submitted by Nobody/Anonymous - nobody 2003-02-06 01:18

Stopwords in English can be valid nontrivial words in
other languages. Please allow searching for them! We
cannot search for "an", "he", "me" etc. on the Polish Wikipedia.
We also cannot search for "see also" and similar phrases, which
were inserted and (unfortunately) left untranslated
on many pages.

--Youandme

  • Additional comments ------------------------

Date: 2003-02-06 20:43
Sender: SF user vibber

When we upgrade mysql, I'll see if I can remove the stopword
list. (It's a compile-time thing.)

Date: 2003-02-06 20:44
Sender: SF user vibber

When we upgrade mysql, I'll see if we can remove the
stopword list. (It's a compiled-in thing, apparently.)


Version: unspecified
Severity: normal
URL: http://dev.mysql.com/doc/refman/5.6/en/fulltext-stopwords.html

Details

Reference
bz352

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 6:53 PM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz352.
bzimport added a subscriber: Unknown Object (MLST).

We dropped MySQL 3.x support with MediaWiki 1.6.

MySQL 4 and later still have a stopword list, though its behavior isn't as unpleasant as in previous versions.

It would be nice if we could reliably disable it per table or something...

Yes, please override it with our own customizable list for users without Lucene search.

Created attachment 7143
MaxSem's slow patch

Best I could come up with, but still pretty slow: maintenance/rebuildtextindex.php runs 30% slower with it. I tested several solutions (one of them can be seen in the patch, commented out), but none of them had satisfactory performance. I therefore don't dare to commit it to trunk. Leaving the patch here so that other folks can take a look at my approach.

*** Bug 25446 has been marked as a duplicate of this bug. ***

*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

See http://dev.mysql.com/doc/refman/5.1/en/fulltext-fine-tuning.html which says "To override the default stopword list, set the ft_stopword_file system variable. ... if you change the stopword file itself, you must rebuild your FULLTEXT indexes after making the changes and restarting the server. To rebuild the indexes in this case, it is sufficient to do a QUICK repair operation: REPAIR TABLE tbl_name QUICK;"

So, while you can't "reliably disable it per table", you *can* disable it without compiling by setting ft_stopword_file to "", restarting, and then rebuilding the table.
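For anyone who wants to script this, here is a minimal sketch of the steps above, not a tested deployment recipe. It only prints the option-file lines and the rebuild statement; it does not touch a server. The function name `stopword_disable_plan` is made up for illustration, and the table name `searchindex` (MediaWiki's MySQL search table, without any table prefix) plus the `[mysqld]` section placement are assumptions you should check against your own setup.

```python
def quote_identifier(name):
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def stopword_disable_plan(tables):
    """Return (option-file lines, SQL statements) for the workaround.

    Hypothetical helper: it only builds text, it does not connect to MySQL.
    """
    # Step 1: an empty ft_stopword_file disables stopword filtering.
    # The server has to be restarted for this to take effect.
    config_lines = ["[mysqld]", 'ft_stopword_file = ""']
    # Step 2: after the restart, rebuild the FULLTEXT indexes.
    statements = [
        "REPAIR TABLE {} QUICK;".format(quote_identifier(name))
        for name in tables
    ]
    return config_lines, statements


# Assumption: an unprefixed MediaWiki install using the searchindex table.
config_lines, statements = stopword_disable_plan(["searchindex"])
print("\n".join(config_lines))
print("\n".join(statements))
```

Note that the middle step (restarting the database server) is deliberately not automated here, since it needs access that many installations will not have.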

(In reply to comment #8)

So, while you can't "reliably disable it per table", you *can* disable it
without compiling by setting ft_stopword_file to "", restarting, and then
rebuilding the table.

A task for installer?

Just checking: in the times of Cirrus Search, are MySQL's stopwords in English causing any trouble to searches in non-English wikis?

No, nothing like this from the SQL search implementation affects Cirrus' implementation.

Restricted Application added a subscriber: Aklapper.

Given that this is not affecting Wikimedia sites (which the original report seemed to care about) and that non-Wikimedia sites using MySQL as a search engine can use T2352#29893 as probably the cleanest way to achieve this, I would decline this, unless someone is very, very interested or wants to volunteer for an installer patch (which may not work properly anyway, as it needs database server access, which not all installations have). Maybe it could be considered for a container-based distribution only?

MediaWiki is still a product for third parties, so this is a valid bug, even if no WMF engineer is going to work on this.

@MaxSem My comment is not "based on WMF needs". As a MySQL guy, I think this is almost impossible to fulfill (I do not think the installer can do it reliably), and that the right fix is to document the workaround as the proper way to do it. If it were in MediaWiki, I would expect it to work in all cases where the requirements are met, and the only way to do that is more of an infrastructure/deployment fix (unless MediaWiki starts to start and stop the database by itself).

Deskana subscribed.

Discovery-ARCHIVED only maintains CirrusSearch (i.e. Elasticsearch) backed search, so this is out of scope for us; removing the Discovery-Search tag.