
Search indexes limited to first 100k words (MAX_FIELD_LENGTH)
Closed, Declined (Public)

Description

Author: simon

Description:
A discussion at http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Web_scraping_tool_for_article_research_.28list_expansion.29 suggests that search indexes only the first 100k words of a page.

If so, anything past that point on a very long page is left out of the index and cannot be found by searching.

Is there any possibility this restriction - if it exists - could be lifted such that all of the text is indexed?


Version: unspecified
Severity: normal

Details

Reference
bz32871

Event Timeline

bzimport raised the priority of this task to Low. (Nov 22 2014, 12:05 AM)
bzimport set Reference to bz32871.
bzimport added a subscriber: Unknown Object (MLST).

rainman wrote:

http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search-2/src/org/wikimedia/lsearch/index/WikiIndexModifier.java?revision=63824&view=markup

static public final int MAX_FIELD_LENGTH = 100000;

It could be increased; however, I don't remember offhand what the issues with increasing this number were.
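
To illustrate the effect being reported, here is a minimal sketch. It assumes the constant caps how many tokens of a field get indexed, which is how the report describes it. This is a toy model, not the lsearchd or Lucene code: `TruncatedIndexSketch`, `indexTerms`, and `buildPage` are hypothetical names, and tokenizing on whitespace is a simplification.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model of a per-field token cap; NOT the lsearchd or Lucene implementation.
class TruncatedIndexSketch {
    // Same value as the constant quoted above.
    static final int MAX_FIELD_LENGTH = 100000;

    // Returns the distinct terms among the first maxTokens whitespace-separated tokens.
    // Tokens beyond the cap are silently dropped (assumed behavior).
    static Set<String> indexTerms(String text, int maxTokens) {
        Set<String> terms = new HashSet<>();
        int count = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // splitting an empty string yields one empty token
            }
            if (count >= maxTokens) {
                break; // cap reached: the rest of the page is never indexed
            }
            terms.add(token.toLowerCase(Locale.ROOT));
            count++;
        }
        return terms;
    }

    // Builds a page of fillerWords filler tokens followed by one distinctive last word.
    static String buildPage(int fillerWords, String lastWord) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fillerWords; i++) {
            sb.append("filler ");
        }
        sb.append(lastWord);
        return sb.toString();
    }

    public static void main(String[] args) {
        // 100000 filler tokens, then "needle" as token number 100001.
        String page = buildPage(MAX_FIELD_LENGTH, "needle");
        Set<String> capped = indexTerms(page, MAX_FIELD_LENGTH);
        Set<String> uncapped = indexTerms(page, Integer.MAX_VALUE);
        System.out.println("capped index contains needle: " + capped.contains("needle"));
        System.out.println("uncapped index contains needle: " + uncapped.contains("needle"));
    }
}
```

Under that assumption, a word that appears only after the cap is never added to the index, so a search for it cannot match the page. Raising the cap would presumably trade that for more indexing time and a larger index on very long pages.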

[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]

lsearchd has reached end of life and will not be improved further. Marking this WONTFIX as a result.

We don't have this limit in CirrusSearch.