This report applies to MW 1.35 and onward.
A SearchEngine gets the indexable fields for a document from a ContentHandler by calling ContentHandler::getFieldsForSearchIndex() and ContentHandler::getDataForSearchIndex(). (Though, as far as I know, CirrusSearch is the only search engine using this interface at the moment.)
SearchIndexFieldsHook and SearchDataForIndexHook are hooks that allow extensions to amend the fields provided by a ContentHandler to the SearchEngine. There appear to be two problems with how these hooks are implemented.
- SearchDataForIndexHook is run by getDataForSearchIndex(), but it is run too early. The base implementation of getDataForSearchIndex() runs the hook at the end (which is good), but subclass implementors are instructed the call the parent method, and they invariably choose to call the parent method first --- with the result that the hook is called somewhere in the middle. This means that the hook does not have access to all the output of the content handler, and that the content handler may stomp on the work performed by the hook.
- SearchIndexFieldsHook is not run by getFieldsForSearchIndex() at all, which means that the search engine will not necessarily get a coherent description of the field structure that accounts for the work of both the content handler and the extension(s).
(The one place that SearchIndexFieldsHook is executed is in SearchEngine::getSearchIndexFields(). CirrusSearch calls this in maintenance code to get a list of all possible fields when an admin initially sets up the Elasticsearch indexes.)
Why does this matter to me? Personally, I am concerned with (1), because I am trying to write an extension that adds more text to the file_text field via the SearchDataForIndexHook. But, since the hook runs too early, FileContentHandler obliviously stomps on the changes made to that field. Since my extension does not add any new fields, I am not concerned with (2) at the moment, but it is a related problem.
I have a patch to submit that should take care of (1) with minimal disruption to any existing code.
I would be happy to write a second patch to handle (2) in a similar way, but that would bear a bit more discussion first. It is a bigger change since ContentHandler::getFieldsForSearchIndex() does not currently invoke the SearchIndexFieldsHook at all.