Page MenuHomePhabricator

SearchDataForIndexHook and SearchIndexFieldsHook run too early, or not at all
Closed, DeclinedPublicBUG REPORT

Description

This report applies to MW 1.35 and onward.

A SearchEngine gets the indexable fields for a document from a ContentHandler by calling ContentHandler::getFieldsForSearchIndex() and ContentHandler::getDataForSearchIndex(). (Though, as far as I know, CirrusSearch is the only search engine using this interface at the moment.)
SearchIndexFieldsHook and SearchDataForIndexHook are hooks that allow extensions to amend the fields provided by a ContentHandler to the SearchEngine. There appear to be two problems with how these hooks are implemented.

  1. SearchDataForIndexHook is run by getDataForSearchIndex(), but it is run too early. The base implementation of getDataForSearchIndex() runs the hook at the end (which is good), but subclass implementors are instructed the call the parent method, and they invariably choose to call the parent method first --- with the result that the hook is called somewhere in the middle. This means that the hook does not have access to all the output of the content handler, and that the content handler may stomp on the work performed by the hook.
  1. SearchIndexFieldsHook is not run by getFieldsForSearchIndex() at all, which means that the search engine will not necessarily get a coherent description of the field structure that accounts for the work of both the content handler and the extension(s).

(The one place that SearchIndexFieldsHook is executed is in SearchEngine::getSearchIndexFields(). CirrusSearch calls this in maintenance code to get a list of all possible fields when an admin initially sets up the Elasticsearch indexes.)

Why does this matter to me? Personally, I am concerned with (1), because I am trying to write an extension that adds more text to the file_text field via the SearchDataForIndexHook. But, since the hook runs too early, FileContentHandler obliviously stomps on the changes made to that field. Since my extension does not add any new fields, I am not concerned with (2) at the moment, but it is a related problem.

I have a patch to submit that should take care of (1) with minimal disruption to any existing code.

I would be happy to write a second patch to handle (2) in a similar way, but that would bear a bit more discussion first. It is a bigger change since ContentHandler::getFieldsForSearchIndex() does not currently invoke the SearchIndexFieldsHook at all.

Event Timeline

Change 734801 had a related patch set uploaded (by Maddog; author: Maddog):

[mediawiki/core@master] content: Ensure SearchDataForIndexHook runs at end of getDataForSearchIndex()

https://gerrit.wikimedia.org/r/734801

Thanks for creating this issue and sending a patch!

I have few questions that might help clarify why you want to achieve this:

I am trying to write an extension that adds more text to the file_text field via the SearchDataForIndexHook

Could you describe more the use-case you want to achieve? Adding data to the file_text field sounds like an implementation detail to achieve some higher level goals and I think it would helpful to describe them.

SearchIndexFieldsHook is not run by getFieldsForSearchIndex() at all, which means that the search engine will not necessarily get a coherent description of the field structure that accounts for the work of both the content handler and the extension(s).

I'm not sure I understand the problem here? Could you describe more what is the problem that it causes? Since there are multiple content handlers the SearchEngine calls getFieldsForSearchIndex on each of those then calls the hook for extensions to add their own fields and then merge all this, calling the hook for every possible ContentHandler would mean it's called multiple times: once per ContentHandler.

One concern related to your approach:

  • I don't think that the ability to alter the data produced by another component was something we wanted explicitly in the first place and this is what this approach is now making explicit. The original intention of these two hooks was to add new fields not alter existing ones even though I agree that the hook signature makes it possible to do so. Altering others field I think falls in the use at your own risk disclaimer.

...

I have few questions that might help clarify why you want to achieve this:

Addressing the questions in semi-reversed order:

One concern related to your approach:

  • I don't think that the ability to alter the data produced by another component was something we wanted explicitly in the first place and this is what this approach is now making explicit. The original intention of these two hooks was to add new fields not alter existing ones even though I agree that the hook signature makes it possible to do so. Altering others field I think falls in the use at your own risk disclaimer.

"Adding a new field" is not so different from "altering the data of an existing field" or "altering the definition of an existing field". They are all in the category of "alter the field structure created by the content-handler".

Here is how I think of it:

Let's say that the content handler registered for a page is RegisteredContentHandler. RegisteredContentHandler gets to define any field structure it wants and populate those fields any way it wants. If RegisteredContentHandler is derived from ParentContentHandler, then it makes sense for RegisteredContentHandler::getXXXXForSearchIndex() (where XXXX is Fields or Data) to call parent::getXXXXForSearchIndex(), if it calls it at all, before it does any of its own work. RegisteredContentHandler may only want some of the parent's fields, or it may want to change some fields' types or data. The two important points are:

  1. ParentContentHandler is not responsible for understanding or accommodating the behaviors of all its child classes (may they live long and prosper).
  2. RegisteredContentHandler should have the last word on what it emits from getXXXXForSearchIndex().

Now along comes SearchDataForIndexHook. The purpose of this hook is to allow an extension to modify the results of getDataForSearchIndex() for any kind of content-handler. It doesn't really matter if the modification is adding a (hopefully) new field or changing data already produced by the content-handler. The two important points are:

  1. Content-handlers are not responsible for understanding or accommodating the behaviors of hooks implemented by random extensions, both present and future.
  2. The extension(s) implementing SearchDataForIndexHook is(are) responsible for what finally comes out of getDataForSearchIndex() (and they are responsible for playing nicely with each other).

Thus, it makes sense for the hook to be invoked after a RegisteredContentHandler has done whatever it wants to do, because it is the responsibility of the hook to accommodate the behavior of the content-handler, not the other way around.

The API should facilitate that, but as it stands now, the hook is executed at some ambiguous point typically in the middle of a content-handler's processing.

I am trying to write an extension that adds more text to the file_text field via the SearchDataForIndexHook

Could you describe more the use-case you want to achieve? Adding data to the file_text field sounds like an implementation detail to achieve some higher level goals and I think it would helpful to describe them.

In my specific use-case, I want to be able to do a full-text search over all the files uploaded to a wiki (e.g., .doc files, .odt, .txt, .xml, whatever, etc). To accomplish this, I am creating an extension that will query a Tika server to extract text from files --- by default, this extension will send any file to Tika. (Tika's purpose in life is to figure out how to extract text from files, including doing OCR --- so it will even try to read text printed in bitmap images, which is awesome, because it can extract useable text from, e.g., image-only PDF's that have no embedded text.) The questions before me are (1) when to query Tika, and (2) how to enable searching over the extracted text.

FileContentHandler will already provide a file's extracted text to search engines in a file_text field, if the file has a MediaHandler which knows how to extract text (i.e., with a functional implementation of MediaHandler::getPageText()). CirrusSearch is already set up to treat the file_text field as extracted text and utilize it in full-text searches. Presumably any future alternate search engine implementations would also treat a file_text field the same way. My extension will be providing extracted text from files, matching the existing semantics of file_text, so it makes sense to just use that field, and to use the SearchDataForIndexHook to do so. And then, it makes sense to have the hook perform the Tika query to get the text. So, my two questions are answered.

But... the hook gets run before FileContentHandler sets the value of the file_text field, and FileContentHandler doesn't pay any attention (rightfully so) to any pre-existing value in a field that belongs to FileContentHandler. So, if a file's MediaHandler::getPageText() returns false (which is the case for every file that my extension really cares about, i.e., every file that is not a PDF with embedded text), then FileContentHandler tells the search engine, "The file_text is false.", end of story.

If SearchDataForIndexHook were consistently executed after a content-handler did its work, then my extension would just append the Tika-extracted text to whatever was already produced by FileContentHandler (since it is the extension's responsibility to cooperate with the content-handler) and the job would be done.

SearchIndexFieldsHook is not run by getFieldsForSearchIndex() at all, which means that the search engine will not necessarily get a coherent description of the field structure that accounts for the work of both the content handler and the extension(s).

I'm not sure I understand the problem here? Could you describe more what is the problem that it causes?

The problem here involves:

  1. How getFieldsForSearchIndex() and getDataForSearchIndex() are used.
  2. The asymmetry between these two get methods and the two hooks.

All I know about (1) comes from the only call to getDataForSearchIndex() that I can find, which is in CirrusSearch\BuildDocument\ParserOutputPageProperties::finalizeReal:

...
    $fieldDefinitions = $contentHandler->getFieldsForSearchIndex( $engine );
    $fieldContent = $contentHandler->getDataForSearchIndex( $page, $output, $engine );
    ...
    foreach ( $fieldContent as $field => $fieldData ) {
        $doc->set( $field, $fieldData );
        if ( isset( $fieldDefinitions[$field] ) ) {
            $hints = $fieldDefinitions[$field]->getEngineHints( $engine );
            CirrusIndexField::addIndexingHints( $doc, $field, $hints );
        }
    }
...

Here we see getFields.. is called to get the field definitions, getData.. is called to get the data for each field, and then finalizeReal() tries to use information from the definitions to do something extra for each data-bearing field.

The intent for using the two methods appears to be to use them as a pair, to get both the data and definitions for a set of fields and do whatever needs to be done with them.

What happens if an extension adds a new field? The extension can use the SearchDataForIndexHook to pass the data through getDataForSearchIndex(). But... what about the definition for the new field? The caller of getFields.. (e.g., finalizeReal()) will never see that. That seems to be a bug.

It seems to me that the SearchIndexFieldsHook should instead be a SearchFieldsForIndexHook, called by getFieldsForSearchIndex() with an additional $contentHandler parameter. An extension wanting to add (or etc) a new field can then implement both hooks and that work will be self-consistently appear in the results of getFields.. and getData...

Since there are multiple content handlers the SearchEngine calls getFieldsForSearchIndex on each of those then calls the hook for extensions to add their own fields and then merge all this, calling the hook for every possible ContentHandler would mean it's called multiple times: once per ContentHandler.

There is no requirement that content-handlers must define exclusive sets of fields, right? E.g., WikitextContentHandler and CodeContentHandler both return the same fields that they get from their common parent TextContentHandler. And thus, SearchEngine::getSearchIndexField() already does something about merging same-named field reported by multiple content-handlers.

So, yes, the revised SearchFieldsForIndexHook would not be run directly by SearchEngine::getSearchIndexField(); it would be run implicitly by each content-handler (and may even return different results per content-handler). The extension(s)' fields will be included in each content-handler's fields --- and SearchEngine will merge them just as it already does. (And, of course, the fields-hook should be run at the end of a content-handler's getFieldsForSearchIndex()!)

Why I commented that dealing with SearchIndexFieldsHook is more involved than SearchDataForIndexHook:

  • I think the data-hook patch I wrote could be safely applied to MW 1.35+ without adverse effects --- it doesn't really change the API so much as clarifies it (resolving an ambiguity in the contract to correct it, without changing any behavior that any existing code relies on).
  • Fixing the fields-hook issue properly would involve a big enough change to the API (both the hook name and the new parameter) that it could involve deprecation, etc.

...

[...]
"Adding a new field" is not so different from "altering the data of an existing field" or "altering the definition of an existing field". They are all in the category of "alter the field structure created by the content-handler".

I think it is different as it breaks isolation between the components and makes it explicitly order dependent. You are no longer responsible for your own fields but also possibly responsible for all the fields declared before you.

[...]

Now along comes SearchDataForIndexHook. The purpose of this hook is to allow an extension to modify the results of getDataForSearchIndex() for any kind of content-handler. It doesn't really matter if the modification is adding a (hopefully) new field or changing data already produced by the content-handler. The two important points are:

  1. Content-handlers are not responsible for understanding or accommodating the behaviors of hooks implemented by random extensions, both present and future.
  2. The extension(s) implementing SearchDataForIndexHook is(are) responsible for what finally comes out of getDataForSearchIndex() (and they are responsible for playing nicely with each other).

Thus, it makes sense for the hook to be invoked after a RegisteredContentHandler has done whatever it wants to do, because it is the responsibility of the hook to accommodate the behavior of the content-handler, not the other way around.

The API should facilitate that, but as it stands now, the hook is executed at some ambiguous point typically in the middle of a content-handler's processing.

I think the ambiguity comes from the fact that the SearchDataForIndexHook is called from the ContentHandler and gives the impression that there is a strong coupling between the two but I don't think it was meant to be so. SearchDataForIndexHook is just a means to gather additional fields an extension might want to add.

I am trying to write an extension that adds more text to the file_text field via the SearchDataForIndexHook

Could you describe more the use-case you want to achieve? Adding data to the file_text field sounds like an implementation detail to achieve some higher level goals and I think it would helpful to describe them.

In my specific use-case, I want to be able to do a full-text search over all the files uploaded to a wiki (e.g., .doc files, .odt, .txt, .xml, whatever, etc). To accomplish this, I am creating an extension that will query a Tika server to extract text from files --- by default, this extension will send any file to Tika. (Tika's purpose in life is to figure out how to extract text from files, including doing OCR --- so it will even try to read text printed in bitmap images, which is awesome, because it can extract useable text from, e.g., image-only PDF's that have no embedded text.) The questions before me are (1) when to query Tika, and (2) how to enable searching over the extracted text.

Thanks for explaining this, I hope that you can open-source this extension as it is a highly demanded feature!

FileContentHandler will already provide a file's extracted text to search engines in a file_text field, if the file has a MediaHandler which knows how to extract text (i.e., with a functional implementation of MediaHandler::getPageText()). CirrusSearch is already set up to treat the file_text field as extracted text and utilize it in full-text searches. Presumably any future alternate search engine implementations would also treat a file_text field the same way. My extension will be providing extracted text from files, matching the existing semantics of file_text, so it makes sense to just use that field, and to use the SearchDataForIndexHook to do so. And then, it makes sense to have the hook perform the Tika query to get the text. So, my two questions are answered.

I have few questions to make sure we don't commit to the wrong solution:

  • Have you considered implementing a MediaHandler wrapper so that MediaHandler::getPageText() would return directly the content returned by Tika? At a glance it seems feasible, a TikaMediaHandler could wrap existing handlers and declare itself as willing to support all the mime-types supported by the tika installation? The Mediawiki Service has nice tools to manipulate existing services and MediaHandlerFactory could be wrapped easily using addServiceManipulator().

SearchIndexFieldsHook is not run by getFieldsForSearchIndex() at all, which means that the search engine will not necessarily get a coherent description of the field structure that accounts for the work of both the content handler and the extension(s).

I'm not sure I understand the problem here? Could you describe more what is the problem that it causes?

[...]
The intent for using the two methods appears to be to use them as a pair, to get both the data and definitions for a set of fields and do whatever needs to be done with them.

I think this is the heart of the confusion, they do not go as pairs as I don't think the original intention of these hooks was to let extensions tune ContentHandler behaviors.

Thanks for taking the time to explain your approach!

Again, in semi-scrambled order...

Thanks for taking the time to explain your approach!

You're welcome. I know MW is a huge project with a lot of moving parts, and any changes to the core need to be approached with healthy skepticism, so I might as well try to be as clear and complete as I can.

...

In my specific use-case, I want to be able to do a full-text search over all the files uploaded to a wiki (e.g., .doc files, .odt, .txt, .xml, whatever, etc). To accomplish this, I am creating an extension that will query a Tika server to extract text from files --- by default, this extension will send any file to Tika. (Tika's purpose in life is to figure out how to extract text from files, including doing OCR --- so it will even try to read text printed in bitmap images, which is awesome, because it can extract useable text from, e.g., image-only PDF's that have no embedded text.) The questions before me are (1) when to query Tika, and (2) how to enable searching over the extracted text.

Thanks for explaining this, I hope that you can open-source this extension as it is a highly demanded feature!

Yes and yes! That is definitely the plan. It is incredibly satisfying to see all kinds of new documents suddenly start appearing in wiki searches; I look forward to sharing that joy.

I have few questions to make sure we don't commit to the wrong solution:

  • Have you considered implementing a MediaHandler wrapper so that MediaHandler::getPageText() would return directly the content returned by Tika? At a glance it seems feasible, a TikaMediaHandler could wrap existing handlers and declare itself as willing to support all the mime-types supported by the tika installation? The Mediawiki Service has nice tools to manipulate existing services and MediaHandlerFactory could be wrapped easily using addServiceManipulator().

Implementing a MediaHandler wrapper was, in fact, my original approach to this task, before I stumbled upon SearchDataForIndexHook. (I had started out by looking at what PdfHandler does.) This approach is possible, but it turns really hacky really fast:

  • Define a wrapper TikaMediaHandler which:
    • implements a getPageText() method which queries the Tika server;
      • ...and hope that no one besides a search engine ever decides to call it, because the Tika query will be relatively expensive --- e.g., something to avoid doing more than once per file.
    • implements all the other public MediaHandler methods as forwarding to another handler instance (i.e., the "wrap existing handlers" part).
      • Is there some magical way to accomplish this in PHP, aside from writing out each individual method, that I don't know about? Maintaining wrapper classes does not seem trivial.
  • Define a wrapper TikaMediaHandlerFactory which:
    • is constructed with an instance of a pre-existing MediaHandlerFactory;
    • when queried for a handler for some mimetype:
      • tries to get a handler from the pre-existing factory;
      • returns a TikaMediaHandler, which possibly wraps the pre-existing-factory's handler.
  • Hook up a service-manipulator which replaces the existing MediaHandlerFactory with a TikaMediaHandlerFactory
    • ...and hope that this interacts correctly with any other extensions which are adding service-manipulators poking at MediaHandlerFactory.

(If you are really that concerned about things not being used for their "original intentions", all that wrapping/monkey-patching should be making you nauseous.)

By comparison, using the SearchDataForIndexHook approach:

  • Define an onSearchDataForIndexHook method which:
    • Queries Tika and puts the result in the file_text field.
  • Register the hook in extension.json.

The hook approach is a lot less code, and much more robust code.

But, the hook approach requires that the hook reliably runs after the ContentHandler, at the end of getDataForSearchFields()....

The API should facilitate that [late-running of the hook], but as it stands now, the hook is executed at some ambiguous point typically in the middle of a content-handler's processing.

I think the ambiguity comes from the fact that the SearchDataForIndexHook is called from the ContentHandler and gives the impression that there is a strong coupling between the two but I don't think it was meant to be so. SearchDataForIndexHook is just a means to gather additional fields an extension might want to add.
...
... I don't think the original intention of these hooks was to let extensions tune ContentHandler behaviors.

It's not about tuning content-handler behaviors per se; it's about injecting changes into the structured data fed to the search engine, without having to write "wrapper content-handlers" just to override two methods out of many just because those two methods happen to be in ContentHandler.

Regardless of the original intentions and use-cases (which are satisfied by the current interface), tweaking the interface such that the hook reliably runs after the get***ForSearchIndex() methods, as I have proposed, will:

  • not break any existing code;
  • reduce the coupling between the hook and the content-handlers (since the hook becomes clearly the last stage in producing data-for-search-indexing);
  • make subclassing ContentHandler less fragile (since subclasses will no longer be responsible for ensuring that the hook is run);
  • make the hook more useful (e.g., enable elegant implementations of cool new extensions, like the one I am writing).

As to the other problem with the current API:

SearchIndexFieldsHook is not run by getFieldsForSearchIndex() at all, which means that the search engine will not necessarily get a coherent description of the field structure that accounts for the work of both the content handler and the extension(s).

I'm not sure I understand the problem here? Could you describe more what is the problem that it causes?

[...]
The intent for using the two methods appears to be to use them as a pair, to get both the data and definitions for a set of fields and do whatever needs to be done with them.

I think this is the heart of the confusion, they do not go as pairs as I don't think the original intention of these hooks was to let extensions tune ContentHandler behaviors.

The original intentions for the hooks don't preclude the existence of the current bug in the get methods:

  • getDataForSearchIndex() and getFieldsForSearchIndex() are intended to be used as a pair.
  • These two methods are intended to return two aspects of the same set of fields: getData..() returns the values for a set of fields, and getFields..() returns the definitions for those same fields.
  • In the presence of SearchDataForIndexHook, the two methods can become mismatched: the set of fields returned by getData..() can be different from the set of fields returned by getFields..().
  • In particular, if SearchDataForIndexHook adds new fields to the results of getData..(), those fields will be missing from the results of getFields..().

That's the bug, and the root cause of the bug is a design oversight: SearchDataForIndexHook is missing its pair/counterpart, SearchFieldsForIndexHook, that mirrors the relationship between the getData..() and getFields..() methods.

[...]
Implementing a MediaHandler wrapper was, in fact, my original approach to this task, before I stumbled upon SearchDataForIndexHook. (I had started out by looking at what PdfHandler does.) This approach is possible, but it turns really hacky really fast:

  • Define a wrapper TikaMediaHandler which:
    • implements a getPageText() method which queries the Tika server;
      • ...and hope that no one besides a search engine ever decides to call it, because the Tika query will be relatively expensive --- e.g., something to avoid doing more than once per file.
    • implements all the other public MediaHandler methods as forwarding to another handler instance (i.e., the "wrap existing handlers" part).
      • Is there some magical way to accomplish this in PHP, aside from writing out each individual method, that I don't know about? Maintaining wrapper classes does not seem trivial.

I think most IDE should give you the ability to generate delegate methods.

  • Define a wrapper TikaMediaHandlerFactory which:
    • is constructed with an instance of a pre-existing MediaHandlerFactory;
    • when queried for a handler for some mimetype:
      • tries to get a handler from the pre-existing factory;
      • returns a TikaMediaHandler, which possibly wraps the pre-existing-factory's handler.
  • Hook up a service-manipulator which replaces the existing MediaHandlerFactory with a TikaMediaHandlerFactory
    • ...and hope that this interacts correctly with any other extensions which are adding service-manipulators poking at MediaHandlerFactory.

I think this is what MW addServiceManipulator offers and is made for.

(If you are really that concerned about things not being used for their "original intentions", all that wrapping/monkey-patching should be making you nauseous.)

I don't see anything too hackish here as the MediaHandler seems to be the right place to do what you want: "Extract text from a media file". But I agree that the MediaHandler abstract class seems a bit daunting.

By comparison, using the SearchDataForIndexHook approach:

  • Define an onSearchDataForIndexHook method which:
    • Queries Tika and puts the result in the file_text field.
  • Register the hook in extension.json.

The hook approach is a lot less code, and much more robust code.

Indeed it is simpler and lot less code on your side but this can hardly be a valid reason to put the burden on MW core with a ContentHandler change for which I don't think the use-case is justified (give power to the hook to mutate everything in the fields array) and I think goes against some concepts explored to re-design MW hooks (see T212482):

  • Filter hooks must not do expensive work
  • Filter hooks are given only one mutable value
  • Filter hooks must be deterministic and must not cause side-effects

IMO decorating existing MediaHandler sounds like a no-brainer:

  • seems relatively robust because this is the role of MediaHandler implementations
  • does not require MW core modification
  • does not require to back-port to previous MW versions if you plan to support older versions
  • I think that calling a MediaHandler methods is expected to be slow or costly, esp. get*Text() methods.

[...]

That's the bug, and the root cause of the bug is a design oversight: SearchDataForIndexHook is missing its pair/counterpart, SearchFieldsForIndexHook, that mirrors the relationship between the getData..() and getFields..() methods.

I think I disagree that this is a bug as there is nothing explicitly mentioning that SearchDataForIndexHook can be used to mutate existing values in the fields array and that SearchDataForIndexHook/SearchFieldsForIndexHook hooks should be paired with a particular ContentHandler method.

...

Indeed it is simpler and lot less code on your side but this can hardly be a valid reason to put the burden on MW core with a ContentHandler change for which I don't think the use-case is justified (give power to the hook to mutate everything in the fields array)...

Got it; I am resigned to not trying to use SearchDataForIndexHook for my particular use-case.

...and I think goes against some concepts explored to re-design MW hooks (see T212482):

  • Filter hooks must not do expensive work
  • Filter hooks are given only one mutable value
  • Filter hooks must be deterministic and must not cause side-effects

However, the two core issues of this bug report are still issues (independent of how I came along to post them):

  1. Running SearchDataForIndexHook is an interface invariant of ContentHandler::getDataForSearchIndex(), and the ContentHandler base class should enforce this invariant (and not let derived classes be left in charge of how/when/if to run the hook), and it should run the hook at a consistent place in the execution of getDataForSearchIndex().
  2. If ContentHandler::getFieldsForSearchIndex() is a necessary method that search-engines are supposed to use, then it is missing a SearchFieldsForIndexHook, without which it may return incomplete results. Conversely, if the complete/correct information is already available via another method, then the broken getFieldsForSearchIndex() should be removed altogether and search-engines should use the other method.

(1) has nothing to do with what a hook even does (and is consistent with the discussion in T212482). However, where the brokenness of (1) falls on the spectrum of "Meh, whatever." -> "Smells funny..." -> "Subtle bugs waiting to happen" -> "Never do this!" is indeed debatable. Maybe it is not worth fixing up anytime soon. The design pattern I used in my submitted patch is one way to fix it, and that design pattern is already used elsewhere in the MW core code (though certainly not consistently, even within single interfaces).

(2) is a concrete error in the ContentHandler interface. Whether or not it causes a user-visible bug will depend on how much a search-engine relies on getFieldsForSearchIndex() and how much behavior an extension imbues into its new field definitions. (E.g., maybe searches on coordinates via the GeoData extension are already subtly broken, maybe not?)

...
I think I disagree that this is a bug as there is nothing explicitly mentioning that SearchDataForIndexHook can be used to mutate existing values in the fields array and that SearchDataForIndexHook/SearchFieldsForIndexHook hooks should be paired with a particular ContentHandler method.

(2) has nothing to do with mutating field values; the problem specifically relates to adding new fields and the definitions for those fields. I don't know what else I can write to explain how (2) is a bug, beyond what I have already written. If you have any questions, I'll be happy to try to answer them.

(FYI: (2) has nothing to do with my own "modify file_text value" use-case. I just happened to notice it along with (1), and I think that when it is fixed, the fix should also involve the ContentHandler baseclass enforcing the invariant of calling the hook, at a specific time, etc, etc.)

[...]
Implementing a MediaHandler wrapper was, in fact, my original approach to this task, before I stumbled upon SearchDataForIndexHook. (I had started out by looking at what PdfHandler does.) This approach is possible, but it turns really hacky really fast:
...
(If you are really that concerned about things not being used for their "original intentions", all that wrapping/monkey-patching should be making you nauseous.)

I don't see anything too hackish here as the MediaHandler seems to be the right place to do what you want: "Extract text from a media file". But I agree that the MediaHandler abstract class seems a bit daunting.

6 abstract methods, and 37 public methods marked "@stable to override"! (And, getEntireText() isn't even one of them. :) )

(The whole "wrap existing handlers, to enable searching on OCR output of bitmap images" is an inherently hacky stretch-goal anyway. There is less work involved in creating a very basic catch-all MediaHandler implementing only getPageText(), constructed for any file without an existing handler --- e.g., OFFICE files.)

I think most IDE should give you the ability to generate delegate methods.

Alas, I do all my coding using vi on a pair of side-by-side VT100 terminals.

My IDE is my imagination.

Imagination!

Thanks for filling this bug and going through the process of explaining why you chose this particular solution.
While I agree that the current design of SearchDataForIndexHook & SearchIndexFieldsHook is clearly problematic I believe that the direction you take is not the right one as it couples even more these two hooks with ContentHandler and introduce temporal coupling to expose even more data to the hook which is going against upcoming redesign of the MW hook system.

The lack of symmetry between the call sites of these hook might cause subtle bugs I agree but mainly because CirrusSearch calls ContentHandler#getFieldsForSearchIndex() while indexing documents for discovering engine hints. If an extension decides to declare a new field that requires engine hints then I believe these hints might be ignored by CirrusSearch but this should probably be a dedicated bug report.

Change 734801 abandoned by DCausse:

[mediawiki/core@master] content: Ensure SearchDataForIndexHook runs at end of getDataForSearchIndex()

Reason:

per discussion on T294405

https://gerrit.wikimedia.org/r/734801