Page MenuHomePhabricator

Error: Class 'MediaWiki\HookContainer\HookRunner' not found
Open, LowPublicPRODUCTION ERROR

Description

Error
labels.normalized_message
[{reqId}] {exception_url}   Error: Class 'MediaWiki\HookContainer\HookRunner' not found
error.stack_trace
from /srv/mediawiki/php-1.41.0-wmf.28/includes/exception/MWExceptionHandler.php(811)
#0 /srv/mediawiki/php-1.41.0-wmf.28/includes/exception/MWExceptionHandler.php(339): MWExceptionHandler::logError(ErrorException, string, string, string)
#1 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#2 /srv/mediawiki/src/Profiler.php(342): trigger_error(string, integer)
#3 /srv/mediawiki/src/Profiler.php(247): Wikimedia\MWConfig\Profiler::excimerFlushToArclamp(array, array, string)
#4 /srv/mediawiki/php-1.41.0-wmf.28/includes/page/Hook/IsFileCacheableHook.php(25): Wikimedia\MWConfig\Profiler::Wikimedia\MWConfig\{closure}(ExcimerLog)
#5 /srv/mediawiki/php-1.41.0-wmf.28/includes/AutoLoader.php(221): require(string)
#6 [internal function]: AutoLoader::autoload(string)
#7 /srv/mediawiki/php-1.41.0-wmf.28/includes/HookContainer/HookRunner.php(42): spl_autoload_call(string)
#8 /srv/mediawiki/php-1.41.0-wmf.28/includes/AutoLoader.php(221): require(string)
#9 [internal function]: AutoLoader::autoload(string)
#10 /srv/mediawiki/php-1.41.0-wmf.28/includes/MediaWikiServices.php(339): spl_autoload_call(string)
#11 /srv/mediawiki/php-1.41.0-wmf.28/includes/Setup.php(351): MediaWiki\MediaWikiServices::getInstance()
#12 /srv/mediawiki/php-1.41.0-wmf.28/includes/WebStart.php(92): require_once(string)
#13 /srv/mediawiki/php-1.41.0-wmf.28/index.php(44): require(string)
#14 /srv/mediawiki/w/index.php(3): require(string)
#15 {main}
Impact
  • requests fail randomly, at a very low rate (less than once per day on average)
Notes

Details

Request URL
https://en.wikipedia.org/?curid=*

Event Timeline

I would assume there is a previous exception (logged - or may not logged due to this exception - with E_USER_WARNING from the wmf Profiler.php in operations) as the mention line 811 in MWExceptionHandler calls the HookRunner and is trying to load the class, that fails for some reasons and seeing this message, not the previous failure.

IsFileCacheableHook is a interface in the class MediaWiki\HookContainer\HookRunner, when one interface fails to load, the class would not be found.

This is still regularly seen. Logstash: mediawiki dashboard message:HookRunner AND message:"not found"

  • codfw, mw-web (kubernetes), login.wikimedia.org [d2e15ff8-f4e4-4f3d-bae4-6556defc28a9] /wiki/Special:CentralAutoLogin/checkLoggedIn?… Error: Class 'MediaWiki\HookContainer\HookRunner' not found
  • codfw, jobrunner (mw2279), commons.wikimedia.org, [8766276b-ce0f-44d7-8149-4349707ce881] /rpc/RunSingleJob.php Error: Class 'MediaWiki\HookContainer\HookRunner' not found
  • eqiad, mw-api-ext (kubernetes), wikidata.org, [6fbf7a29-c38e-4a99-b393-a50c8c703b3c] /w/api.php?action=wbgetentities&… Error: Class 'MediaWiki\HookContainer\HookRunner' not found

Analyzing the data in Logstash, it seems to me:

  • Likely not due to specific components like CentralAuth or Wikibase, because it also happens during requests that involve other components. The only common factor semes to be HookRunner.
  • Likely not due to a general server or infra issue, because it only happens with HookRunner and not with other classes.

I think something is going on with eximer sampling. Line 25 in IsFileCacheableHook.php is an empty line at the end of the file, and there is no executable code in that file. IsFileCacheableHook depends on Article, but the declaration by itself wouldn't even trigger autoloading.

The only way I see to get from IsFileCacheableHook to Profiler::excimerFlushToArclamp is through whatever magic eximer uses to interrupt the normal flow of execution and take a sample of the stack trace. When that fails (due to an issue talking to Redis?) we try to report an error, but that fails because we are in some strange state outside the regular PHP runtime.

Maybe @tstarling has an idea.

daniel triaged this task as High priority.Feb 29 2024, 4:45 PM
FJoseph-WMF raised the priority of this task from High to Needs Triage.Mar 19 2024, 6:03 PM

@tstarling any thoughts on next steps for this item? Can we move it to low priority?

The frequency of these errors before March 6 g1008752 reflected the frequency of Redis connection timeout errors. About 1 in every 160 errors raised by excimerFlushToArclamp() would fail with "HookRunner not found". So following Timo's patch, we can expect this to be very rare indeed. In logstash since that date, there has only been one such error. So I think it is a low priority.

In principle it should be possible to raise an error from any context. Error handling shouldn't fail.

The stack trace shows the autoloader is already loading HookRunner.php and then HookRunner is required again a second time. Presumably production spends 1/160th of its time in the HookRunner file scope, so there is a 1/160 chance that a given excimer log flush will not be able to load HookRunner.

MWExceptionHandler could be more defensive about loading classes, or it could load HookRunner on install.

excimerFlushToArclamp() could check whether the autoloader is in the stack, and if it is, it could queue the event rather than sending it immediately.

Tagging the platform team, since this seems to be abut making MWExceptionHandler more defensive.

This is caused by the

( new HookRunner( MediaWikiServices::getInstance()->getHookContainer() ) )->onLogException( $e, $suppressed );

line that's called after an exception is logged, ie. it does not break exception logging, it just logs another exception afterwards (and interrupts request processing if the original exception didn't already do so). Not great but relatively low-impact.

I guess we could check if HookRunner is loaded; not great though because the point of the LogException hook is to facilitate custom error logging, so on sites using that such errors would get silently ignored. The nice solution would be to 1) document what components MWExceptionHandler is allowed to use, 2) make sure these are loaded before MWExceptionHandler is set up.

This is caused by

( new HookRunner( MediaWikiServices::getInstance()->getHookContainer() ) )->onLogException( $e, $suppressed );

[…] The nice solution would be to 1) document what components MWExceptionHandler is allowed to use, 2) make sure these are loaded before MWExceptionHandler is set up.

HookRunner and HookContainer nowadays depends on hundreds of interfaces and the service container, which essentiallys require nearly all core components to be loaded to an extent. Pre-emptively loading that all does not seem feasible, unless we're okay with replacing this problem with another one: We need a new error handler for the significant portion of the request where we are loading all of that.

Our hooks have evolved a lot since they were first introduced. I suspect that, when onLogException was introduced, it's quite likely that Hooks still used global variables. This means that they naturally work out of the box with an empty state, and close to zero dependencies even within core. If/when/as more core components load, and more extensions load, they added themselves to the Hooks registry. Which means you naturally get the maximum functionality, without over-coupling or breaking/depending on things before you can use them. That is, when we failed before something loads, we report as best we can without that loading, and after something loads, the loaded thing can potentially customise and handle that error. That simpler reality is still the one we have in several areas where we embrace the static nature of requests in a CGI application. But for hooks it isn't how things work anymore.

In any event, even in these before times, there has always been the gap before Setup.php is finished where the GUI/MWExceptionHandler isn't setup yet (where PHP will render a 5xx instead, and emit details to an error.log depending on webserver config), and there is inherently another gap between ExtensionRegistry and MWExceptionHandler (although today that gap is in the opposite direction from what I would have assumed, oddly we finish nearly all of Setup and ExtensionRegistry, before installing MWExceptionHandler).

There is an unconditional MediaWikiServices/HookRunner/HookContainer call almost immediately after installing MWExceptionHandler in Setup.php:

Setup.php



ExtensionRegistry::getInstance()->loadFromQueue();
ExtensionRegistry::getInstance()->finish();
MediaWikiServices::allowGlobalInstance();
define( 'MW_SERVICE_BOOTSTRAP_COMPLETE', 1 );

MWExceptionHandler::installHandler();

$wgHooks = new ( MediaWikiServices::getInstance()->getHookContainer(),  );





( new HookRunner( MediaWikiServices::getInstance()->getHookContainer() ) )->onSetupAfterCache();



This means most of the setup is already out of reach for MWExceptionHandler (and the onLogException hook), and the gap between it being installed and confirmed to have HookRunner and all services available, appears to be very tiny indeed. Basically, immediately after installing MWExceptionHandler, we go straight into the "mega" load sequence of all the core components (MediaWikiServices) and all the interfaces (HookRunner). We could intentionally re-order this (which is effectively the same as making one eagerly load the other), but given we are talking about a failure that happens while loading this very thing, would not increase coverage of the hook, it would decrease it. It would make this entire phase out of reach for MWExceptionHandler, and this by extent out of reach for Monolog and Logstash detailed reporting, in favour of very basic php-fpm stderr logs (no stack trace, close to no Logstash visibility). That seems like a regression.

I'd lean towards making this hook conditional on MediaWikiService::hasInstance() && class_exists( 'HookRunner', autoload: false ), or removing support for the hook. The thinking being that this seems too large a large of the setup phase to lose our primary error handler for.

HookRunner and HookContainer nowadays depends on hundreds of interfaces and the service container, which essentiallys require nearly all core components to be loaded to an extent.

HookRunner does, but HookContainer doesn't. HookContainer actually doesn't depend on much at all, except maybe by way of calling wfDeprecated. HookRunner is not needed to call hooks, it's just a wrapper for type safety and convenience.