Create RequestTimeout library
Closed, Resolved · Public

Description

The current theory is that T193565 is caused by timeout exceptions being thrown from the Excimer timeout handler after a DB-related function returns, then caught and discarded by DeferredUpdates. DeferredUpdates then continues its loop as if nothing happened, causing further problems because the Database object is left in an unexpected state.

So there are two problems:

  • DeferredUpdates should probably stop when it runs out of time rather than continuing to run forever, although maybe a different time limit should apply after fastcgi_finish_request() is called. If it is intentionally running forever, we probably shouldn't break a random update within the sequence by delivering a timeout exception to it.
  • Handling timeouts within Database would be much simpler if we could declare critical sections, causing timeouts to be queued until the critical section exits.

The Excimer timeout handler is a few lines of code in WMF config, so there is no interface to allow MediaWiki core to interact with it. My proposal is to split it out into a RequestTimeout library:

// WMF config
use Wikimedia\RequestTimeout\RequestTimeout;
$rt = RequestTimeout::singleton();
$rt->setWallTimeLimit( $wmgTimeLimit );

// Setup.php
$rt = RequestTimeout::singleton();
if ( $wgRequestTimeout && !$wgCommandLineMode && !$rt->isStarted() ) {
    $rt->setWallTimeLimit( $wgRequestTimeout );
}

// Critical section
$id = __METHOD__;
$csp = $rt->createCriticalSectionProvider( /* emergency timeout */ 5 );
$csp->enter( $id );
critical_thing();
$csp->exit( $id );

// Querying
if ( $rt->getWallTimeRemaining() > 5 ) {
    do_slow_thing();
} else {
    do_fast_thing();
}

// Let MW configure the emergency timeout:
$csp = MediaWikiServices::getInstance()->getCriticalSectionProvider();
$csp->enter( __METHOD__ );
...
$csp->exit( __METHOD__ );

// Run forever with watchdog timer
while ( true ) {
    $rt->setWallTimeLimit( 5 );
    do_thing();
}

// Respect timeout exceptions
use Wikimedia\RequestTimeout\TimeoutException;
try {
    do_thing();
} catch ( TimeoutException $e ) {
    throw $e;
} catch ( Throwable $e ) {
    // ignore
}

The library would have fallback functionality for when the Excimer extension is not present: setting a time limit would fall back to set_time_limit(). So unlike Excimer itself, RequestTimeout would always be usable, as a singleton. I think a singleton is better than static functions because a singleton lets you swap the whole implementation, either to switch between Excimer and set_time_limit(), or for testing.
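
For illustration, the backend selection might look something like this. This is a minimal sketch, assuming a hypothetical $limit value in seconds; it is not the library's actual internals:

// Sketch only: choose a backend depending on whether Excimer is loaded
$limit = 60.0; // hypothetical configured limit in seconds
if ( extension_loaded( 'excimer' ) ) {
    // Excimer backend: a wall-clock timer that delivers a TimeoutException
    // from a timer callback, and can report the remaining time.
} else {
    // Fallback backend: PHP's own limit. Note that set_time_limit() counts
    // CPU time on Linux, not wall time, and its timeout cannot be caught.
    set_time_limit( (int)ceil( $limit ) );
}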

Event Timeline

This is in progress. I'll probably have a basic initial commit up tomorrow. It's feature-complete per the above synopsis with only ~300 LOC (excluding blank and comment lines), so it's no big deal.

Task description edit: I removed the idea of scopedEnter(), i.e. scoped critical sections, since throwing an exception from a destructor is dicey. It's a fatal error during request shutdown.

I'm still thinking about re-adding scoped critical sections, because otherwise every critical section needs to be wrapped in try/finally, which is annoying. Yes a fatal error will be generated if a critical section scope is destroyed during request shutdown, but maybe that is the least bad option.
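
For comparison, the try/finally wrapping in question would look like this (a sketch using the enter()/exit() API from the description):

$csp->enter( __METHOD__ );
try {
    critical_thing();
} finally {
    // Without this, an exception from critical_thing() would leave the
    // critical section open, so timeouts would stay queued.
    $csp->exit( __METHOD__ );
}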

I wonder if the scope object should have an explicit exit function, to make it a bit more obvious in calling code. So most timeout exceptions will be thrown from the exit() call, not from __destruct().
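
Something like this, say (a sketch; scopedEnter() is the name from the removed task description text, and the scope object's API is hypothetical):

// Scoped critical section with an explicit exit call
$scope = $csp->scopedEnter( __METHOD__ );
critical_thing();
// Normal path: a queued TimeoutException would be thrown here, at a
// visible call site. If critical_thing() throws instead, __destruct()
// still ends the critical section.
$scope->exit();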

Change 646899 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/libs/RequestTimeout@master] RequestTimeout library initial commit

https://gerrit.wikimedia.org/r/646899

Change 649664 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] Zuul: Install CI for mediawiki/libs/RequestTimeout

https://gerrit.wikimedia.org/r/649664

Change 649664 merged by jenkins-bot:
[integration/config@master] Zuul: Install CI for mediawiki/libs/RequestTimeout

https://gerrit.wikimedia.org/r/649664

Mentioned in SAL (#wikimedia-releng) [2020-12-15T16:13:30Z] <James_F> Zuul: Install CI for mediawiki/libs/RequestTimeout T269326

Change 646899 merged by jenkins-bot:
[mediawiki/libs/RequestTimeout@master] RequestTimeout library initial commit

https://gerrit.wikimedia.org/r/646899

If we want to deploy this and still use auto_prepend_file, we will have to install the composer autoloader in PhpAutoPrepend.php. And if we did that, which MediaWiki vendor directory would we get the autoloader from? We could give mediawiki-config its own vendor directory and require this library there; that's almost feasible. But it would be simpler and more robust to just start the timer from CommonSettings.php.

We could have php.ini set max_execution_time, but disable that timer during config and enable the library one -- that would at least protect us against infinite loops early in the MediaWiki setup.
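
Roughly like this (a sketch; $wmgTimeLimit is the value from the config example in the task description):

// php.ini: e.g. max_execution_time = 60 protects the configuration phase
// CommonSettings.php (sketch): hand over to the library timer
set_time_limit( 0 ); // disable the ini-based CPU-time limit
\Wikimedia\RequestTimeout\RequestTimeout::singleton()
    ->setWallTimeLimit( $wmgTimeLimit );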

My plan for core is to have $wgMaxRequestWallTime, null by default, and if it is non-null, start a timer from Setup.php. That would be usable for WMF at the expense of excluding some more code from the timer. But even if it's not used by WMF, it would still be useful to have as a reference implementation, to explain why RequestTimeout exceptions can be caught in core.
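
A sketch of what that Setup.php block might look like (the guard conditions mirror the earlier Setup.php example and are an assumption, not the merged code):

// Setup.php (sketch)
if ( $wgMaxRequestWallTime !== null && !$wgCommandLineMode ) {
    \Wikimedia\RequestTimeout\RequestTimeout::singleton()
        ->setWallTimeLimit( $wgMaxRequestWallTime );
}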

Just for reference, mediawiki-config used to have its own vendor directory for wikimedia/cdb during the HHVM era. It worked, but I think having a third autoloader in the stack wasn't great, and I would suggest avoiding it if possible.

Change 654573 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] [WIP] RequestTimeout library integration

https://gerrit.wikimedia.org/r/654573

When I wrote https://gerrit.wikimedia.org/r/c/mediawiki/core/+/644636 I was assuming that it would later plug into something like this system. I'm reasonably happy with the explicit start/exit calls from callers in that patch, which also uses __destruct(); I arrived at that after trying various other schemes.

If we want to deploy this and still use auto_prepend_file, we will have to install the composer autoloader in PhpAutoPrepend.php.[…] We could give mediawiki-config its own vendor directory, […]

As Lego mentioned, we used to have this. IMHO it significantly lowered confidence in the production stack: there was no testing of how those libraries interacted with actual MediaWiki (any version of it, let alone multiple), no detection or enforcement of conflicts, and upgrades were indeed rather difficult. We'd have to not only maintain forward- and backward-compatibility in the library, but also knowingly run production with major version 2.0 when one of the core branches expects 3.0, or vice versa. I am quite happy to not have that anymore.

[…] it would be simpler and more robust to just start the timer from CommonSettings.php. We could have php.ini set max_execution_time, but disable that timer during config and enable the library one -- that would at least protect us against infinite loops early in the MediaWiki setup.

+1 LGTM.

Do we need a "max_execution_time"-based fallback, though? It would not throw the expected exceptions, and (I think?) also could not render localised/skinned errors or use structured logging. If that's true, would it work to let our php-fpm or Apache-level timeouts handle those early failures?

Change 654573 merged by jenkins-bot:
[mediawiki/core@master] RequestTimeout library integration

https://gerrit.wikimedia.org/r/654573

Change 672579 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Use the RequestTimeout library to set time limits

https://gerrit.wikimedia.org/r/672579

Change 672579 merged by jenkins-bot:
[operations/mediawiki-config@master] Use the RequestTimeout library to set time limits

https://gerrit.wikimedia.org/r/672579

Krinkle assigned this task to tstarling.