The current theory is that T193565 is caused by timeout exceptions being thrown from the Excimer timeout handler after a DB-related function returns, then caught and discarded by DeferredUpdates. DeferredUpdates then continues its loop as if nothing happened, causing further problems because the Database object is left in an unexpected state.
So there are two problems:
- DeferredUpdates should probably stop when it runs out of time, rather than continuing indefinitely, although perhaps a different time limit should apply after fastcgi_finish_request() is called. If it is running indefinitely on purpose, we shouldn't break a random update within the sequence by delivering a timeout exception to it.
- Handling timeouts within Database would be much simpler if we could declare critical sections, causing timeouts to be queued until the critical section exits.
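To make the second point concrete, here is a hypothetical sketch of how a critical section could queue a timeout and deliver it on exit. The class and method names below are illustrative only, not a proposed API:

```php
// Hypothetical sketch: a critical section that queues a pending timeout
// exception and rethrows it only once the section has been exited.
class CriticalSection {
	private $depth = 0;
	private $pendingTimeout = null;

	public function enter( string $name ): void {
		$this->depth++;
	}

	// Called from the timeout handler: defer delivery if we are
	// currently inside a critical section.
	public function onTimeout( TimeoutException $e ): void {
		if ( $this->depth > 0 ) {
			$this->pendingTimeout = $e;
		} else {
			throw $e;
		}
	}

	public function exit( string $name ): void {
		$this->depth--;
		// Deliver any timeout that arrived while we were inside
		if ( $this->depth === 0 && $this->pendingTimeout !== null ) {
			$e = $this->pendingTimeout;
			$this->pendingTimeout = null;
			throw $e;
		}
	}
}
```

With something like this, Database could wrap its state-changing operations in enter()/exit() pairs and never see a timeout exception mid-operation.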
The Excimer timeout handler is currently a few lines of code in WMF config, so there is no interface allowing MediaWiki core to interact with it. My proposal is to split it out into a RequestTimeout library:
```php
// WMF config
use Wikimedia\RequestTimeout\RequestTimeout;

$rt = RequestTimeout::singleton();
$rt->setWallTimeLimit( $wmgTimeLimit );

// Setup.php
if ( $wgRequestTimeout && !$wgCommandLineMode && !$rt->isStarted() ) {
	$rt->setWallTimeLimit( $wgRequestTimeout );
}

// Critical section
$id = __METHOD__;
$csp = $rt->createCriticalSectionProvider( /* emergency timeout */ 5 );
$csp->enter( $id );
critical_thing();
$csp->exit( $id );

// Querying
if ( $rt->getWallTimeRemaining() > 5 ) {
	do_slow_thing();
} else {
	do_fast_thing();
}

// Let MW configure the emergency timeout:
$csp = MediaWikiServices::getInstance()->getCriticalSectionProvider();
$csp->enter( __METHOD__ );
...
$csp->exit( __METHOD__ );

// Run forever with watchdog timer
while ( true ) {
	$rt->setWallTimeLimit( 5 );
	do_thing();
}

// Respect timeout exceptions
use Wikimedia\RequestTimeout\TimeoutException;

try {
	do_thing();
} catch ( TimeoutException $e ) {
	throw $e;
} catch ( Throwable $e ) {
	// ignore
}
```
The library would have fallback functionality for when the excimer extension is not present: setting a time limit would fall back to set_time_limit(). So unlike Excimer itself, RequestTimeout would always be a singleton. I think a singleton is better than static functions because a singleton lets you swap out the whole implementation, either to switch between the Excimer and set_time_limit() backends, or for testing.
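To illustrate the singleton-with-swappable-backend idea, here is a rough sketch. Only RequestTimeout::singleton() and setWallTimeLimit() come from the proposal above; the backend class and setInstance() helper are assumptions for illustration:

```php
// Sketch: the singleton chooses a backend once, and tests can replace it.
abstract class RequestTimeout {
	private static $instance;

	public static function singleton(): RequestTimeout {
		if ( !self::$instance ) {
			// The real library would prefer an Excimer-based backend
			// when the extension is loaded; this sketch shows only
			// the set_time_limit() fallback.
			self::$instance = new BasicRequestTimeout();
		}
		return self::$instance;
	}

	// Hypothetical hook letting tests swap in a mock implementation
	public static function setInstance( RequestTimeout $instance ): void {
		self::$instance = $instance;
	}

	abstract public function setWallTimeLimit( $seconds ): void;
}

class BasicRequestTimeout extends RequestTimeout {
	public function setWallTimeLimit( $seconds ): void {
		// Note: set_time_limit() counts CPU time on Linux, not wall
		// time, so this is only an approximation of Excimer behaviour.
		set_time_limit( (int)ceil( $seconds ) );
	}
}
```

The abstract base class is what makes the swap possible: callers only ever see RequestTimeout::singleton(), so neither the Excimer/fallback choice nor a test double requires any change at call sites.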