
PHP web requests running for multiple hours
Open, Needs Triage, Public

Description

In the parent ticket we are debugging a problem with missing events from edit requests. The source of this problem appears to be a request that ran for multiple hours in post-send, during which many (all?) of the deferred updates timed out.

To see how common this is, I wrote a script that queries reqIds that logged EmergencyTimeoutException, then ran an aggregation query that filtered logs to the same (host, reqId) combo (to exclude jobs that reuse a reqId) and reported the delta between the earliest and latest log message. This reports requests with more than 10 minutes between start and end.

Script: P69109
Results for Sep 1 - 11: P69110
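The actual script is in P69109; the aggregation it describes can be sketched as follows. This is a minimal illustration, not the real paste: the record structure, field names, and sample values are assumptions, and in production the rows would come from a logstash/Elasticsearch query rather than a literal list.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical log records for reqIds that logged
# EmergencyTimeoutException; field names are assumptions.
logs = [
    {"host": "mw1414", "reqId": "8ee8a9cd", "ts": "2024-09-01T10:00:00"},
    {"host": "mw1414", "reqId": "8ee8a9cd", "ts": "2024-09-01T10:14:20"},
    {"host": "mw1401", "reqId": "aaaa", "ts": "2024-09-01T11:00:00"},
    {"host": "mw1401", "reqId": "aaaa", "ts": "2024-09-01T11:00:30"},
]

# Aggregate over (host, reqId) so jobs that reuse a reqId on another
# host don't get mixed into the same bucket.
spans = defaultdict(list)
for rec in logs:
    spans[(rec["host"], rec["reqId"])].append(
        datetime.fromisoformat(rec["ts"])
    )

# Report requests with > 10 minutes between first and last message.
threshold = timedelta(minutes=10)
for (host, req_id), stamps in spans.items():
    delta = max(stamps) - min(stamps)
    if delta > threshold:
        print(host, req_id, delta)
```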

It's not a huge number of requests per day, on average fewer than 10, with 20 on the worst day, but we do see multiple requests per day that run for 2+ hours. The longest request ran for 173 minutes.

I should note that this depends on the initial request logging something. If the initial request didn't log anything and the logs start at the timeout, it is likely not included here. Perhaps EmergencyTimeoutException could be adjusted to report the current request runtime in its error message, to give more concrete information.

Event Timeline

My initial guess at a proximal cause would be deferred updates blocking on insert locks at the database, where they hang for a long time in iowait.

A big difference between php-fpm and HHVM is that PHP doesn't natively have a wall-clock-based timeout. For instance, max_execution_time only counts time spent executing PHP bytecode, not time spent in things like database queries or HTTP requests.
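The CPU-time vs wall-clock distinction can be demonstrated with POSIX interval timers. This is an analogy in Python (not how PHP implements it): ITIMER_PROF only advances while the process consumes CPU, like max_execution_time, so it never fires while the process is blocked; ITIMER_REAL tracks wall-clock time and fires regardless. Assumes Linux signal semantics.

```python
import signal
import time

# ITIMER_PROF counts only CPU time (like PHP's max_execution_time);
# ITIMER_REAL counts wall-clock time (like a true wall-clock timeout).
fired = {"prof": False, "real": False}

signal.signal(signal.SIGPROF, lambda s, f: fired.__setitem__("prof", True))
signal.signal(signal.SIGALRM, lambda s, f: fired.__setitem__("real", True))

signal.setitimer(signal.ITIMER_PROF, 0.2)  # 0.2s of CPU time
signal.setitimer(signal.ITIMER_REAL, 0.2)  # 0.2s of wall-clock time

time.sleep(0.5)  # stands in for a blocking DB query: wall time, no CPU

signal.setitimer(signal.ITIMER_PROF, 0)  # cancel the CPU-time timer
print(fired)  # the wall-clock timer fired; the CPU-time timer did not
```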

The excimer PHP extension partially resolves this: Excimer can compute accurate-enough wall-clock timeouts for PHP requests, with the one limitation that the timeout can only be enforced while PHP code is executing. So if a request hangs for an hour in a MySQL query, Excimer can only kill it once the call into the external library returns.

To contrast this, we also set php-fpm's request_terminate_timeout to 201 seconds by default. This *should* be a wall-clock timeout that doesn't suffer from the aforementioned limitation, but it's also brutal and will not return a nice error page to the user.

I should note, though, that we do have exceptions to the 201-second timeout: videoscaler requests can run up to one day and jobrunners up to 1 hour, IIRC. So it would be useful to first find out whether the original request originated from a jobrunner.

I've updated the script to report the URL and referrer (technically the most common ones, but when aggregating over (reqId, host) they should all be the same) and updated both pastes. The URLs are mostly either api.php with an action=edit referrer, or action=submit.
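Picking the most common (URL, referrer) pair within each (reqId, host) bucket can be sketched like this; the row structure and field names are assumptions, not the real logstash schema.

```python
from collections import Counter

# Hypothetical log rows for one (host, reqId) bucket.
rows = [
    {"url": "/w/api.php", "referrer": "/wiki/Foo?action=edit"},
    {"url": "/w/api.php", "referrer": "/wiki/Foo?action=edit"},
    {"url": "/w/index.php?action=submit", "referrer": "/wiki/Foo"},
]

# Report the most common pair; when aggregating over (reqId, host)
# they should normally all be the same anyway.
(url, referrer), count = Counter(
    (r["url"], r["referrer"]) for r in rows
).most_common(1)[0]
print(url, referrer, count)
```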

When the Excimer timeout exception is thrown, MediaWiki replaces the response with an error page. However, the reason for having Excimer at this level is so that we can render a skinned and localised error page, which includes various bits of information from the database (e.g. sitenotice, notification count, etc.). It also involves running whatever deferred updates were queued post-send.

As such, it is possible for an Excimer timeout to fire "at the right time" and still be followed by a long execution, if badly behaved features have queued code to run post-send.

In the first example of P69110, reqId: "8ee8a9cd-01a4-43ac-911f-ac7b511e74b7", there is no sign of any warning, error, or timeout during the request. The "RequestTimeout after 180 seconds" is logged 14 minutes after the first message in that request. What's interesting is that the timeout comes from after the request has already been closed and the response has already been sent to the client. Note how the stack trace goes through doPostOutputShutdown, restInPeace, and DeferredUpdates::doUpdates.

The critical section "Wikimedia\Rdbms\Database::executeQuery" timed out after 180 seconds
…
…
#1 /srv/mediawiki/php-1.43.0-wmf.20/includes/libs/rdbms/database/Database.php(782): Wikimedia\Rdbms\DatabaseMySQL->doSingleStatementQuery(string)
…
#8 /srv/mediawiki/php-1.43.0-wmf.20/includes/libs/rdbms/querybuilder/SelectQueryBuilder.php(794): Wikimedia\Rdbms\DBConnRef->selectRow(array, array, array, string, array, array)
#9 /srv/mediawiki/php-1.43.0-wmf.20/extensions/Echo/includes/UserLocator.php(151): Wikimedia\Rdbms\SelectQueryBuilder->fetchRow()
#10 /srv/mediawiki/php-1.43.0-wmf.20/extensions/Echo/includes/Hooks.php(819): MediaWiki\Extension\Notifications\UserLocator::getArticleAuthorByArticleId(int)
#11 /srv/mediawiki/php-1.43.0-wmf.20/includes/HookContainer/HookContainer.php(159): MediaWiki\Extension\Notifications\Hooks->onLinksUpdateComplete(MediaWiki\Deferred\LinksUpdate\LinksUpdate, int)
…
#17 /srv/mediawiki/php-1.43.0-wmf.20/includes/deferred/DeferredUpdates.php(302): MediaWiki\Deferred\DeferredUpdates::run(MediaWiki\Deferred\AutoCommitUpdate)
…
#24 /srv/mediawiki/php-1.43.0-wmf.20/includes/MediaWikiEntryPoint.php(674): MediaWiki\Deferred\DeferredUpdates::doUpdates()
#25 /srv/mediawiki/php-1.43.0-wmf.20/includes/MediaWikiEntryPoint.php(496): MediaWiki\MediaWikiEntryPoint->restInPeace()
#26 /srv/mediawiki/php-1.43.0-wmf.20/includes/MediaWikiEntryPoint.php(454): MediaWiki\MediaWikiEntryPoint->doPostOutputShutdown()
#27 /srv/mediawiki/php-1.43.0-wmf.20/includes/MediaWikiEntryPoint.php(209): MediaWiki\MediaWikiEntryPoint->postOutputShutdown()
#28 /srv/mediawiki/php-1.43.0-wmf.20/index.php(58): MediaWiki\MediaWikiEntryPoint->run()
#29 /srv/mediawiki/w/index.php(3): require(string)
#30 {main}

After this one, all other deferred updates quickly cancel and fail, with cleanup completing in under a second.

Regarding slow MySQL queries, the above is a simple select query, which AFAIK has a default query-killer timeout of 30 seconds. If the query killer isn't enabled for some queries, that'd be an Rdbms bug. If it's enabled but not working, that might be worth asking the DBAs about.

Regarding long-running requests, we have many layers of HTTP timeouts for "less clean" shutdown, which exist precisely because we can't guarantee it from inside the CGI process. In theory, if multiple extensions execute slow logic via deferreds, that could stack up to several minutes even after a fatal timeout or other exception has already been thrown. However, the logs don't support that being the case. The request ended before the first message at the 14-minute mark, possibly in well under a minute. Then we hit a timeout from one MySQL query, and everything after that cancelled quickly.

My next questions would be:

  • Can we find how long one of these requests took to respond at the HTTP level (before post-send starts)?
  • Is the query-killer instruction sent in the affected SQL query? (In the above example, it's a query from the Echo extension.)
  • Is the query killer working for that query?
  • There are numerous app server timeouts listed at https://wikitech.wikimedia.org/wiki/HTTP_timeouts#App_server. There is indeed the infamous CPU-time max_execution_time at the PHP engine level, but there is also "php-fpm: request_timeout Wall clock time: 201". Is this configured and working correctly?

> When the Excimer timeout exception is thrown, MediaWiki replaces the response with an error page. However, the reason for having Excimer at this level is so that we can render a skinned and localised error page, which includes various bits of information from the database (e.g. sitenotice, notification count, etc.). It also involves running whatever deferred updates were queued post-send.
>
> As such, it is possible for an Excimer timeout to fire "at the right time" and still be followed by a long execution, if badly behaved features have queued code to run post-send.

Yes, as I previously stated. But the reason excimer timeouts can take a long time is not what you wrote. It's that when we're executing code in an external C library and/or making a blocking I/O call, it's impossible for Excimer to interrupt that operation: the Zend engine needs to take back control for the timeout to fire.

> Regarding slow MySQL queries, the above is a simple select query, which AFAIK has a default query-killer timeout of 30 seconds. If the query killer isn't enabled for some queries, that'd be an Rdbms bug. If it's enabled but not working, that might be worth asking the DBAs about.

The query killer isn't going to be able to quickly terminate a query that's part of a deadlock, which I suspect could be the reason for these long-running requests. I will need to dig a bit deeper to confirm these are real data and not a result of bad accounting on our part.

> My next questions would be:
>
>   • Can we find how long one of these requests took to respond at the HTTP level (before post-send starts)?

You can look at the Apache access log in Logstash, which is sampled, or read mediawiki.httpd.accesslog directly from kafka-logging, which is unsampled.
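Filtering long requests out of the unsampled access log could look roughly like the sketch below. The JSON shape and the field name "duration_us" are hypothetical; check the real schema of mediawiki.httpd.accesslog before relying on it, and in practice the records would come from a Kafka consumer rather than a literal string.

```python
import json

# Hypothetical access-log record; "duration_us" (request service time
# in microseconds) is an assumed field name, not the real schema.
raw = '{"uri": "/w/api.php", "status": 200, "duration_us": 863000000}'
rec = json.loads(raw)

seconds = rec["duration_us"] / 1_000_000
if seconds > 201:
    # anything past the 201s php-fpm wall-clock limit is suspicious
    print(f"{rec['uri']} took {seconds:.0f}s")
```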

>   • There are numerous app server timeouts listed at https://wikitech.wikimedia.org/wiki/HTTP_timeouts#App_server. There is indeed the infamous CPU-time max_execution_time at the PHP engine level, but there is also "php-fpm: request_timeout Wall clock time: 201". Is this configured and working correctly?

It is configured and, unless there's a bug in the latest PHP 7.4, it should also work:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/_mediawiki-common_/global.yaml#153

which translates to setting this environment variable:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/mediawiki/templates/lamp/deployment.yaml.tpl#131

which is then used in the configuration file for php-fpm:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/+/refs/heads/master/images/php/7.4/fpm/conf/pool.d/FCGI_UNIX.conf#15
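Schematically, the pool option involved is php-fpm's request_terminate_timeout; the rendered config ends up roughly like the sketch below (not the exact production file, where the value is templated in from the environment variable above):

```
; sketch of the relevant php-fpm pool setting: a hard wall-clock
; limit per request, after which the worker is killed
[www]
request_terminate_timeout = 201
```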

The only places where that value is higher are jobrunners and videoscalers, for obvious reasons. If you want to test it, change the value of php.timeout in mwdebug to something like 15 seconds, then use the fatal-error.php page (or inject a PHP file there) to verify it gets triggered.