HHVM request timeouts not working; support lowering the API request timeout per request
Closed, ResolvedPublic

Description

After an outage caused by expensive gallery tag expansions not timing out & being retried, we introduced a PHP API timeout of 290s. Subsequent changes lowered this to 60s. However, there is evidence both in this task, T152074 and T149421 that those timeouts do not actually work in an FCGI context. Expensive requests can and do pile up in HHVM, causing outages such as T151702.

Further tightening timeouts per API end point or request

Since the normal cost of different API end points differs by several orders of magnitude, a global upper bound like 60 seconds is unlikely to be useful for many clients. For example, in many situations users are likely to move on instead of waiting for 60s. More critically, avoiding retry amplification requires a coordination of timeouts in our infrastructure (as described in https://www.mediawiki.org/wiki/Rules_of_thumb_for_robust_service_infrastructure), which means that the time budget for the lowest level services is very limited.

It would be much better for overall system stability to support tighter timeouts per API end points.

One option would be to generally set up timeouts based on expected execution times, and clearly document these so that clients can set their timeouts slightly larger. Another option would be to allow clients to pass in a lower timeout.

There are a very large number of changes, so older changes are hidden. Show Older Changes
GWicke added a subscriber: BBlack.
GWicke added subscribers: faidon, Joe.

On the PHP level, I don't know of any straightforward mechanism to implement a timeout based on wall clock time as you're asking for here. set_time_limit works on CPU time, which wouldn't help in the reasonably common case where time spent waiting on the database is significant.

Or is the idea to set the timeout at a higher layer (IMO that sounds like a more reliable solution) via a custom HTTP request header?

GWicke updated the task description. (Show Details)Apr 25 2015, 4:05 PM
GWicke added a comment.EditedApr 25 2015, 4:24 PM

On HHVM, the timeout is set through max_execution_time. The current value can be retrieved with ini_get("max_execution_time"). It might be worth giving ini_set("max_execution_time", 60) a shot to see if the var can be updated at runtime.

Or is the idea to set the timeout at a higher layer (IMO that sounds like a more reliable solution) via a custom HTTP request header?

Clients could pass in the timeout in a header or query parameter.

The main goals we are trying to achieve by lowering the timeouts are

  1. free associated resources on timeout (free the thread, stop querying the DB), and
  2. signal a 503 to clients in a timely manner, so that they can detect that a server-side timeout was hit.

The signaling in 2) could be achieved with Apache too (which could be a good stop-gap), but I don't think that would help with 1).

On HHVM, the timeout is set through max_execution_time.

That's equivalent to set_time_limit() and works on CPU time, according to http://docs.hhvm.com/manual/en/info.configuration.php#ini.max-execution-time. In fact, set_time_limit() internally seems to actually do the ini_set() in Zend PHP.

But apparently HHVM has a configuration option to select wall time versus CPU time[1] that's not mentioned in the docs for set_time_limit() or max_execution_time? And we even set it in the change you linked. Then we should be good if we accept the different semantics for people running MediaWiki under Zend or with a different value for that setting.

Otherwise, a CPU time limit would work for limiting resources when it's something spinning in PHP but wouldn't help if PHP is mostly waiting on slow DB queries and wouldn't reliably work with a "set the requested timeout just under the client timeout" scheme.

[1]: I checked the source code to see if HHVM was also doing an ini_set internally, and found a branch on "RuntimeOption::TimeoutsUseWallTime".

Clients could pass in the timeout in a header or query parameter.

If it were not being done at the PHP level, I'd rather it be a header than a parameter useless to third parties (without the same configuration we use on the cluster) that the PHP has to ignore. But if we do use set_time_limit(), then a parameter would be good.

Hmm. When I try it in mediawiki-vagrant, both the ini_set() and set_time_limit() work on the command line but when I hack it into MediaWiki with a busy-loop it doesn't actually time out despite apparently changing the ini setting. Works fine in Zend though.

@Anomie, does the timeout work when set statically in the config?

No, that didn't work either.

ori added a comment.EditedApr 26 2015, 3:18 AM

No, that didn't work either.

Ugh. For some inscrutable reason it only works if you call ini_set() in PHP code.

More specifically: max_execution_time only works via ini_set(), but hhvm.timeouts_use_wall_time must be set in the config file.

ori added a comment.Apr 26 2015, 3:26 AM

Hmm. When I try it in mediawiki-vagrant, both the ini_set() and set_time_limit() work on the command line but when I hack it into MediaWiki with a busy-loop it doesn't actually time out despite apparently changing the ini setting. Works fine in Zend though.

WFM:

[osmium:/srv/mediawiki-local] $ cat busy.php
<?php
ini_set('max_execution_time', 20);
while ( true ) sleep( 1 );

[osmium:/srv/mediawiki-local] $ time curl -s osmium/w/busy.php -o /dev/null

real	0m20.023s
user	0m0.009s
sys	0m0.006s
[osmium:/srv/mediawiki-local] $

Still not working locally in mediawiki-vagrant. I don't seem to have access to osmium anymore to see if anything is different, but in the past I know it was.

vagrant@mediawiki-vagrant:/var/www/w$ php --version
HipHop VM 3.3.1 (rel)
Compiler: heads/master-0-g3efe530cd55a886d88b29acc014dde14c4cff755
Repo schema: b538518f4338eecd9dc2ff13865fd306926fde1b
Extension API: 20140829
vagrant@mediawiki-vagrant:/var/www/w$ grep max_execution_time /etc/hhvm/*
/etc/hhvm/fcgi.ini:max_execution_time = 2
/etc/hhvm/php.ini:max_execution_time = 3

Test file:

<?php
echo "<p>use wall time = " . ini_get('hhvm.timeouts_use_wall_time') . "</p>\n";
echo "<p>Before setting, max time = " . ini_get('max_execution_time') . "</p>\n";
ini_set('max_execution_time', 1);
echo "<p>PHP version: " . PHP_VERSION . "</p>\n";
echo "<p>Max time: " . ini_get('max_execution_time') . "</p>\n";

echo "<p>Sleeping at " . time() . "</p>\n";
sleep(5);

echo "<p>Busy loop at " . time() . "</p>\n";
$t = time()+5;
while(time() < $t){}

echo "<p>No timeout, ending at " . time() . "</p>\n";

Browser output:

wall time = 1

Before setting, max time = 0

PHP version: 5.6.99-hhvm

Max time: 1

Sleeping at 1430089584

Busy loop at 1430089589

No timeout, ending at 1430089594

For good measure I even halted and restarted vagrant, no change.

@ori, @Anomie: The difference between your test cases is that one sleeps, while the other doesn't. Are HHVM's timeouts actually preemptive, or are they relying on the code to return to the even loop before the timeout can trigger?

ori added a comment.Apr 29 2015, 3:50 AM

@ori, @Anomie: The difference between your test cases is that one sleeps, while the other doesn't. Are HHVM's timeouts actually preemptive, or are they relying on the code to return to the even loop before the timeout can trigger?

Brad's test-case sleeps as well, before busy-looping. The timeouts he sets (1, 2, and 3) should expire during sleep(5);.

Hmm, yeah. I guess different HHVM versions?

GWicke moved this task from Backlog to Under discussion on the RESTBase board.Jun 29 2015, 5:34 PM
Arlolra moved this task from Backlog to Non-Parsoid Tasks on the Parsoid board.Jul 7 2015, 3:08 AM
greg added a subscriber: greg.Sep 29 2016, 7:42 PM

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).

I still can't get it to work in mediawiki-vagrant, and the difference in proposed behavior between HHVM and Zend is still unaddressed. IMO this falls under "would be nice, but doesn't seem to actually be possible at this time".

GWicke triaged this task as Normal priority.Oct 12 2016, 7:24 PM
GWicke raised the priority of this task from Normal to High.Dec 2 2016, 7:02 PM

Increased priority, as this contributed to several user-visible outages recently.

Paladox added a subscriber: Paladox.Dec 2 2016, 7:03 PM
GWicke updated the task description. (Show Details)Dec 2 2016, 7:50 PM
mark added a subscriber: mark.

I just found an HHVM issue which suggests that max_execution_time is ignored in FCGI mode: https://github.com/facebook/hhvm/issues/4124

In FCGI HHVM server configuration context, hhvm.server.request_timeout_seconds is suggested to work.

Change 326144 had a related patch set uploaded (by Mark Bergsma):
Set hhvm.server.request_timeout_seconds to 60s

https://gerrit.wikimedia.org/r/326144

Anomie added a comment.Dec 9 2016, 6:11 PM

Tried my code from T97192#1237258 with hhvm.server.request_timeout_seconds. Even though changing it in /etc/hhvm/fcgi.ini shows up in ini_get(), it still doesn't terminate the request on vagrant.

Also, it doesn't seem that setting can be changed via ini_set() so it wouldn't help with this task that wants to be able to lower the limit on demand rather than globally.

GWicke renamed this task from Support lowering the API request timeout per request to HHVM request timeouts not working; support lowering the API request timeout per request.Dec 9 2016, 6:29 PM
GWicke updated the task description. (Show Details)
Joe added a comment.Dec 10 2016, 7:41 AM

Just for the record, the reason requests piled up in T151702 is because of a low-level deadlock in jemalloc that has nothing to do with managing timeouts in requests, but rather a resource starvation. The description of this task is thus now inaccurate.

Joe added a comment.Dec 12 2016, 7:08 AM

Tried my code from T97192#1237258 with hhvm.server.request_timeout_seconds. Even though changing it in /etc/hhvm/fcgi.ini shows up in ini_get(), it still doesn't terminate the request on vagrant.

Also, it doesn't seem that setting can be changed via ini_set() so it wouldn't help with this task that wants to be able to lower the limit on demand rather than globally.

Yes, I can confirm it doesn't seem to work with our current version, I'll test the latest version from facebook today and if the problem persists, I'll just open a new issue.

In T97192#2861483, @Joe wrote:

Just for the record, the reason requests piled up in T151702 is because of a low-level deadlock in jemalloc that has nothing to do with managing timeouts in requests, but rather a resource starvation. The description of this task is thus now inaccurate.

The point of time limits is to reliably limit the resources any single request can consume, and to provide predictable behavior to callers. This clearly did not work in this case (and others before), so requests were able to pile up.

The reason for slowness of a particular request should have no bearing on the enforcement of request timeouts.

Joe edited projects, added User-Joe; removed Patch-For-Review.Dec 23 2016, 10:10 AM
Joe moved this task from Backlog to Doing on the User-Joe board.Dec 23 2016, 10:59 AM
Joe added a comment.Dec 23 2016, 12:14 PM

After some more extensive testing in the last couple of days, I found that the current vanilla HHVM version (3.17) honors both the request_timeout limit and max_execution_time correctly, while the one we're using currently doesn't really respect any of the various possible timeouts we can define.

So the real solution to the HHVM side of problems is updating it, which we plan to do as soon as 3.18 comes out in a few weeks.

Joe moved this task from Doing to Blocking others on the User-Joe board.Jan 22 2017, 10:38 AM

@Joe, has this been fixed with 3.18?

Joe added a comment.Jun 21 2017, 7:37 AM

@Joe, has this been fixed with 3.18?

Unless someone introduced a regression, yes, it's fixed. At least on appservers and API our timeouts should now be working.

GWicke closed this task as Resolved.Jun 29 2017, 9:09 PM
GWicke claimed this task.

Okay, I verified that the test cases posted in this task now work in labs, using HHVM 3.18.2 via Apache. Assuming production is configured the same way, this makes it likely that the issue is fixed in production as well. Resolving.

Change 326144 abandoned by Mark Bergsma:
Set hhvm.server.request_timeout_seconds to 60s

https://gerrit.wikimedia.org/r/326144

tstarling reopened this task as Open.Sep 6 2018, 7:17 AM
tstarling added a subscriber: tstarling.

It's not fixed, or has regressed. I noticed this today due to T203628 and confirmed it by placing a simple infinite loop script in mwdebug1002's /w directory.

gdb shows the following calls of RequestTimer::setTimeout():

  • On RequestInjectionData::threadInit(), it is called with 60 seconds due to IniSetting::Bind() for max_execution_time
  • Then HttpRequestHandler::handleRequest() calls it with zero seconds. The value can either come from the vhost or the default. The default apparently comes from RuntimeOption::RequestTimeoutSeconds, i.e. hhvm.server.request_timeout_seconds, which is zero for us.
  • Then hphp_invoke() calls RequestInjectionData::resetTimer(), which resets the timeout to the "last one set", here zero.

I confirmed that the userspace functions set_time_limit() and ini_set('max_execution_time', ...) both work to limit execution time. So we could set it in MW userspace configuration.

Setting hhvm.server.request_timeout_seconds in server.ini would probably work. We should do this as well.

As a matter of policy, I think long-running requests like video scaling should call set_time_limit() appropriately.

Change 458623 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Set PHP time limit

https://gerrit.wikimedia.org/r/458623

Change 458624 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/puppet@production] Increase Apache timeouts by 2 seconds

https://gerrit.wikimedia.org/r/458624

Change 458624 merged by Tim Starling:
[operations/puppet@production] Increase Apache timeouts by 2 seconds

https://gerrit.wikimedia.org/r/458624

Change 458623 merged by jenkins-bot:
[operations/mediawiki-config@master] Set PHP time limit

https://gerrit.wikimedia.org/r/458623

Mentioned in SAL (#wikimedia-operations) [2018-09-10T00:04:36Z] <tstarling@deploy1001> Synchronized wmf-config/set-time-limit.php: T97192 (duration: 00m 52s)

Mentioned in SAL (#wikimedia-operations) [2018-09-10T00:07:18Z] <tstarling@deploy1001> Synchronized wmf-config/PhpAutoPrepend.php: T97192 (duration: 00m 49s)

Mentioned in SAL (#wikimedia-operations) [2018-09-10T00:12:04Z] <tstarling@deploy1001> Synchronized w/infinite-loop.php: Testing for T97192 (duration: 00m 48s)

hashar added a subscriber: hashar.

While promoting 1.32.0-wmf.22 yesterday and today, I noticed a spike of web request took longer than 60 seconds and timed out. I suspect it always happened and it is now showing up thanks to the patches above. I have filled the issue as T204871

Joe added a comment.Sep 21 2018, 7:56 AM

I think this ticket can be closed thanks to @tstarling's work.

tstarling closed this task as Resolved.Sep 25 2018, 1:07 PM