
[Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm.
Closed, Declined · Public

Description

I've been testing fatal errors a lot lately, and sometime between yesterday and today this stopped working for all production requests.

On HHVM, fatal errors still result in HTTP 500, with the contents provided by HHVM itself via hhvm-fatal-error.php. Example:

Screenshot 2019-05-14 at 23.32.40.png (1×1 px, 136 KB)

But on PHP 7 with the A-B test cookie set (note, this is not using X-Wikimedia-Debug or anything like that), we now get an HTTP 503 status code, without the expected error page that explains what happened:

Screenshot 2019-05-14 at 23.34.54.png (1×1 px, 268 KB)

Event Timeline

Tagging monitoring as well because we generally associate HTTP 500 with application errors, and HTTP 503 with traffic/infra problems. If the PHP 7 rollout changes that, it would complicate some matters and make incident investigation more difficult.

Please provide the full responses, including headers, returned by the HHVM and PHP7 origin servers.

Hi, I've tested a few combinations of errors, and the only case where this happens is when you choose action=segfault.

In that case, the 503 is returned directly from Apache, so probably because of php-fpm. If this behaviour changed from earlier, I guess the problem is in the code, as no configuration was changed.

Varnish doesn't do any transformation here and just reproduces what it's getting from the backend.

Joe renamed this task from "[Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable"" to "[Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm.". May 15 2019, 11:54 AM
Joe removed a project: Traffic.
Joe added a project: serviceops.

I changed the title of the task to reflect my findings, and changed the associated tags accordingly.

As I explained in T187147#5295715, my understanding is that in case of a segfault php-fpm fails to respond properly in any way, forcing Apache to produce a 503 error, given that it's working in proxy mode.
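
To illustrate the difference between the two cases, here is a rough Python sketch. It is not Apache's or php-fpm's actual code: `proxy_request`, `segfaulting_backend`, `fatal_error_backend` and the toy "status line first" wire format are all assumptions made for the example, and a local socket pair stands in for the FPM socket. A backend that handles its own fatal error still sends a response, which the proxy relays; a backend that dies sends nothing, so the proxy has to produce its own 503.

```python
import socket
import threading


def proxy_request(backend_sock):
    """Relay the backend's response, or answer 503 if it sent nothing at all."""
    chunks = []
    while True:
        data = backend_sock.recv(4096)
        if not data:  # EOF: the backend closed its end of the connection
            break
        chunks.append(data)
    raw = b"".join(chunks)
    if not raw:
        # Nothing came back, so the only possible answer is the proxy's own error.
        return 503, b"Service Temporarily Unavailable"
    # Toy wire format (assumption): "<3-digit status>\n<body>"
    status, _, body = raw.partition(b"\n")
    return int(status), body


def segfaulting_backend(sock):
    # Stand-in for a worker that dies mid-request: close without writing anything.
    sock.close()


def fatal_error_backend(sock):
    # Stand-in for a fatal error the application can still report itself.
    sock.sendall(b"500\ndetailed error page")
    sock.close()


def run(backend):
    proxy_side, backend_side = socket.socketpair()
    worker = threading.Thread(target=backend, args=(backend_side,))
    worker.start()
    try:
        return proxy_request(proxy_side)
    finally:
        worker.join()
        proxy_side.close()
```

Here `run(fatal_error_backend)` relays the application's own 500 page, while `run(segfaulting_backend)` yields the proxy-generated 503.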

Using a modified version of furl that now supports unix sockets, for segfaults I get:

$ sudo furl --server unix:///run/php/fpm-www.sock --script /w/fatal-error.php --docroot /srv/mediawiki/docroot/wikipedia.org/ 'http://en.wikipedia.org/w/fatal-error.php?password=<redacted>&action=segfault'
Fatal error: Uncaught exception 'Adoy\FastCGI\ForbiddenException' with message 'Not in white list. Check listen.allowed_clients.' in /home/oblivian/furl:585
Stack trace:
#0 /home/oblivian/furl(450): Adoy\FastCGI\Client->wait_for_response()
#1 /home/oblivian/furl(648): Adoy\FastCGI\Client->request()
#2 /home/oblivian/furl(695): Adoy\FastCGI\doFcgiRequest()
#3 {main}

Looking at furl's code, this happens when the server closes the connection abruptly without sending back any FCGI data (EOF is reached and no data is left to return and no valid FCGI response has been sent).
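
For reference, the condition can be sketched like this in Python. This is not furl's code: `read_records`, `make_record` and `NoResponseError` are hypothetical names for the example, and only the 8-byte record header layout and the record type numbers come from the FastCGI specification. The reader raises when EOF arrives before a single complete record has been received.

```python
import io
import struct

# FastCGI record header: version, type, requestId, contentLength, paddingLength, reserved
FCGI_HEADER = struct.Struct("!BBHHBB")  # 8 bytes, network byte order
FCGI_END_REQUEST = 3
FCGI_STDOUT = 6


class NoResponseError(Exception):
    """Hypothetical error: the stream ended before any FastCGI record arrived."""


def make_record(rec_type, content, request_id=1):
    """Build one unpadded FastCGI record (helper for the example only)."""
    return FCGI_HEADER.pack(1, rec_type, request_id, len(content), 0, 0) + content


def read_records(stream):
    """Read records until FCGI_END_REQUEST or EOF; fail if nothing valid was read."""
    records = []
    while True:
        header = stream.read(FCGI_HEADER.size)
        if len(header) < FCGI_HEADER.size:
            break  # EOF, or a truncated header
        _version, rec_type, _req_id, content_len, padding_len, _ = FCGI_HEADER.unpack(header)
        content = stream.read(content_len)
        stream.read(padding_len)  # skip padding bytes
        records.append((rec_type, content))
        if rec_type == FCGI_END_REQUEST:
            return records
    if not records:
        # The server closed the connection without sending any FCGI data.
        raise NoResponseError("connection closed before any FastCGI record")
    return records
```

Feeding it an empty stream, for example `read_records(io.BytesIO(b""))`, raises `NoResponseError`, which corresponds to the case where the worker dies before writing anything.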

There isn't much that can be done there from what I can see.

OK. I'm fine with this staying as it is. It's not really broken. It's just that under FPM and PHP7 (vs HHVM), this results in a different kind of error, and thus a different error page, coming from the FPM/Apache layer rather than from MW/PHP itself.