
[Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm.
Open, High, Public

Description

I've been testing fatal errors a lot lately, and sometime between yesterday and today the fatal error page stopped working for all production requests.

On HHVM, fatal errors still result in an HTTP 500 whose content is provided by HHVM itself via hhvm-fatal-error.php. Example:

But on PHP 7 with the A-B test cookie set (note: this is not using X-Wikimedia-Debug or anything like that), we now get an HTTP 503 status code, without the expected error page that explains what happened:

Event Timeline

Krinkle created this task.May 14 2019, 10:33 PM
Restricted Application added a subscriber: Aklapper.May 14 2019, 10:33 PM

Tagging monitoring as well because we generally associate HTTP 500 with application errors, and HTTP 503 with traffic/infra problems. If the PHP 7 rollout changes that, it would complicate matters and make incident investigation more difficult.

jijiki added a subscriber: jijiki.
ArielGlenn triaged this task as High priority.May 15 2019, 10:39 AM
ema added a subscriber: ema.May 15 2019, 10:43 AM

Please provide the full responses, including headers, returned by the HHVM and PHP7 origin servers.

Joe added a subscriber: Joe.May 15 2019, 11:53 AM

Hi, I've tested a few combinations of errors, and the only case where this happens is when you choose action=segfault.

In that case, the 503 is returned directly by Apache, so the cause is probably php-fpm. If this behaviour changed from earlier, I guess the problem is in the code, as no configuration was changed.

Varnish doesn't do any transformation here and just reproduces what it's getting from the backend.

Joe renamed this task from [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" to [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm..May 15 2019, 11:54 AM
Joe removed a project: Traffic.
Joe added a project: serviceops.
Joe added a comment.May 15 2019, 11:57 AM

I changed the title of the task to reflect my findings, and changed the associated tags accordingly.

Gilles moved this task from Inbox to Radar on the Performance-Team board.May 20 2019, 8:22 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.
jijiki moved this task from Backlog/Radar to In Progress on the User-jijiki board.Jun 5 2019, 6:00 PM
Joe added a comment.Mon, Jul 1, 1:45 PM

As I explained in T187147#5295715, my understanding is that in case of a segfault php-fpm fails to respond properly in any way, forcing Apache to produce a 503 error since it is working in proxy mode.
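
To illustrate the reasoning above, here is a minimal Python sketch of how a gateway in front of a FastCGI backend might choose the HTTP status. It is a simplified model, not Apache's actual implementation: the function name `status_from_backend` is hypothetical, and the 503 fallback simply mirrors what is reported in this task. The only protocol fact it relies on is that a CGI-style backend reply can carry a `Status:` header that the gateway turns into the HTTP status line.

```python
from typing import Optional


def status_from_backend(reply: Optional[bytes]) -> int:
    """Return the HTTP status a gateway would send for a backend reply.

    `reply` is the raw CGI-style output (headers, blank line, body),
    or None/empty when the backend died without sending anything.
    Hypothetical helper for illustration only.
    """
    if not reply:
        # Nothing usable came back (e.g. the worker segfaulted):
        # the gateway has to generate its own error, modelled here as 503.
        return 503
    head, _, _ = reply.partition(b"\r\n\r\n")
    for line in head.split(b"\r\n"):
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"status":
            # e.g. b" 500 Internal Server Error" -> 500
            return int(value.split()[0].decode("ascii"))
    # CGI convention: no Status header means 200.
    return 200
```

In this model, a PHP fatal error that the application manages to report yields a reply with `Status: 500`, while a segfault yields no reply at all, which is why the two cases end up with different status codes.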

Joe added a comment.Mon, Jul 1, 2:38 PM

Using a modified version of furl that now supports unix sockets, for segfaults I get:

$ sudo furl --server unix:///run/php/fpm-www.sock --script /w/fatal-error.php --docroot /srv/mediawiki/docroot/wikipedia.org/ 'http://en.wikipedia.org/w/fatal-error.php?password=<redacted>&action=segfault'
Fatal error: Uncaught exception 'Adoy\FastCGI\ForbiddenException' with message 'Not in white list. Check listen.allowed_clients.' in /home/oblivian/furl:585
Stack trace:
#0 /home/oblivian/furl(450): Adoy\FastCGI\Client->wait_for_response()
#1 /home/oblivian/furl(648): Adoy\FastCGI\Client->request()
#2 /home/oblivian/furl(695): Adoy\FastCGI\doFcgiRequest()
#3 {main}

Looking at furl's code, this happens when the server closes the connection abruptly without sending back any FCGI data (EOF is reached and no data is left to return and no valid FCGI response has been sent).
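
The condition described above can be reproduced in isolation. The following Python sketch is not furl's code (furl is a PHP tool); it is an assumed, simplified client that tries to read one FastCGI record and reports when EOF arrives before a full 8-byte record header. The names `NoResponseError`, `read_record` and `simulate` are hypothetical, and the worker is simulated with a local socket pair rather than a real php-fpm process.

```python
import socket
import struct

FCGI_HEADER_LEN = 8   # fixed-size FastCGI record header
FCGI_STDOUT = 6       # record type carrying response data


class NoResponseError(Exception):
    """Hypothetical error: peer closed before sending a complete record."""


def read_exact(sock, n):
    """Read up to n bytes, stopping early only on EOF."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:  # EOF: the peer closed the connection
            break
        buf += chunk
    return buf


def read_record(sock):
    """Read one FastCGI record, or raise if the stream ends first."""
    header = read_exact(sock, FCGI_HEADER_LEN)
    if len(header) < FCGI_HEADER_LEN:
        raise NoResponseError("connection closed before a FastCGI record arrived")
    _version, rec_type, request_id, content_len, padding_len, _reserved = struct.unpack(
        "!BBHHBB", header
    )
    content = read_exact(sock, content_len)
    if len(content) < content_len:
        raise NoResponseError("connection closed in the middle of a record")
    read_exact(sock, padding_len)
    return rec_type, request_id, content


def simulate(worker_crashes):
    """Simulate a worker that either replies or dies without replying."""
    client, worker = socket.socketpair()
    try:
        if not worker_crashes:
            body = b"Status: 500 Internal Server Error\r\n\r\n"
            header = struct.pack("!BBHHBB", 1, FCGI_STDOUT, 1, len(body), 0, 0)
            worker.sendall(header + body)
        worker.close()  # a crashed worker closes without writing anything
        try:
            _rec_type, _request_id, content = read_record(client)
            return "record", content
        except NoResponseError:
            return "eof", b""
    finally:
        worker.close()
        client.close()
```

Calling `simulate(True)` takes the EOF path, which corresponds to the segfault case here: the client has nothing to parse, so it can only report a generic failure.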

There isn't much that can be done there from what I can see.

jijiki moved this task from Backlog to Next up on the serviceops board.Fri, Jul 19, 8:48 AM