[Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm.
Closed, Declined · Public

Assigned To

None

Authored By

	Krinkle
	May 14 2019, 10:33 PM

Description

I've been testing fatal errors a lot lately and sometime between yesterday and today, this stopped working for all production requests.

On HHVM fatal errors still result in HTTP 500 with contents provided by HHVM itself via hhvm-fatal-error.php. Example:

Screenshot 2019-05-14 at 23.32.40.png (1×1 px, 136 KB)

But on PHP 7 with the A-B test cookie set (note: this is not using X-Wikimedia-Debug or anything like that), we now get an HTTP 503 status code, and the expected error page explaining what happened is not served:

Screenshot 2019-05-14 at 23.34.54.png (1×1 px, 268 KB)

Related Objects

Status	Assigned	Task
Resolved	None	T176370 Migrate to PHP 7 in WMF production
Resolved	tstarling	T187147 Port mediawiki/php/wmerrors to PHP7 and deploy
Declined	None	T223336 [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm.

Event Timeline

Krinkle created this task. May 14 2019, 10:33 PM

Restricted Application added a subscriber: Aklapper. May 14 2019, 10:33 PM

Krinkle added a project: Performance-Team (Radar). May 14 2019, 10:34 PM

Krinkle mentioned this in T217846: PHP fatal error handler not working on mwdebug servers.

Tagging monitoring as well because we generally associate HTTP 500 with application errors, and HTTP 503 with traffic/infra problems. If the PHP 7 roll out changes that, this would complicate some matters and make incident investigation more difficult.

Krinkle added a project: Traffic. May 14 2019, 10:39 PM

jijiki added a project: User-jijiki. May 14 2019, 10:41 PM

jijiki subscribed.

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board. May 15 2019, 12:01 AM

ArielGlenn triaged this task as High priority. May 15 2019, 10:39 AM

Please provide the full responses, including headers, returned by the HHVM and PHP7 origin servers.

Hi, I've tested a few combinations of errors, and the only case where this happens is when you choose action=segfault.

In that case, the 503 is returned directly from apache, so probably from php-fpm. If this behaviour changed from earlier, I guess the problem is in the code, as no configuration was changed.

Varnish doesn't do any transformation here and just reproduces what it's getting from the backend.

Joe renamed this task from [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" to [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm.. May 15 2019, 11:54 AM

Joe removed a project: Traffic.

Joe added a project: serviceops.

I changed the title of the task to reflect my findings, and changed the associated tags accordingly.

Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar). May 15 2019, 1:28 PM

Gilles moved this task from Inbox, needs triage to Radar on the Performance-Team board. May 20 2019, 8:22 PM

Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.

Krinkle moved this task from Untriaged to Wikimedia production on the PHP 7.2 support board. May 23 2019, 12:39 AM

jijiki moved this task from Inbox 🐅 to In Progress 🏋️‍♀️ on the User-jijiki board. Jun 5 2019, 6:00 PM

Joe added a parent task: T187147: Port mediawiki/php/wmerrors to PHP7 and deploy. Jun 21 2019, 10:17 AM

As I explained in T187147#5295715, my understanding is that in case of a segfault php-fpm fails to respond properly in any way, which forces Apache to produce a 503 error, given that it's working in proxy mode.

Using a modified version of furl that now supports unix sockets, for segfaults I get:

$ sudo furl --server unix:///run/php/fpm-www.sock --script /w/fatal-error.php --docroot /srv/mediawiki/docroot/wikipedia.org/ 'http://en.wikipedia.org/w/fatal-error.php?password=<redacted>&action=segfault'
Fatal error: Uncaught exception 'Adoy\FastCGI\ForbiddenException' with message 'Not in white list. Check listen.allowed_clients.' in /home/oblivian/furl:585
Stack trace:
#0 /home/oblivian/furl(450): Adoy\FastCGI\Client->wait_for_response()
#1 /home/oblivian/furl(648): Adoy\FastCGI\Client->request()
#2 /home/oblivian/furl(695): Adoy\FastCGI\doFcgiRequest()
#3 {main}

Looking at furl's code, this happens when the server closes the connection abruptly without sending back any FCGI data (EOF is reached and no data is left to return and no valid FCGI response has been sent).
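
To illustrate the failure mode described above, here is a minimal Python sketch. It is not furl's code and not Apache's proxy logic: the function names (`crashing_worker`, `proxy_status`, `run_demo`) are hypothetical, and a plain socket pair stands in for the FastCGI connection. It only shows the general idea that a front end which reads EOF before receiving any response bytes has nothing to relay, so it reports a 503 of its own.

```python
import socket
import threading


def crashing_worker(conn):
    """Stand-in for a worker that dies mid-request: it reads the
    request, then closes the connection without writing a reply."""
    conn.recv(4096)
    conn.close()


def proxy_status(conn, request):
    """Hypothetical front-end logic: forward the request and map
    'connection closed with no data' to a 503 status."""
    conn.settimeout(5)
    conn.sendall(request)
    try:
        data = conn.recv(4096)
    except ConnectionResetError:
        data = b""
    # An empty read means EOF: the backend sent no response at all.
    return 503 if not data else 200


def run_demo():
    """Wire the two sides together over a local socket pair."""
    proxy_side, worker_side = socket.socketpair()
    worker = threading.Thread(target=crashing_worker, args=(worker_side,))
    worker.start()
    try:
        return proxy_status(proxy_side, b"action=segfault")
    finally:
        worker.join()
        proxy_side.close()


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the status is chosen entirely by the front end, which matches the observation that the 503 page comes from the Apache/FPM layer rather than from MediaWiki or PHP.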

There isn't much that can be done there from what I can see.

jijiki moved this task from Incoming 🐫 to API Gateway 🥌 on the serviceops board. Jul 19 2019, 8:48 AM

OK. I'm fine with this staying as it is. It's not really broken. It's just that under FPM and PHP7 (vs HHVM), this results in a different kind of error, and thus a different error page, which comes from the FPM/Apache layer rather than from MW/PHP itself.

	F29052817: Screenshot 2019-05-14 at 23.34.54.png
	May 14 2019, 10:33 PM

	F29052804: Screenshot 2019-05-14 at 23.32.40.png
	May 14 2019, 10:33 PM
