Before sending it to Projects-Cleanup, is the wmerrors PHP extension still useful? In T56163: Port wmerrors / php-fatal-error.html to HHVM, it was decided that we didn't need this for HHVM. Will we need it again for PHP 7, or can it be safely archived?

Summary

See research at T187147#4817837 and T187147#5165179.

Good:

Recoverable and unrecoverable fatal errors have events emitted on PHP 7.2 similar to how HHVM provided these (except PHP7 uses a Throwable as uncaught exception, instead of an Error event; MediaWiki already tracks both).

Problems:

No stack traces, at all, for PHP 7 fatal errors. HHVM provided these through a custom mechanism. For PHP 5, we had a custom extension. We might need that again, or find a different clever workaround if possible.
Changes to rsyslog/Kafka mean that large errors are now completely lost instead of truncated. More info at T187147#5165518. (fixed with https://phabricator.wikimedia.org/T187147#5182892)
Catchable fatals are reported as exceptions (in exception.log) and use our custom fatal error page handler.

Details

Subject	Repo	Branch	Lines +/-
Add wmerrors.error_script_file	mediawiki/php/wmerrors	master	+94 -893
Use require instead of include in ServiceConfig.php	operations/mediawiki-config	master	+3 -5
mediawiki::php: ensure sapis error is correct for wmerrors:	operations/puppet	production	+1 -1
mediawiki::php: install wmerrors everywhere	operations/puppet	production	+2 -6
wmerrors: enable on the mwdebug servers	operations/puppet	production	+1 -0
mediawiki::php: add a fatal error page to go with the proposed wmerrors feature	operations/puppet	production	+155 -0
Add a fatal error page to go with the proposed wmerrors feature	operations/mediawiki-config	master	+111 -0
logger: Produce traces for all Throwables	mediawiki/core	wmf/1.34.0-wmf.6	+87 -13
logger: Produce traces for all Throwables	mediawiki/core	master	+87 -13
Port to PHP 7.2	mediawiki/php/wmerrors	master	+167 -332
wmerrors is moving to PHP 7 exclusively	integration/config	master	+6 -2
logstash: enforce max length on "message" and "msg" fields	operations/puppet	production	+16 -0
logstash: add logstash-filter-truncate plugin	operations/software/logstash/plugins	master	+4 -1
exception: Document the three ways we listen for errors/fatals/exceptions	mediawiki/core	master	+23 -0
errorpages: Remove unused php-fatal-error.html file	operations/mediawiki-config	master	+0 -33

Related Objects

Status	Assigned	Task
Resolved	None	T176370 Migrate to PHP 7 in WMF production
Resolved	• tstarling	T187147 Port mediawiki/php/wmerrors to PHP7 and deploy
Resolved	• tstarling	T224076 wmerrors has no license information
Declined	None	T223336 [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm.

Event Timeline

There are a very large number of changes, so older changes are hidden.

In T187147#5172711, @Krinkle wrote:

@herron Yes, I can do that to help avoid this specific instance of the problem. The problem I'd like to solve in this task, however, is to be able to detect it. That is, if there is a significant influx of errors that happen to be too large, this really should show up under type:mediawiki in some kind of channel (e.g. syslog_truncated) with a severity of "ERROR", so that they still get counted and immediately trigger the necessary alarms during a MediaWiki deployment.

For that it's totally fine if the JSON is not parsed and only stored as raw message text. It would still be picked up at least with a timestamp, type and a bit of context (e.g. which MW server it came from), and the raw text will have to suffice for a MW developer to figure out where it came from and either to fix the problem that caused the error to be reported, or to make the error message less big.

But the immediate issue is to be able to at least index them and detect the problem.

Ok, that's fair. And I did have success testing the logstash-filter-truncate plugin in beta, so we should be able to use this plugin in prod as long as it doesn't melt under the load. I'll work on an updated logstash plugin bundle and a config to get that deployed.
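To illustrate the fallback discussed above, here is a minimal sketch in Python (not the actual Logstash or rsyslog configuration). It assumes a hypothetical ingestion step that receives one raw line per event. The function name `to_log_event`, the `MAX_RAW_LENGTH` cap, and the exact field names are assumptions for illustration; only the `type:mediawiki` tag, the example channel name `syslog_truncated`, and the "ERROR" severity come from the comment above.

```python
import json
from datetime import datetime, timezone

# Hypothetical cap on how much raw text is kept for an unparseable payload.
MAX_RAW_LENGTH = 32768


def to_log_event(raw_line, host, received_at=None):
    """Parse raw_line as a JSON object; on failure, fall back to a raw-text event.

    The fallback event is still tagged type=mediawiki with level ERROR, so an
    influx of oversized or cut-off messages remains countable and can alert.
    """
    received_at = received_at or datetime.now(timezone.utc).isoformat()
    try:
        event = json.loads(raw_line)
        if not isinstance(event, dict):
            raise ValueError("payload is not a JSON object")
        event.setdefault("type", "mediawiki")
        return event
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError, so cut-off JSON
        # lands here as well.
        return {
            "type": "mediawiki",
            "channel": "syslog_truncated",
            "level": "ERROR",
            "host": host,
            "timestamp": received_at,
            "message": raw_line[:MAX_RAW_LENGTH],
        }
```

With this shape, a payload that was cut off in transit is not dropped: it is indexed with its timestamp, type, and originating host, and the raw text is left for a developer to inspect.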

herron added a project: User-herron.May 10 2019, 5:16 PM

herron added projects: SRE, MediaWiki-Logevents.

herron moved this task from Backlog to Working on on the User-herron board.

Michael subscribed.May 13 2019, 10:56 AM

herron added a project: Wikimedia-Logstash.May 13 2019, 3:22 PM

herron moved this task from Backlog to In Dev/Progress on the Wikimedia-Logstash board.May 13 2019, 3:25 PM

Change 509880 had a related patch set uploaded (by Herron; owner: Herron):
[operations/software/logstash/plugins@master] logstash: add logstash-filter-truncate plugin

https://gerrit.wikimedia.org/r/509880

Change 509924 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: enforce max length on "message" and "msg" fields

https://gerrit.wikimedia.org/r/509924

Change 509880 merged by Herron:
[operations/software/logstash/plugins@master] logstash: add logstash-filter-truncate plugin

https://gerrit.wikimedia.org/r/509880

Change 509924 merged by Herron:
[operations/puppet@production] logstash: enforce max length on "message" and "msg" fields

https://gerrit.wikimedia.org/r/509924

Change 486840 restored by Tim Starling:
WIP: port to PHP 7.2

Reason:
Actually Platform is interested

https://gerrit.wikimedia.org/r/486840

Change 511627 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[integration/config@master] wmerrors is moving to PHP 7 exclusively

https://gerrit.wikimedia.org/r/511627

Change 511627 merged by jenkins-bot:
[integration/config@master] wmerrors is moving to PHP 7 exclusively

https://gerrit.wikimedia.org/r/511627

Legoktm added a project: php-wmerrors.May 21 2019, 8:05 AM

• kchapman assigned this task to • tstarling.May 21 2019, 12:56 PM

• kchapman edited projects, added Platform Team Workboards (Doing); removed Platform Team Legacy.

Basic porting work on wmerrors is hopefully complete.

It still writes a text-based log entry to a socket. I suppose JSON is needed?

Also, catchable fatal errors are displayed in an ugly way: the format is now "Uncaught %s\n thrown", where %s is the entire exception object converted to string, including the multi-line backtrace. So wmerrors log entries for catchable fatal errors now have the backtrace twice. That's fixable.

The whole thing is quite duplicative of code we have in MWExceptionHandler.php. I gather the reasons for using wmerrors are:

To provide backtraces for OOMs and timeouts
To provide backtraces for catchable fatals. I don't know why this is hard to do. It's probably just some bug in MWExceptionHandler that stops this from happening.
For error display. Currently PHP 7 in production gives a 500 error with an empty body, which is converted by Varnish to a generic error page, whereas HHVM tells you what actually happened. wmerrors gives a customisable error page.

For flexibility and to avoid duplication, ideally error handling would just be done in userspace. A comment in the PHP code says that it's unsafe to run userspace code when handling fatal errors. That's not quite as true as it used to be, but I still have some concerns. PHP 7 runs userspace code after zend_bailout(), which wipes the current execution stack. Perhaps if you run userspace code without first wiping the stack, the code would be exposed to stack corruption due to an OOM occurring in the middle of stack modification. We could just try it and see how often it crashes.

You can certainly run userspace code on timeout, that's basically what we're doing in Excimer.

In T187147#5207128, @tstarling wrote:

To provide backtraces for catchable fatals. I don't know why this is hard to do. It's probably just some bug in MWExceptionHandler that stops this from happening.

Your comment inspired me to track down the bug. Turns out it was just some code that was still checking for Exception only rather than Throwable.

Change 512179 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512179

Change 512179 merged by jenkins-bot:
[mediawiki/core@master] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512179

Change 486840 merged by jenkins-bot:
[mediawiki/php/wmerrors@master] Port to PHP 7.2

https://gerrit.wikimedia.org/r/486840

Legoktm added a subtask: T224076: wmerrors has no license information.May 24 2019, 9:35 PM

ReleaseTaggerBot edited projects, added MW-1.34-notes (1.34.0-wmf.7; 2019-05-28); removed MW-1.34-notes (1.34.0-wmf.5; 2019-05-14).May 25 2019, 3:01 AM

Change 512533 had a related patch set uploaded (by Krinkle; owner: Anomie):
[mediawiki/core@wmf/1.34.0-wmf.6] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512533

Change 512533 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.6] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512533

Mentioned in SAL (#wikimedia-operations) [2019-05-26T13:37:39Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.6/includes/debug: T187147 / rMW2be7aa4bc4af (duration: 00m 51s)

Krinkle removed a project: Patch-For-Review.May 26 2019, 1:39 PM

ReleaseTaggerBot edited projects, added MW-1.34-notes (1.34.0-wmf.6; 2019-05-21); removed MW-1.34-notes (1.34.0-wmf.7; 2019-05-28).May 26 2019, 2:00 PM

jijiki subscribed.May 27 2019, 6:10 AM

hashar awarded a token.May 27 2019, 9:08 AM

In T187147#5207128, @tstarling wrote:

Basic porting work on wmerrors is hopefully complete.

It still writes a text-based log entry to a socket. I suppose JSON is needed?

I would say reproduce the behaviour of HHVM: uncatchable fatal errors are logged to syslog by HHVM directly, with log level error. We generally require services to use the cee format, which sends a JSON structure and makes the message far easier to parse and index in Logstash.

See https://wikitech.wikimedia.org/wiki/Logstash/SRE_onboard for a brief introduction on how logging from clients is supposed to work.
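As a rough illustration of the cee format mentioned above, the following Python sketch builds and parses a syslog message body that consists of the `@cee:` cookie followed by a JSON object. The helper names `format_cee` and `parse_cee` and the field names (`message`, `level`, `channel`) are assumptions for illustration; this is not how wmerrors or HHVM emit their entries.

```python
import json

# Cookie that marks a syslog message body as carrying structured JSON.
CEE_COOKIE = "@cee:"


def format_cee(message, level="ERROR", **fields):
    """Build a cee-style message body: the cookie followed by a JSON object."""
    payload = {"message": message, "level": level}
    payload.update(fields)
    return CEE_COOKIE + " " + json.dumps(payload, sort_keys=True)


def parse_cee(line):
    """Return the JSON object from a cee-style body, or None for plain text."""
    if not line.startswith(CEE_COOKIE):
        return None
    return json.loads(line[len(CEE_COOKIE):])
```

A text-based entry without the cookie is returned as `None` by `parse_cee`, which mirrors why a plain-text wmerrors log line would not be indexed as structured fields.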

The whole thing is quite duplicative of code we have in MWExceptionHandler.php. I gather the reasons for using wmerrors are:

To provide backtraces for OOMs and timeouts

Correct. Timeouts (the ones where Excimer can't time out on its own) and OOMs should be handled by wmerrors, as php-fpm is not able to handle them itself besides logging an error. We're interested not only in handling the message as best we can, but also in adding a backtrace when possible.

For error display. Currently PHP 7 in production gives a 500 error with an empty body, which is converted by Varnish to a generic error page, whereas HHVM tells you what actually happened. wmerrors gives a customisable error page.

We'd probably like to be able to use the same PHP page for both? That's currently modules/mediawiki/templates/hhvm-fatal-error.php.erb in ops/puppet.

To provide backtraces for catchable fatals. I don't know why this is hard to do. It's probably just some bug in MWExceptionHandler that stops this from happening.

AIUI this was fixed by @Anomie's patch already?

Krinkle mentioned this in T224488: PHP Fatal Error: Interface 'Wikibase\DataModel\Statement\StatementListHolder' not found.May 28 2019, 3:39 PM

Krinkle mentioned this in T224491: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262).May 28 2019, 4:10 PM

Example from T224491 that lacked the original stack trace.

PHP Fatal Error: Interface 'Wikibase\DataModel\Statement\StatementListHolder' not found

#0 [internal function]: MWExceptionHandler::handleFatalError()
#1 {main}

PHP Fatal Error from line 29 of /srv/mediawiki/php-1.34.0-wmf.6/vendor/wikibase/data-model/src/Entity/Item.php: Interface 'Wikibase\DataModel\Statement\StatementListHolder' not found

Request ID: XO1PdwpAADkAAAL-DwYAAABE
Request   : GET enwiki /wiki/{…} (page view)

herron updated the task description.Jun 4 2019, 3:19 PM

Long JSON messages to ELK are being truncated since T187147#5182892 which addresses "Changes to rsyslog/Kafka mean that large errors are now completely lost instead of truncated."

These will no longer be dropped by ELK, but very long messages will still contain truncated raw JSON. We would need to truncate the fatal_exception.trace field as it is emitted by mw to address that afaict.
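A minimal sketch of truncating the trace at the producer, so that the emitted JSON stays valid instead of being cut off downstream. It is written in Python for illustration only (MediaWiki itself is PHP), and it assumes that `fatal_exception.trace` is a string nested under a `fatal_exception` object. The limit, the marker text, and the function name `truncate_trace` are hypothetical.

```python
import json

# Hypothetical limit and marker; the real values would depend on the pipeline.
MAX_TRACE_CHARS = 8192
TRUNCATION_MARKER = "... [trace truncated]"


def truncate_trace(event, limit=MAX_TRACE_CHARS):
    """Return a copy of event whose fatal_exception.trace is at most limit chars.

    The input event is not modified. Events without a string trace are
    returned as an unchanged shallow copy.
    """
    result = dict(event)
    fatal = result.get("fatal_exception")
    if isinstance(fatal, dict):
        trace = fatal.get("trace")
        if isinstance(trace, str) and len(trace) > limit:
            keep = max(limit - len(TRUNCATION_MARKER), 0)
            fatal = dict(fatal)
            # The final slice keeps the result within limit even when the
            # limit is shorter than the marker itself.
            fatal["trace"] = (trace[:keep] + TRUNCATION_MARKER)[:limit]
            result["fatal_exception"] = fatal
    return result
```

Because only the one field is shortened before serialization, `json.dumps(truncate_trace(event))` still yields a complete JSON document that can be parsed into fields.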

Change 516763 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/php/wmerrors@master] Add wmerrors.error_script_file

https://gerrit.wikimedia.org/r/516763

gerritbot added a project: Patch-For-Review.Jun 13 2019, 11:17 PM

Change 516975 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Add a fatal error page to go with the proposed wmerrors feature

https://gerrit.wikimedia.org/r/516975

Legoktm closed subtask T224076: wmerrors has no license information as Resolved.Jun 14 2019, 2:53 AM

Change 516975 abandoned by Tim Starling:
Add a fatal error page to go with the proposed wmerrors feature

Reason:
Moving to puppet

https://gerrit.wikimedia.org/r/516975

Change 516988 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/puppet@production] Add a fatal error page to go with the proposed wmerrors feature

https://gerrit.wikimedia.org/r/516988

Joe added a subtask: T223336: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm..Jun 21 2019, 10:17 AM

Joe added a project: serviceops.

Joe moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.Jun 21 2019, 10:19 AM

@Legoktm kindly did my job and created an upstream package for wmerrors. Thanks a lot!

@tstarling I see a few patches for wmerrors are open, should I wait for any of those to be merged before creating a package?

If not, I think we can test wmerrors early next week and complete a deployment to production in a few days.

Joe moved this task from Doing 😎 to this.quarter 🍕 on the serviceops board.Jun 24 2019, 6:33 AM

Change 516763 merged by jenkins-bot:
[mediawiki/php/wmerrors@master] Add wmerrors.error_script_file

https://gerrit.wikimedia.org/r/516763

Joe moved this task from this.quarter 🍕 to Doing 😎 on the serviceops board.Jun 27 2019, 4:54 PM

Mentioned in SAL (#wikimedia-operations) [2019-06-28T11:04:24Z] <_joe_> uploading php-wmerrors to thirdparty/php72 - T187147

Change 516988 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: add a fatal error page to go with the proposed wmerrors feature

https://gerrit.wikimedia.org/r/516988

Change 519961 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] wmerrors: enable on the mwdebug servers

https://gerrit.wikimedia.org/r/519961

Change 519961 merged by Giuseppe Lavagetto:
[operations/puppet@production] wmerrors: enable on the mwdebug servers

https://gerrit.wikimedia.org/r/519961

After installing wmerrors on the test servers, these are my results:

OOM errors are now correctly treated: we get the correct error page and the trace is sent to fatal.log
Timeout errors are not treated correctly: we get a semi-blank error page (which I think is the default exception page for MediaWiki) and the trace is sent to exception.log, but it contains only a reference to the Excimer timeout exception (not ideal).
Nomethod errors behave like timeout errors: they get sent to exception.log and show the default error page.
Segfaults: nothing has changed. We get no stack trace anywhere and the error page has no information on what went wrong.

The result for segfaults doesn't surprise me - because of the way php-fpm works, the whole worker process crashes and thus there is no way to handle failure elsewhere. But I think for the rest of the uncatchable fatals (like OOMs) wmerrors solved the issue.

I think we need to:

Configure a better-looking error page to be shown for exceptions
Maybe(?) modify our logging so that fatal exceptions are logged to fatal.log, which looks like the obvious place where this should happen.

I will proceed and deploy wmerrors fleet-wide, and update the task description.

@tstarling (or anyone else), any suggestions on how to proceed? I would imagine that letting wmerrors handle catchable fatals would be OK? Or do we prefer to properly configure how we report them? (I'm not even sure if that's possible.)

Joe updated the task description.Jul 1 2019, 10:51 AM

Change 519986 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::php: install wmerrors everywhere

https://gerrit.wikimedia.org/r/519986

Change 519986 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: install wmerrors everywhere

https://gerrit.wikimedia.org/r/519986

Change 519990 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] mediawiki::php: ensure sapis error is correct for wmerrors:

https://gerrit.wikimedia.org/r/519990

Change 519990 abandoned by Jbond:
mediawiki::php: ensure sapis error is correct for wmerrors:

Reason:
not the correct fix

https://gerrit.wikimedia.org/r/519990

Joe mentioned this in T223336: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm..Jul 1 2019, 1:45 PM

Talking with @Krinkle, I realized that I saw that ugly error message because the endpoint doesn't initialize the skin at all. Actual fatals on real pages should include the skin.

Also, having those fatals in exception.log is not a problem apparently.

• WDoranWMF moved this task from Doing to Team 1 on the Platform Team Workboards board.Jul 5 2019, 5:45 PM

• WDoranWMF edited projects, added Platform Team Workboards (Team 1); removed Platform Team Workboards (Doing).

• WDoranWMF moved this task from Backlog to Doing on the Platform Team Workboards (Team 1) board.

jijiki moved this task from Doing 😎 to API Gateway 🥌 on the serviceops board.Jul 12 2019, 10:44 AM

• WDoranWMF edited projects, added Platform Team Workboards (Clinic Duty Team); removed Platform Team Workboards (Team 1).Jul 17 2019, 12:23 AM

Is this blocking deployment of PHP 7?

In T187147#5343202, @tstarling wrote:

Is this blocking deployment of PHP 7?

In my opinion, it should not at this point. But I'd like to hear @Krinkle's opinion too.

The first point was blocking (proper traces), and has been checked off. The remaining work should not be blocking in my opinion.

I've now confirmed the traces for Logstash as well:

oom: HHVM to logstash/mw/fatal, PHP7 to logstash/mw/exception. Both with rich meta data and proper traces.
nomethod: HHVM to logstash/mw/fatal, PHP7 to logstash/mw/exception. Both with rich meta data and proper traces.
timeout: HHVM to logstash/mw/fatal, PHP7 to logstash/mw/exception. Both with rich meta data and proper traces.
segfault: HHVM (nothing). PHP7 tries to emit to logstash/mw/fatal and a truncated version of this message reaches Logstash under _type:mediawiki/channel:jsonTruncated.

LGTM.

Joe edited projects, added serviceops-radar; removed serviceops.Jul 25 2019, 1:44 PM

Krinkle edited projects, added MediaWiki-Debug-Logger; removed MediaWiki-Logevents.Jul 26 2019, 5:03 PM

• WDoranWMF moved this task from PHP7 (TEC4) to mop on the Platform Engineering board.Jul 26 2019, 6:33 PM

• WDoranWMF edited projects, added Core Platform Team Initiatives (PHP7 (TEC4)); removed Platform Engineering (PHP7 (TEC4)).

• WDoranWMF moved this task from Inbox to Later on the Platform Team Workboards (Clinic Duty Team) board.Jul 29 2019, 3:51 PM

• WDoranWMF moved this task from Later to Ready (WIP:5) on the Platform Team Workboards (Clinic Duty Team) board.

• WDoranWMF moved this task from Ready (WIP:5) to Later on the Platform Team Workboards (Clinic Duty Team) board.

• WDoranWMF removed a project: Platform Team Workboards (Clinic Duty Team).Jul 29 2019, 11:09 PM

• WDoranWMF added a project: Platform Team Workboards (Clinic Duty Team).

• WDoranWMF moved this task from Later to Backlog on the Platform Team Workboards (Clinic Duty Team) board.Jul 29 2019, 11:12 PM

fgiunchedi moved this task from In Dev/Progress to Radar on the Wikimedia-Logstash board.Aug 6 2019, 12:19 PM

• WDoranWMF moved this task from Backlog to Inbox on the Platform Team Workboards (Clinic Duty Team) board.Aug 13 2019, 6:22 PM

What's the current status of this task? Are there needs from CPT?

• WDoranWMF moved this task from Inbox to Blocked Externally on the Platform Team Workboards (Clinic Duty Team) board.Aug 13 2019, 6:28 PM

fgiunchedi added a project: observability.Aug 19 2019, 2:28 PM

The last point of the task description is wmerrors taking care of showing the error page (to inform the user what happened and give them debug info they can share with developers + dispatch a message also to Logstash and Statsd).

AFAIK that has now been done by @tstarling as part of:

In T187147#5287846, @gerritbot wrote:

Change 516763 merged by jenkins-bot:
[mediawiki/php/wmerrors@master] Add wmerrors.error_script_file

https://gerrit.wikimedia.org/r/516763

Closing assuming that there is no further work tracked here.

Krinkle updated the task description.Aug 29 2019, 1:37 AM

Maintenance_bot removed a project: Patch-For-Review.Aug 29 2019, 2:10 AM

daniel moved this task from Blocked Externally to Done on the Platform Team Workboards (Clinic Duty Team) board.Aug 30 2019, 7:14 AM

Krinkle closed subtask T223336: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm. as Declined.Sep 16 2019, 11:22 PM

Change 627563 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[operations/mediawiki-config@master] Use require instead of include in ServiceConfig.php

https://gerrit.wikimedia.org/r/627563

gerritbot added a project: Patch-For-Review.Sep 15 2020, 5:07 PM

Change 627563 merged by jenkins-bot:
[operations/mediawiki-config@master] Use require instead of include in ServiceConfig.php

https://gerrit.wikimedia.org/r/627563

Aklapper removed subscribers: • Imarlier, Anomie.Oct 16 2020, 5:40 PM

Legoktm mentioned this in T300827: Replace wmerrors PHP extension with functionality built into PHP itself.Feb 3 2022, 6:25 AM