Page MenuHomePhabricator

Port mediawiki/php/wmerrors to PHP7 and deploy
Closed, ResolvedPublic

Description

Before sending it to Projects-Cleanup, is the wmerrors PHP extension still useful? In T56163: Port wmerrors / php-fatal-error.html to HHVM it was decided that we didn't need this for HHVM, will we need it again for PHP 7? Or can it be safely archived?

Summary

See research at T187147#4817837 and T187147#5165179.

Good:

  • Recoverable and unrecoverable fatal errors have events emitted on PHP 7.2 similar to how HHVM provided these (except PHP7 uses a Throwable as uncaught exception, instead of an Error event; MediaWiki already tracks both).

Problems:

  • No stack traces, at all, for PHP 7 fatal errors. HHVM provided these through a custom mechanism. For PHP 5, we had a custom extension. We might need that again, or find a different clever workaround if possible.
  • Changes to rsyslog/Kafka mean that large errors are now completely lost instead of truncated. More info at T187147#5165518. (fixed with https://phabricator.wikimedia.org/T187147#5182892)
  • Catchable fatals are reported as exceptions (in exception.log) and use our custom fatal error page handler.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@herron Yes, I can do that to help avoid this specific instance of the problem. The problem I'd like to solve in this task, however, is to be able to detect it. That is, if there is a significant influx of errors that happen to be too large, this really should show up under type:mediawiki in some kind of channel (e.g. syslog_truncated) with a severity of "ERROR", so that they still get counted and immediately trigger the necessary alarms during a MediaWiki deployment.

For that it's totally find if the json is no parsed and only stored as raw message text. It would still be picked up at least with a timestamp, type and a bit of context (e.g. which MW server it came from), and the raw text will have to suffice for a MW developer to figure out where it came from and either to fix the problem that caused the error to be reported, or to make the error message less big.

But the immediate issue is to be able to at least index them and detect the problem.

@herron Yes, I can do that to help avoid this specific instance of the problem. The problem I'd like to solve in this task, however, is to be able to detect it. That is, if there is a significant influx of errors that happen to be too large, this really should show up under type:mediawiki in some kind of channel (e.g. syslog_truncated) with a severity of "ERROR", so that they still get counted and immediately trigger the necessary alarms during a MediaWiki deployment.

For that it's totally find if the json is no parsed and only stored as raw message text. It would still be picked up at least with a timestamp, type and a bit of context (e.g. which MW server it came from), and the raw text will have to suffice for a MW developer to figure out where it came from and either to fix the problem that caused the error to be reported, or to make the error message less big.

But the immediate issue is to be able to at least index them and detect the problem.

Ok, that's fair. And I did have success testing the logstash-filter-truncate plugin in beta, so we should be able to use this plugin in prod as long as it doesn't melt under the load. I'll work on an updated logstash plugin bundle and a config to get that deployed.

Change 509880 had a related patch set uploaded (by Herron; owner: Herron):
[operations/software/logstash/plugins@master] logstash: add logstash-filter-truncate plugin

https://gerrit.wikimedia.org/r/509880

Change 509924 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: enforce max length on "message" and "msg" fields

https://gerrit.wikimedia.org/r/509924

Change 509880 merged by Herron:
[operations/software/logstash/plugins@master] logstash: add logstash-filter-truncate plugin

https://gerrit.wikimedia.org/r/509880

Change 509924 merged by Herron:
[operations/puppet@production] logstash: enforce max length on "message" and "msg" fields

https://gerrit.wikimedia.org/r/509924

Change 486840 restored by Tim Starling:
WIP: port to PHP 7.2

Reason:
Actually Platform is interested

https://gerrit.wikimedia.org/r/486840

Change 511627 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[integration/config@master] wmerrors is moving to PHP 7 exclusively

https://gerrit.wikimedia.org/r/511627

Change 511627 merged by jenkins-bot:
[integration/config@master] wmerrors is moving to PHP 7 exclusively

https://gerrit.wikimedia.org/r/511627

Basic porting work on wmerrors is hopefully complete.

It still writes a text-based log entry to a socket. I suppose JSON is needed?

Also, catchable fatal errors are displayed in an ugly way: the format is now "Uncaught %s\n thrown", where %s is the entire exception object converted to string, including the multi-line backtrace. So wmerrors log entries for catchable fatal errors now have the backtrace twice. That's fixable.

The whole thing is quite duplicative of code we have in MWExceptionHandler.php. I gather the reasons for using wmerrors are:

  • To provide backtraces for OOMs and timeouts
  • To provide backtraces for catchable fatals. I don't know why this is hard to do. It's probably just some bug in MWExceptionHandler that stops this from happening.
  • For error display. Currently PHP 7 in production gives a 500 error with an empty body, which is converted by Varnish to a generic error page, whereas HHVM tells you what actually happened. wmerrors gives a customisable error page.

For flexibility and to avoid duplication, ideally error handling would just be done in userspace. A comment in the PHP code says that it's unsafe to run userspace code when handling fatal errors. That's not quite as true as it used to be, but I still have some concerns. PHP 7 runs userspace code after zend_bailout(), which wipes the current execution stack. Perhaps if you run userspace code without first wiping the stack, the code would be exposed to stack corruption due to an OOM occurring in the middle of stack modification. We could just try it and see how often it crashes.

You can certainly run userspace code on timeout, that's basically what we're doing in Excimer.

  • To provide backtraces for catchable fatals. I don't know why this is hard to do. It's probably just some bug in MWExceptionHandler that stops this from happening.

Your comment inspired me to track down the bug. Turns out it was just some code that was still checking for Exception only rather than Throwable.

Change 512179 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512179

Change 512179 merged by jenkins-bot:
[mediawiki/core@master] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512179

Change 486840 merged by jenkins-bot:
[mediawiki/php/wmerrors@master] Port to PHP 7.2

https://gerrit.wikimedia.org/r/486840

Change 512533 had a related patch set uploaded (by Krinkle; owner: Anomie):
[mediawiki/core@wmf/1.34.0-wmf.6] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512533

Change 512533 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.6] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512533

Mentioned in SAL (#wikimedia-operations) [2019-05-26T13:37:39Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.6/includes/debug: T187147 / rMW2be7aa4bc4af (duration: 00m 51s)

Basic porting work on wmerrors is hopefully complete.

It still writes a text-based log entry to a socket. I suppose JSON is needed?

I would say reproduce the behaviour of HHVM: uncatchable fatal errors are logged to syslog by HHVM directly, with log level error. We generally require services to use the cee format which sends indeed a json structure, which makes the message way more parsable/indexable on logstash.

See https://wikitech.wikimedia.org/wiki/Logstash/SRE_onboard for a brief introduction on how logging from clients is supposed to work.

The whole thing is quite duplicative of code we have in MWExceptionHandler.php. I gather the reasons for using wmerrors are:

  • To provide backtraces for OOMs and timeouts

Correct. Timeouts (the ones where Excimer can't timeout on its own) and OOMs should be handled by wmerrors as php-fpm is not able to handle them itself besides logging an error. We're interested not only in handling the message best we can, but also in adding a backtrace when possible.

  • For error display. Currently PHP 7 in production gives a 500 error with an empty body, which is converted by Varnish to a generic error page, whereas HHVM tells you what actually happened. wmerrors gives a customisable error page.

we'd like probably to be able to use the same php page for both? That's currently modules/mediawiki/templates/hhvm-fatal-error.php.erb in ops/puppet

  • To provide backtraces for catchable fatals. I don't know why this is hard to do. It's probably just some bug in MWExceptionHandler that stops this from happening.

AIUI this was fixed by @Anomie's patch already?

Example from T224491 that lacked the original stack trace.

PHP Fatal Error: Interface 'Wikibase\DataModel\Statement\StatementListHolder' not found

#0 [internal function]: MWExceptionHandler::handleFatalError()
#1 {main}

PHP Fatal Error from line 29 of /srv/mediawiki/php-1.34.0-wmf.6/vendor/wikibase/data-model/src/Entity/Item.php: Interface 'Wikibase\DataModel\Statement\StatementListHolder' not found

Request ID: XO1PdwpAADkAAAL-DwYAAABE
Request   : GET enwiki /wiki/{…} (page view)

Long JSON messages to ELK are being truncated since T187147#5182892 which addresses "Changes to rsyslog/Kafka mean that large errors are now completely lost instead of truncated."

These will no longer be dropped by ELK, but very long messages will still contain truncated raw JSON. We would need to truncate the fatal_exception.trace field as it is emitted by mw to address that afaict.

Change 516763 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/php/wmerrors@master] Add wmerrors.error_script_file

https://gerrit.wikimedia.org/r/516763

Change 516975 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Add a fatal error page to go with the proposed wmerrors feature

https://gerrit.wikimedia.org/r/516975

Change 516975 abandoned by Tim Starling:
Add a fatal error page to go with the proposed wmerrors feature

Reason:
Moving to puppet

https://gerrit.wikimedia.org/r/516975

Change 516988 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/puppet@production] Add a fatal error page to go with the proposed wmerrors feature

https://gerrit.wikimedia.org/r/516988

@Legoktm kindly did my job and created an upstream package for wmerrors. Thanks a lot!

@tstarling I see a few patches for wmerrors are open, should I wait for any of those to be merged before creating a package?

If not, I think we can test wmerrors early next week and complete a deployment to production in a few days.

Change 516763 merged by jenkins-bot:
[mediawiki/php/wmerrors@master] Add wmerrors.error_script_file

https://gerrit.wikimedia.org/r/516763

Mentioned in SAL (#wikimedia-operations) [2019-06-28T11:04:24Z] <_joe_> uploading php-wmerrors to thirdparty/php72 - T187147

Change 516988 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: add a fatal error page to go with the proposed wmerrors feature

https://gerrit.wikimedia.org/r/516988

Change 519961 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] wmerrors: enable on the mwdebug servers

https://gerrit.wikimedia.org/r/519961

Change 519961 merged by Giuseppe Lavagetto:
[operations/puppet@production] wmerrors: enable on the mwdebug servers

https://gerrit.wikimedia.org/r/519961

After installing wmerrors on the test servers, these are my results:

  • OOM errors are now correctly treated: we get the correct error page and the trace is sent to fatal.log
  • Timeout errors are not treated correctly : we get a semi-blank error page (which I think is the default exception page for MediaWiki) and the trace is sent to exception.log, but it contains only a reference to the excimer Timeout exception (not ideal).
  • Nomethod errors behave like timeout errors - they get sent to exception.log and show the default error page
  • Segfaults nothing has changed - we get no stack trace anywhere and the error page has no information on what went wrong.

The result for segfaults doesn't surprise me - because of the way php-fpm works, the whole worker process crashes and thus there is no way to handle failure elsewhere. But I think for the rest of the uncatchable fatals (like OOMs) wmerrors solved the issue.

I think we need to:

  • Configure a better-looking error page to be shown for exceptions
  • Maybe(?) modify our logging so that fatal exceptions are logged to fatal.log, which looks like the obvious place where this should happen.

I will proceed and deploy wmerrors fleet-wide, and update the task description.

@tstarling (or anyone else) any suggestion on how to proceed? I would imagine that letting wmerrors handle catchable fatals would be ok? Or do we prefer to properly configure how we report them (I'm not even sure if that's possible)

Change 519986 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::php: install wmerrors everywhere

https://gerrit.wikimedia.org/r/519986

Change 519986 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: install wmerrors everywhere

https://gerrit.wikimedia.org/r/519986

Change 519990 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] mediawiki::php: ensure sapis error is correct for wmerrors:

https://gerrit.wikimedia.org/r/519990

Change 519990 abandoned by Jbond:
mediawiki::php: ensure sapis error is correct for wmerrors:

Reason:
not the correct fix

https://gerrit.wikimedia.org/r/519990

Talking with @Krinkle I realized that the reason why I saw that ugly error message is because the endpoint doesn't initialize the skin at all. Actual fatals on real pages should include the skin.

Also, having those fatals in exception.log is not a problem apparently.

Is this blocking deployment of PHP 7?

Is this blocking deployment of PHP 7?

In my opinion, it should not at this point. But I'd like to hear @Krinkle's opinion too.

The first point was blocking (proper traces), and has been checked off. The remaining work should not be blocking in my opinion.

I've now confirmed the traces for Logstash as well:

  • oom: HHVM to logstash/mw/fatal, PHP7 to logstash/mw/exception. Both with rich meta data and proper traces.
  • nomethod: HHVM to logstash/mw/fatal, PHP7 to logstash/mw/exception. Both with rich meta data and proper traces.
  • timeout: HHVM to logstash/mw/fatal, PHP7 to logstash/mw/exception. Both with rich meta data and proper traces.
  • segfault: HHVM (nothing). PHP7 tries to emit to logstash/mw/fatal and a truncated version of this message reaches Logstash under _type:mediawiki/channel:jsonTruncated.

LGTM.

What's the current status of this task? Are there needs from CPT?

The last point of the task description is wmerrors taking care of showing the error page (to inform the user what happened and give them debug info they can share with developers + dispatch a message also to Logstash and Statsd).

Afaik that was done now by @tstartling as part of:

Change 516763 merged by jenkins-bot:
[mediawiki/php/wmerrors@master] Add wmerrors.error_script_file

https://gerrit.wikimedia.org/r/516763

Closing assuming that there is no further work tracked here.

Change 627563 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[operations/mediawiki-config@master] Use require instead of include in ServiceConfig.php

https://gerrit.wikimedia.org/r/627563

Change 627563 merged by jenkins-bot:
[operations/mediawiki-config@master] Use require instead of include in ServiceConfig.php

https://gerrit.wikimedia.org/r/627563