Port mediawiki/php/wmerrors to PHP7 and deploy
Open, High, Public

Description

Before sending it to Cleanup: is the wmerrors PHP extension still useful? In T56163: Port wmerrors / php-fatal-error.html to HHVM it was decided that we didn't need this for HHVM. Will we need it again for PHP 7, or can it be safely archived?

Summary

See research at T187147#4817837 and T187147#5165179.

Good:

  • Recoverable and unrecoverable fatal errors have events emitted on PHP 7.2, similar to how HHVM provided them (except that PHP 7 surfaces them as an uncaught Throwable instead of an error event; MediaWiki already tracks both).

Problems:

  • No stack traces, at all, for PHP 7 fatal errors. HHVM provided these through a custom mechanism. For PHP 5, we had a custom extension. We might need that again, or find a different clever workaround if possible.
  • Changes to rsyslog/Kafka mean that large errors are now completely lost instead of truncated. More info at T187147#5165518. (fixed with https://phabricator.wikimedia.org/T187147#5182892)
  • Catchable fatals are reported as exceptions (in exception.log) and don't use the same error page as wmerrors

Event Timeline

Change 508653 merged by jenkins-bot:
[mediawiki/core@master] exception: Document the three ways we listen for errors/fatals/exceptions

https://gerrit.wikimedia.org/r/508653

Krinkle added a comment (edited). May 7 2019, 8:17 PM

These kinds of oversized messages used to be trimmed between (r)syslog and Logstash, and we still have a logstash-filter for it at puppet: logstash/filter-syslog.conf. However, this no longer works: the transport between MediaWiki and Logstash (Monolog, rsyslog, Kafka) no longer appears to have any limit (or at least not a limit we are exceeding here).

As such, the oversized values actually arrive at Logstash in full and are run through all the filters as normal (including the above filter-syslog.conf, which is mostly a no-op for our purposes), but are then rejected per T187147#5165518 for being too big for the Elasticsearch index.

So… I guess we could update the logstash-filter to instead mutate the message after JSON-parsing, trimming any fields that exceed the limit Logstash can accept in its term index (and, as before, tag syslog_truncated to record this fact).
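
A minimal sketch of that idea, assuming the logstash-filter-truncate plugin (which is deployed later in this thread) and an illustrative byte cap near the Elasticsearch/Lucene term limit; the field names and limit here are assumptions rather than the eventual production config, and tagging only the events that were actually truncated would need extra handling:

filter {
  # Sketch only: cap oversized fields before they exceed the index term limit
  # (roughly 32766 bytes per term); the fields and the limit are illustrative.
  truncate {
    fields => ["message", "msg"]
    length_bytes => 32000
  }
}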

Krinkle updated the task description. May 8 2019, 2:07 PM
kchapman added a subscriber: kchapman.

CPT needs to review and plan for long term fixes to this.

herron added a comment. May 8 2019, 8:21 PM

Comparing (in beta) a working mediawiki log message and a log message failing with max_bytes_length_exceeded_exception, I'm noticing differences in JSON formatting as well. For example:

Working log (as read from kafka):

{
  "timestamp": "2019-05-08T19:40:25+00:00",
  "logsource": "deployment-jobrunner03",
  "host": "deployment-jobrunner03",
  "program": "mediawiki",
  "severity": "info",
  "facility": "user",
  "@timestamp": "2019-05-08T19:40:25.000000+00:00",
  "@version": 1,
  "message": "MessageCache::load: Loading en... got from local cache",
  "type": "mediawiki",
  "channel": "MessageCache",
  "level": "INFO",
  "shard": "s3",
  "normalized_message": "MessageCache::load: Loading en... got from local cache",
  "phpversion": "5.6.99-hhvm",
  "url": "\/rpc\/RunSingleJob.php",
  "ip": "172.16.4.124",
  "http_method": "POST",
  "server": "deployment-jobrunner03.deployment-prep.eqiad.wmflabs",
  "referrer": null,
  "wiki": "wikidatawiki",
  "mwversion": "1.34.0-alpha",
  "reqId": "6b7a56ab568d318dbdb55df5",
  "private": false
}

Log message causing logstash/elasticsearch max_bytes_length_exceeded_exception triggered by fatal-error.php (as read from kafka, truncated manually at <snip>)

{
  "timestamp": "2019-05-08T19:41:28+00:00",
  "logsource": "deployment-mediawiki-07",
  "host": "deployment-mediawiki-07",
  "program": "mediawiki",
  "severity": "err",
  "facility": "user",
  "msg": " {\"@timestamp\":\"2019-05-08T19:41:28.000000+00:00\",\"@version\":1,\"host\":\"deployment-mediawiki-07\",\"message\":\"[XNMw6KwQBHcAACGLaLoAAAAJ] \/w\/fatal-error.php?password=redacted&action=segfault&who=keith   PHP Fatal Error from line 148 of \/srv\/mediawiki\/w\/fatal-error.php: Stack overflow\",\"type\":\"mediawiki\",\"channel\":\"fatal\",\"level\":\"ERROR\",\"shard\":\"s3\",\"normalized_message\":\"[{exception_id}] {exception_url}   PHP Fatal Error from line 148 of \/srv\/mediawiki\/w\/fatal-error.php: Stack overflow\",\"phpversion\":\"5.6.99-hhvm\",\"url\":\"\/w\/fatal-error.php?password=redacted&action=segfault&who=keith\",\"ip\":\"172.16.4.21\",\"http_method\":\"GET\",\"server\":\"en.wikipedia.beta.wmflabs.org\",\"referrer\":null,\"unique_id\":\"XNMw6KwQBHcAACGLaLoAAAAJ\",\"wiki\":\"enwiki\",\"mwversion\":\"1.34.0-alpha\",\"reqId\":\"XNMw6KwQBHcAACGLaLoAAAAJ\",\"fatal_exception\":{\"class\":\"ErrorException\",\"message\":\"PHP Fatal Error: Stack overflow\",\"code\":16777217,\"file\":\"\/srv\/mediawiki\/w\/fatal-error.php\",\"line\":148,\"trace\":\"#0 \/srv\/mediawiki\/w\/fatal-error.php(148): NO_FUNCTION_GIVEN()\\n#1 [internal function]: CauseFatalError::doSegfault(integer)\\n#2 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#3 [internal function]: CauseFatalError::doSegfault(integer)\\n#4 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#5 [internal function]: CauseFatalError::doSegfault(integer)\\n#6 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#7 [internal function]: CauseFatalError::doSegfault(integer)\\n#8 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#9 [internal function]: CauseFatalError::doSegfault(integer)\\n#10 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#11 [internal function]: CauseFatalError::doSegfault(integer)\\n#12 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#13 [internal function]: CauseFatalError::doSegfault(integer)\\n#14 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#15 [internal function]: CauseFatalError::doSegfault(integer)\\n#16 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#17 [internal function]: CauseFatalError::doSegfault(integer)\\n#18 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#19 [internal function]: CauseFatalError::doSegfault(integer)\\n#20 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#21 [internal function]: CauseFatalError::doSegfault(integer)\\n#22 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#23 [internal function]: CauseFatalError::doSegfault(integer)\\n#24 <snip>"
}

Looks like we have a formatting issue here in addition to the length limit. Likely occurring when the log is emitted by mediawiki, or within the localhost rsyslog. I'm hoping @fgiunchedi can have a look as well.

I think we'll want to sort the formatting out first. If we were to truncate this as-is within logstash, we'd have broken raw JSON in a "msg" field while missing several other useful fields.

To figure out whether it is MW-related, compare to a message from a different fake fatal error (not action=segfault, but action=oom), which produces a shorter message that is otherwise formatted by the same code.

herron added a comment. May 9 2019, 5:58 PM

After further testing I'm seeing that these messages arrive at rsyslog with @cee formatting, but truncated, meaning the msg field does not contain valid JSON, specifically within the json-in-json-in-json field msg.fatal_exception.trace. The msg field comes from rsyslog's extraction of the JSON payload prefixed by the @cee cookie in the syslog message.

Because we are nesting JSON, I don't think enforcing a max field length at rsyslog or logstash is the best solution. Neither is able to parse the invalid JSON message in order to shorten a specific field, so we would end up with still-broken, but shorter, raw JSON indexed by ELK as a single field.

Could we set a max length on the fatal_exception.trace field within MediaWiki?

@herron Yes, I can do that to help avoid this specific instance of the problem. The problem I'd like to solve in this task, however, is to be able to detect it. That is, if there is a significant influx of errors that happen to be too large, this really should show up under type:mediawiki in some kind of channel (e.g. syslog_truncated) with a severity of "ERROR", so that they still get counted and immediately trigger the necessary alarms during a MediaWiki deployment.

For that it's totally fine if the JSON is not parsed and only stored as raw message text. It would still be picked up at least with a timestamp, type, and a bit of context (e.g. which MW server it came from), and the raw text will have to suffice for a MW developer to figure out where it came from and then either fix the problem that caused the error to be reported, or make the error message smaller.

But the immediate issue is to be able to at least index them and detect the problem.

Ok, that's fair. And I did have success testing the logstash-filter-truncate plugin in beta, so we should be able to use this plugin in prod as long as it doesn't melt under the load. I'll work on an updated logstash plugin bundle and a config to get that deployed.

herron moved this task from Backlog to Working on on the User-herron board.

Change 509880 had a related patch set uploaded (by Herron; owner: Herron):
[operations/software/logstash/plugins@master] logstash: add logstash-filter-truncate plugin

https://gerrit.wikimedia.org/r/509880

Change 509924 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: enforce max length on "message" and "msg" fields

https://gerrit.wikimedia.org/r/509924

Change 509880 merged by Herron:
[operations/software/logstash/plugins@master] logstash: add logstash-filter-truncate plugin

https://gerrit.wikimedia.org/r/509880

Change 509924 merged by Herron:
[operations/puppet@production] logstash: enforce max length on "message" and "msg" fields

https://gerrit.wikimedia.org/r/509924

Change 486840 restored by Tim Starling:
WIP: port to PHP 7.2

Reason:
Actually Platform is interested

https://gerrit.wikimedia.org/r/486840

Change 511627 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[integration/config@master] wmerrors is moving to PHP 7 exclusively

https://gerrit.wikimedia.org/r/511627

Change 511627 merged by jenkins-bot:
[integration/config@master] wmerrors is moving to PHP 7 exclusively

https://gerrit.wikimedia.org/r/511627

Basic porting work on wmerrors is hopefully complete.

It still writes a text-based log entry to a socket. I suppose JSON is needed?

Also, catchable fatal errors are displayed in an ugly way: the format is now "Uncaught %s\n thrown", where %s is the entire exception object converted to string, including the multi-line backtrace. So wmerrors log entries for catchable fatal errors now have the backtrace twice. That's fixable.
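
For illustration, the duplication comes from the string conversion itself: casting a Throwable to string already embeds the multi-line trace, so appending a backtrace on top of the "Uncaught %s\n thrown" format repeats it (a small standalone sketch, not wmerrors code):

<?php
// Casting a Throwable to string already includes the "Stack trace:" lines,
// so a format of "Uncaught %s\n thrown" plus an appended backtrace ends up
// carrying the trace twice.
$e = new Error( 'example failure' );
echo 'Uncaught ' . (string)$e . "\n  thrown\n";
// Prints something like:
//   Uncaught Error: example failure in /path/to/script.php:5
//   Stack trace:
//   #0 {main}
//     thrown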

The whole thing is quite duplicative of code we have in MWExceptionHandler.php. I gather the reasons for using wmerrors are:

  • To provide backtraces for OOMs and timeouts
  • To provide backtraces for catchable fatals. I don't know why this is hard to do. It's probably just some bug in MWExceptionHandler that stops this from happening.
  • For error display. Currently PHP 7 in production gives a 500 error with an empty body, which is converted by Varnish to a generic error page, whereas HHVM tells you what actually happened. wmerrors gives a customisable error page.

For flexibility and to avoid duplication, ideally error handling would just be done in userspace. A comment in the PHP code says that it's unsafe to run userspace code when handling fatal errors. That's not quite as true as it used to be, but I still have some concerns. PHP 7 runs userspace code after zend_bailout(), which wipes the current execution stack. Perhaps if you run userspace code without first wiping the stack, the code would be exposed to stack corruption due to an OOM occurring in the middle of stack modification. We could just try it and see how often it crashes.

You can certainly run userspace code on timeout, that's basically what we're doing in Excimer.
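
For reference, the userspace approach under discussion looks roughly like the sketch below; this uses a shutdown handler and is not what wmerrors or MWExceptionHandler actually does, and it only sees error_get_last() metadata because the bailout has already unwound the failing stack:

<?php
// Minimal sketch of userspace fatal-error handling via a shutdown handler.
// By the time this runs the engine has bailed out, so no backtrace of the
// failing frame is available; only the last error's metadata remains.
register_shutdown_function( function () {
    $err = error_get_last();
    if ( $err === null ) {
        return;
    }
    $fatal = E_ERROR | E_PARSE | E_CORE_ERROR | E_COMPILE_ERROR;
    if ( !( $err['type'] & $fatal ) ) {
        return;
    }
    // In production this would be a structured syslog/Logstash record.
    error_log( json_encode( [
        'channel' => 'fatal',
        'level'   => 'ERROR',
        'message' => $err['message'],
        'file'    => $err['file'],
        'line'    => $err['line'],
    ] ) );
} );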

  • To provide backtraces for catchable fatals. I don't know why this is hard to do. It's probably just some bug in MWExceptionHandler that stops this from happening.

Your comment inspired me to track down the bug. Turns out it was just some code that was still checking for Exception only rather than Throwable.
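
In other words (hypothetical function names, just to illustrate the class of bug): under PHP 7 an engine fatal arrives as an \Error, which implements \Throwable but is not an \Exception, so an Exception-only check silently skips trace generation.

<?php
// Hypothetical illustration of the bug class described above.
function getTraceBuggy( $e ) {
    // Misses \Error subclasses (e.g. TypeError) because \Error is not an
    // \Exception under PHP 7, so no trace is produced for them.
    return $e instanceof Exception ? $e->getTraceAsString() : '';
}

// Fixed: accept anything implementing the common \Throwable interface.
function getTraceFixed( Throwable $e ) {
    return $e->getTraceAsString();
}

echo getTraceBuggy( new TypeError( 'x' ) );  // '' (trace silently lost)
echo getTraceFixed( new TypeError( 'x' ) );  // '#0 {main}'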

Change 512179 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512179

Change 512179 merged by jenkins-bot:
[mediawiki/core@master] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512179

Change 486840 merged by jenkins-bot:
[mediawiki/php/wmerrors@master] Port to PHP 7.2

https://gerrit.wikimedia.org/r/486840

Change 512533 had a related patch set uploaded (by Krinkle; owner: Anomie):
[mediawiki/core@wmf/1.34.0-wmf.6] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512533

Change 512533 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.6] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512533

Mentioned in SAL (#wikimedia-operations) [2019-05-26T13:37:39Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.6/includes/debug: T187147 / 2be7aa4bc4af36 (duration: 00m 51s)

jijiki added a subscriber: jijiki. May 27 2019, 6:10 AM
Joe added a comment. May 27 2019, 3:46 PM

Basic porting work on wmerrors is hopefully complete.
It still writes a text-based log entry to a socket. I suppose JSON is needed?

I would say reproduce the behaviour of HHVM: uncatchable fatal errors are logged to syslog by HHVM directly, with log level error. We generally require services to use the cee format, which indeed sends a JSON structure and makes the message far more parsable/indexable in logstash.

See https://wikitech.wikimedia.org/wiki/Logstash/SRE_onboard for a brief introduction on how logging from clients is supposed to work.
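
Roughly, a cee-formatted line is just the "@cee:" cookie followed by a JSON object inside the syslog message; a minimal PHP sketch (the field names are illustrative, not the production schema):

<?php
// Sketch of emitting a cee-structured syslog message from PHP, so rsyslog can
// parse the JSON after the "@cee:" cookie instead of treating the line as
// free text. Field names here are illustrative only.
openlog( 'mediawiki', LOG_NDELAY | LOG_PID, LOG_USER );
syslog( LOG_ERR, '@cee: ' . json_encode( [
    'channel' => 'fatal',
    'level'   => 'ERROR',
    'message' => 'PHP Fatal Error: example',
] ) );
closelog();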

The whole thing is quite duplicative of code we have in MWExceptionHandler.php. I gather the reasons for using wmerrors are:

  • To provide backtraces for OOMs and timeouts

Correct. Timeouts (the ones where Excimer can't time out on its own) and OOMs should be handled by wmerrors, as php-fpm is not able to handle them itself beyond logging an error. We're interested not only in handling the message as best we can, but also in adding a backtrace when possible.

  • For error display. Currently PHP 7 in production gives a 500 error with an empty body, which is converted by Varnish to a generic error page, whereas HHVM tells you what actually happened. wmerrors gives a customisable error page.

We'd probably like to be able to use the same PHP page for both? That's currently modules/mediawiki/templates/hhvm-fatal-error.php.erb in ops/puppet.

  • To provide backtraces for catchable fatals. I don't know why this is hard to do. It's probably just some bug in MWExceptionHandler that stops this from happening.

AIUI this was fixed by @Anomie's patch already?

Example from T224491 that lacked the original stack trace.

PHP Fatal Error: Interface 'Wikibase\DataModel\Statement\StatementListHolder' not found

#0 [internal function]: MWExceptionHandler::handleFatalError()
#1 {main}

PHP Fatal Error from line 29 of /srv/mediawiki/php-1.34.0-wmf.6/vendor/wikibase/data-model/src/Entity/Item.php: Interface 'Wikibase\DataModel\Statement\StatementListHolder' not found

Request ID: XO1PdwpAADkAAAL-DwYAAABE
Request   : GET enwiki /wiki/{…} (page view)
herron updated the task description. Jun 4 2019, 3:19 PM
herron added a comment. Jun 4 2019, 3:32 PM

Long JSON messages to ELK have been truncated since T187147#5182892, which addresses "Changes to rsyslog/Kafka mean that large errors are now completely lost instead of truncated."

These will no longer be dropped by ELK, but very long messages will still contain truncated raw JSON. To address that, we would need to truncate the fatal_exception.trace field as it is emitted by MediaWiki, afaict.
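
Something along these lines on the MediaWiki side would do it; a sketch with an assumed cap and helper name, not an actual patch:

<?php
// Sketch: cap the trace string before it is serialized into the log record,
// keeping the resulting JSON well under the downstream index limit.
// The 50000-byte cap and the helper name are assumptions, not real values.
function capTrace( string $trace, int $maxBytes = 50000 ): string {
    if ( strlen( $trace ) <= $maxBytes ) {
        return $trace;
    }
    return substr( $trace, 0, $maxBytes ) . "\n...[trace truncated]";
}

// e.g. $record['fatal_exception']['trace'] = capTrace( $trace );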

Change 516763 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/php/wmerrors@master] Add wmerrors.error_script_file

https://gerrit.wikimedia.org/r/516763

Change 516975 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Add a fatal error page to go with the proposed wmerrors feature

https://gerrit.wikimedia.org/r/516975

Change 516975 abandoned by Tim Starling:
Add a fatal error page to go with the proposed wmerrors feature

Reason:
Moving to puppet

https://gerrit.wikimedia.org/r/516975

Change 516988 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/puppet@production] Add a fatal error page to go with the proposed wmerrors feature

https://gerrit.wikimedia.org/r/516988

Joe moved this task from Backlog to Doing on the serviceops board. Jun 21 2019, 10:19 AM
Joe added a comment. Jun 21 2019, 10:25 AM

@Legoktm kindly did my job and created an upstream package for wmerrors. Thanks a lot!

@tstarling I see a few patches for wmerrors are open, should I wait for any of those to be merged before creating a package?

If not, I think we can test wmerrors early next week and complete a deployment to production in a few days.

Joe moved this task from Doing to Externally Blocked on the serviceops board. Jun 24 2019, 6:33 AM

Change 516763 merged by jenkins-bot:
[mediawiki/php/wmerrors@master] Add wmerrors.error_script_file

https://gerrit.wikimedia.org/r/516763

Joe moved this task from Externally Blocked to Doing on the serviceops board. Jun 27 2019, 4:54 PM

Mentioned in SAL (#wikimedia-operations) [2019-06-28T11:04:24Z] <_joe_> uploading php-wmerrors to thirdparty/php72 - T187147

Change 516988 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: add a fatal error page to go with the proposed wmerrors feature

https://gerrit.wikimedia.org/r/516988

Change 519961 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] wmerrors: enable on the mwdebug servers

https://gerrit.wikimedia.org/r/519961

Change 519961 merged by Giuseppe Lavagetto:
[operations/puppet@production] wmerrors: enable on the mwdebug servers

https://gerrit.wikimedia.org/r/519961

Joe added a comment (edited). Jul 1 2019, 10:47 AM

After installing wmerrors on the test servers, these are my results:

  • OOM errors are now handled correctly: we get the correct error page and the trace is sent to fatal.log.
  • Timeout errors are not handled correctly: we get a semi-blank error page (which I think is the default exception page for MediaWiki) and the trace is sent to exception.log, but it contains only a reference to the Excimer timeout exception (not ideal).
  • Nomethod errors behave like timeout errors: they get sent to exception.log and show the default error page.
  • Segfaults: nothing has changed; we get no stack trace anywhere and the error page has no information on what went wrong.

The result for segfaults doesn't surprise me - because of the way php-fpm works, the whole worker process crashes and thus there is no way to handle failure elsewhere. But I think for the rest of the uncatchable fatals (like OOMs) wmerrors solved the issue.

I think we need to:

  • Configure a better-looking error page to be shown for exceptions
  • Maybe(?) modify our logging so that fatal exceptions are logged to fatal.log, which looks like the obvious place where this should happen.

I will proceed and deploy wmerrors fleet-wide, and update the task description.

@tstarling (or anyone else): any suggestion on how to proceed? I would imagine that letting wmerrors handle catchable fatals would be OK? Or would we prefer to properly configure how we report them (I'm not even sure that's possible)?

Joe updated the task description. Jul 1 2019, 10:51 AM

Change 519986 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::php: install wmerrors everywhere

https://gerrit.wikimedia.org/r/519986

Change 519986 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: install wmerrors everywhere

https://gerrit.wikimedia.org/r/519986

Change 519990 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] mediawiki::php: ensure sapis error is correct for wmerrors:

https://gerrit.wikimedia.org/r/519990

Change 519990 abandoned by Jbond:
mediawiki::php: ensure sapis error is correct for wmerrors:

Reason:
not the correct fix

https://gerrit.wikimedia.org/r/519990

Joe added a comment. Jul 2 2019, 10:01 AM

Talking with @Krinkle, I realized that the reason I saw that ugly error message is that the endpoint doesn't initialize the skin at all. Actual fatals on real pages should include the skin.

Also, having those fatals in exception.log is not a problem apparently.

jijiki moved this task from Doing to Next up on the serviceops board. Jul 12 2019, 10:44 AM

Is this blocking deployment of PHP 7?

Joe added a comment. Mon, Jul 22, 8:38 AM

Is this blocking deployment of PHP 7?

In my opinion, it should not at this point. But I'd like to hear @Krinkle's opinion too.

The first point was blocking (proper traces), and has been checked off. The remaining work should not be blocking in my opinion.

I've now confirmed the traces for Logstash as well:

  • oom: HHVM to logstash/mw/fatal, PHP7 to logstash/mw/exception. Both with rich metadata and proper traces.
  • nomethod: HHVM to logstash/mw/fatal, PHP7 to logstash/mw/exception. Both with rich metadata and proper traces.
  • timeout: HHVM to logstash/mw/fatal, PHP7 to logstash/mw/exception. Both with rich metadata and proper traces.
  • segfault: HHVM (nothing). PHP7 tries to emit to logstash/mw/fatal and a truncated version of this message reaches Logstash under _type:mediawiki/channel:jsonTruncated.

LGTM.

Joe edited projects, added serviceops-radar; removed serviceops. Thu, Jul 25, 1:44 PM

What's the current status of this task? Are there needs from CPT?