
Port mediawiki/php/wmerrors to PHP7 and deploy
Open, High, Public

Description

Before sending it to Cleanup: is the wmerrors PHP extension still useful? In T56163: Port wmerrors / php-fatal-error.html to HHVM it was decided that we didn't need this for HHVM. Will we need it again for PHP 7, or can it be safely archived?

Summary

See research at T187147#4817837 and T187147#5165179.

Good:

  • Recoverable and unrecoverable fatal errors have events emitted on PHP 7.2, similar to how HHVM provided these (except that PHP 7 surfaces them as an uncaught Throwable rather than as an error event; MediaWiki already tracks both).

Problems:

  • No stack traces, at all, for PHP 7 fatal errors. HHVM provided these through a custom mechanism. For PHP 5, we had a custom extension. We might need that again, or find a different clever workaround if possible.
  • Changes to rsyslog/Kafka mean that large errors are now completely lost instead of truncated. More info at T187147#5165518. (fixed with https://phabricator.wikimedia.org/T187147#5182892)

Related Objects

Event Timeline


@tstarling If I understand correctly from Joe and T176370, the current aim is to use the default PHP 7.0 package during early internal experimentation, and PHP 7.2 (or 7.3) for the user-facing rollout.

That presumably means we can go with user-land handling, right? For something basic, I suppose the existing syslog transport for HHVM could be re-purposed for this. Something like HHVMRequestInit.php, but using the shutdown/trace logic from MWExceptionHandler.php, and using PHP 7's auto-prepend to run it?
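
For illustration, a minimal sketch of that idea, assuming the auto-prepend approach described above; the file name, error filtering and record fields are assumptions, not the actual HHVMRequestInit.php or MWExceptionHandler code:

<?php
// Hypothetical auto_prepend_file, e.g. php.ini:
//   auto_prepend_file=/srv/mediawiki/php7-fatal-error.php
register_shutdown_function( function () {
	$error = error_get_last();
	// Only act on fatal error types; warnings/notices are handled elsewhere.
	$fatalTypes = E_ERROR | E_PARSE | E_CORE_ERROR | E_COMPILE_ERROR | E_USER_ERROR;
	if ( !$error || !( $error['type'] & $fatalTypes ) ) {
		return;
	}
	$record = [
		'channel' => 'fatal',
		'message' => "{$error['message']} in {$error['file']}:{$error['line']}",
		'url'     => $_SERVER['REQUEST_URI'] ?? '(cli)',
	];
	// Forward to the local syslog daemon, roughly mirroring the HHVM transport.
	syslog( LOG_ERR, '@cee: ' . json_encode( $record ) );
} );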

sietec added a subscriber: sietec. Nov 28 2018, 2:50 AM

Change 478575 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] errorpages: Remove unused php-fatal-error.html file

https://gerrit.wikimedia.org/r/478575

I checked a variety of fatal errors on both production HHVM and our current php-fpm configuration using the fatal-errors.php script from T210567. Here's what I found. (Disclaimer: I'm still relatively new to our error logging, so there may be something obvious that I missed.)

| Error | Runtime | Post-send | Stack Trace [1] | Logstash Channel | Logfile(s) |
| --- | --- | --- | --- | --- | --- |
| No method | HHVM | no | yes | fatal | fatal.log and hhvm.log |
| No method | HHVM | yes | yes | fatal | fatal.log and hhvm.log |
| No method | php-fpm | no | no | exception | exception.log |
| No method | php-fpm | yes | no | exception | exception.log |
| Out of memory | HHVM | no | yes | fatal | fatal.log and hhvm.log |
| Out of memory | HHVM | yes | yes | fatal | fatal.log and hhvm.log |
| Out of memory | php-fpm | no | yes [2] | fatal | fatal.log and php7.2-fpm.log |
| Out of memory | php-fpm | yes | yes [2] | fatal | fatal.log and php7.2-fpm.log |
| Timeout | HHVM | no | yes | fatal | fatal.log and hhvm.log |
| Timeout | HHVM | yes | yes | fatal | fatal.log and hhvm.log |
| Timeout | php-fpm | no | yes [2] | fatal | fatal.log and php7.2-fpm.log |
| Timeout | php-fpm | yes | yes [2] | fatal | fatal.log and php7.2-fpm.log |
| Segfault | HHVM | no | yes | <none> | fatal.log and hhvm.log |
| Segfault | HHVM | yes | yes | <none> | fatal.log and hhvm.log |
| Segfault | php-fpm | no | no | <none> | php7.2-fpm.log |
| Segfault | php-fpm | yes | no | <none> | php7.2-fpm.log |

[1] all stack traces appear in fatal.log only, not in hhvm.log / php7.2-fpm.log, which is fine
[2] the trace is to MWExceptionHandler::handleFatalError only, not to the code that actually caused the error, which isn't so fine
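
A small standalone repro (not MediaWiki code) of why [2] happens: by the time a shutdown handler runs for a fatal, PHP has already unwound the call stack that caused it, so a backtrace taken there only contains the handler itself.

<?php
register_shutdown_function( function () {
	$error = error_get_last();
	if ( $error && $error['type'] === E_ERROR ) {
		// Prints only this closure's frame, the PHP 7 equivalent of the
		// "#0 MWExceptionHandler::handleFatalError()" one-liner in [2].
		print_r( debug_backtrace() );
	}
} );

set_time_limit( 1 );
function spin() {
	while ( true ) {
		// Burn CPU until "Maximum execution time exceeded" is raised.
	}
}
spin();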

To summarize the differences that I noticed:

  • channels/logs changed for one error type (method not found, which is likely representative of a number of different PHP errors), and it lost its stack trace
  • stack traces for out of memory and timeout are much less useful under php-fpm
  • segfaults are not recorded in fatal.log under php-fpm. Although they are recorded in php7.2-fpm.log, they have no stack trace.

We identified this task as a likely candidate for an outside contractor to complete, based on current team structure and availability. The goal is to restore logging of stack traces that would be lost in a switch to PHP7 / php-fpm.

Change 478575 merged by jenkins-bot:
[operations/mediawiki-config@master] errorpages: Remove unused php-fatal-error.html file

https://gerrit.wikimedia.org/r/478575

Change 486840 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/php/wmerrors@master] WIP: port to PHP 7.2

https://gerrit.wikimedia.org/r/486840

Change 486840 abandoned by MaxSem:
WIP: port to PHP 7.2

Reason:
Platform is not interested in this.

https://gerrit.wikimedia.org/r/486840

demon removed a subscriber: demon. Feb 19 2019, 10:30 AM

Still seeing fatals wrongly reported as regular (catchable) exceptions for PHP 7.2, which made T220623 rather confusing to debug.

Krinkle moved this task from Backlog to Wikimedia production on the PHP 7.2 support board.
Joe added a subscriber: Joe. Apr 17 2019, 8:34 AM

Is this task resolved? If not, do we consider this a blocker for a full-scale deployment of PHP 7?

I am not aware of any resolution or progress on this task (beyond @MaxSem 's abandoned patch above).

I do not know if we consider this a blocker for PHP 7 deployment, but it seems to me worth doing. As @Krinkle mentions, the current situation is less than ideal and will continue to be problematic from time to time.

See also T217846 which is making this harder to test.

Joe added a comment. May 6 2019, 2:58 PM

So a couple things:

  • fatals now show up on logstash, via the php-fpm error log that we ship to logstash. Thus I was able to add them to fatalmonitor.
  • I will look into T217846 but to debug this you can probably bypass nginx using curl?
Krinkle added a comment (edited). May 6 2019, 3:56 PM

So a couple things:

  • fatals now show up on logstash, via the php-fpm error log that we ship to logstash. Thus I was able to add them to fatalmonitor.

Which fatalmonitor?

  1. The "Fatalmonitor" dashboard in Logstash.
  2. The "mediawiki-errors" dashboard in Logstash, which queries type:mediawiki, channel:(fatal, exception, error).
  3. Scap canary_checks (uses puppet/logstash_checker.py), which currently queries type:hhvm and type:mediawiki.
  4. Graphite query used by Icinga to detect MediaWiki incidents (with associated Grafana dash).

These last two are the ones that matter most (due to their use by automation), and they do not appear to have been updated.

The second one is what the majority of MediaWiki-related dashboards are based on. Logging it to syslog means no team or product owner will find or detect these errors currently.

If making these work from the MW side is not feasible in the short term, perhaps we can work around it by remapping them at the ingestion point with a logstash filter, like we used to have for other channels? This filter would make sure they end up under type:mediawiki phpversion:php72 … with channel:(error OR exception OR fatal).

Krinkle added a comment (edited). May 7 2019, 5:19 PM

I've reformatted the table from @BPirkle in T187147#4817837 to compare PHP7/HHVM side-by-side:

| Scenario | HHVM | php72-fpm | Status |
| --- | --- | --- | --- |
| Fatal: Method undefined | MW-Logstash (fatal) with stack trace, and syslog | MW-Logstash (exception), and syslog. No trace. | Lost stack trace. Still has MW context, although wrong channel, but less of an issue. |
| Fatal: Method undefined (Post-send) | MW-Logstash (fatal) with stack trace, and syslog | MW-Logstash (exception), and syslog. No trace. | Lost stack trace. |
| Fatal: Out of memory | MW-Logstash (fatal) with stack trace, and syslog | MW-Logstash (fatal). No trace [2]. | Lost stack trace. (And lost syslog, but not important.) |
| Fatal: Out of memory (Post-send) | MW-Logstash (fatal) with stack trace, and syslog | MW-Logstash (fatal), and syslog. No trace [2]. | Lost stack trace. |
| Fatal: Timeout | MW-Logstash (fatal) with stack trace, and syslog | MW-Logstash (fatal), and syslog. No trace [2]. | Lost stack trace. |
| Fatal: Timeout (Post-send) | MW-Logstash (fatal) with stack trace, and syslog | MW-Logstash (fatal), and syslog. No trace [2]. | Lost stack trace. |
| Fatal: Segfault | syslog (no trace), and fatal.log on disk only | <none> | Lost all telemetry. (Not even in php7.2-fpm.log on disk.) |
| Fatal: Segfault (Post-send) | syslog (no trace), and fatal.log on disk only | <none> | Lost all telemetry. |
  • "MW-Logstash" means emitted from MediaWiki with rich logging request context, to both Logstash under type:mediawiki, and to a log file with the channel name as filename.
  • "syslog" means emitted from HHVM or php72-fpm directly without MW context (no request url, wiki, request ID, job runner context, etc.). Sent to Logstash under type:syslog or type:hhvm and to hhvm.log/php7.2-fpm.log on disk.
  • [2] These entries actually do have a trace specified, but it's just one line saying "MWExceptionHandler::handleFatalError", not the original trace.

There are two differences from Bill's analysis:

  • I re-tested the segfaults, as I found it suspicious that HHVM wouldn't log its entries to Logstash, and confirmed that HHVM does indeed have its syslog sent to Logstash as well (example below). The MW-formatted fatal.log entry is indeed on-disk only, which appears to be a recent regression in our HHVM/Logstash config, as this did work originally.
2019-05-07T16:55:52   hhvm   mw1248   -   FATAL   Fatal error: Stack overflow in /srv/mediawiki/w/fatal-error.php on line 148
  • I've classified "Stacktrace: Yes[2]" as "Stacktrace: No[2]", because the empty placeholder that says "MWExceptionHandler::handleFatalError" isn't a real stack trace.
Conclusions

I previously thought that the main issue was that fatals are only logged to syslog. But, looking at this again, that doesn't appear to be the case.

Rather, the main problem is that we've lost stack traces in all failure scenarios. In the more obscure segfault scenario, the issue is actually a complete loss of telemetry as far as I could tell, not even to php7.2-fpm.log; although telemetry wasn't great there previously either, and certainly not monitored by anyone or anything (except for HTTP 500 monitoring, which still works).

Next steps
  1. At the very least I think we need to get stack traces back, so that we can still meaningfully debug problems as they come up.
  2. The fact that HHVM is logging its segfault fatals with nice MW context to mwlog1001/fatal.log, but not to Logstash, is suspicious to me. As far as I know, this used to work until recently. I suspect recent work on logstash-filter refactoring may have broken this ingestion from HHVM syslog into Logstash.

The "Method undefined" fatals are classified as run-time exceptions as of PHP 7. This isn't a big deal, and is actually expected. I've uploaded https://gerrit.wikimedia.org/r/508653 to document why this is.

Change 508653 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] exception: Document the three ways we listen for errors/fatals/exceptions

https://gerrit.wikimedia.org/r/508653

Krinkle added a comment (edited). May 7 2019, 6:31 PM
  1. The fact that HHVM is logging its segfault fatals with nice MW context to mwlog1001/fatal.log, but not to Logstash, is suspicious to me. As far as I know, this used to work until recently. I suspect recent work on logstash-filter refactoring may have broken this ingestion from HHVM syslog into Logstash.

Using /fatal-error.php?action=segfault as an example, on mwdebug1001, I can see that it correctly reaches MWExceptionHandler::handleFatalError and performs $logger = getInstance('fatal'); $logger->error( $msg, [ ... ] );. I've also confirmed that the two Monolog handlers we expect for this are indeed both registered and called at run-time: first LegacyHandler for UDP to mwlog1001, and then SyslogHandler for Logstash.

Except the latter never arrives. Instead, I see the following in syslog:

May 7 18:27:08 mwdebug1001 rsyslog_exporter[5600]: 2019/05/07 18:27:08 error handling stats line: unknown pstat type: 0, line was: 2019-05-07T18:27:08.095311+00:00 mwdebug1001 rsyslogd-pstats: { "name": "imudp(w0)", "origin": "imudp", "called.recvmmsg": 338, "called.recvmsg": 0, "msgs.received": 171 }

And this repeats itself regularly, until I trigger the fatal again at which point the number 171 changes to 172.

This suggests MediaWiki is actually sending something, but either the @cee syslog format MW uses is incorrect, or something is wrong with rsyslog.

EDIT: Ah, this is a red herring. These are just stats for Prometheus. So I guess then the message is sent fine, but lost somewhere in the ELK stack?
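
For context, the @cee convention just prefixes a JSON payload inside an otherwise ordinary syslog line. A rough sketch of that shape; the UDP port (10514) and field names here are assumptions for illustration, not the production rsyslog/Monolog configuration:

<?php
$payload = [
	'@timestamp' => gmdate( 'c' ),
	'type'       => 'mediawiki',
	'channel'    => 'fatal',
	'message'    => 'PHP Fatal Error: example',
];
// PRI <11> = facility "user" (1 * 8) + severity "err" (3), per RFC 3164/5424.
$line = '<11>' . date( 'M j H:i:s' ) . ' ' . gethostname()
	. ' mediawiki: @cee: ' . json_encode( $payload );

$sock = stream_socket_client( 'udp://127.0.0.1:10514', $errno, $errstr );
if ( $sock !== false ) {
	fwrite( $sock, $line );
	fclose( $sock );
}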

herron added a subscriber: herron. May 7 2019, 7:43 PM

Seeing errors like this from Logstash that appear related. This one specifically originated from logstash1007, in /var/log/logstash/logstash-plain.log:

[2019-05-07T19:27:01,456][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2019.05.07", :_type=>"mediawiki", :_routing=>nil}, 2019-05-07T19:27:01.126Z mwdebug1001 %{message}], :response=>{"index"=>{"_index"=>"logstash-2019.05.07", "_type"=>"mediawiki", "_id"=>"AWqTw3Q53aPtkd7P_tdi", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"Document contains at least one immense term in field=\"msg.raw\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[32, 123, 34, 64, 116, 105, 109, 101, 115, 116, 97, 109, 112, 34, 58, 34, 50, 48, 49, 57, 45, 48, 53, 45, 48, 55, 84, 49, 57, 58]...', original message: bytes can be at most 32766 in length; got 64976", "caused_by"=>{"type"=>"max_bytes_length_exceeded_exception", "reason"=>"max_bytes_length_exceeded_exception: bytes can be at most 32766 in length; got 64976"}}}}}

Seems we have some tuning to do with regard to the message length of these logs, but I'm not sure yet what the best course of action is to fix it.

Change 508653 merged by jenkins-bot:
[mediawiki/core@master] exception: Document the three ways we listen for errors/fatals/exceptions

https://gerrit.wikimedia.org/r/508653

Krinkle added a comment (edited). May 7 2019, 8:17 PM

These kinds of oversized messages used to be trimmed between (r)syslog and Logstash, and we still have a logstash filter for it in puppet: logstash/filter-syslog.conf. However, this no longer works because the transport between MediaWiki and Logstash (Monolog, rsyslog, Kafka) no longer appears to have any limit (or at least not a limit we are exceeding here).

As such, the oversized values actually arrive at Logstash in full, and are run through all the filters as normal (including the above filter-syslog.conf, which is mostly a no-op for our purposes), but are then rejected per T187147#5165518 for being too big for the Elasticsearch index.

So… I guess we could update the logstash filter to instead mutate the message after JSON parsing, trimming any field that exceeds the limit Logstash can accept in its term index (and, as before, tag it syslog_truncated to record this fact).

Krinkle updated the task description. May 8 2019, 2:07 PM
kchapman added a subscriber: kchapman.

CPT needs to review and plan for long-term fixes to this.

herron added a comment. May 8 2019, 8:21 PM

Comparing (in beta) a working MediaWiki log message with a log message failing with max_bytes_length_exceeded_exception, I'm noticing differences in JSON formatting as well. For example:

Working log (as read from kafka):

{
  "timestamp": "2019-05-08T19:40:25+00:00",
  "logsource": "deployment-jobrunner03",
  "host": "deployment-jobrunner03",
  "program": "mediawiki",
  "severity": "info",
  "facility": "user",
  "@timestamp": "2019-05-08T19:40:25.000000+00:00",
  "@version": 1,
  "message": "MessageCache::load: Loading en... got from local cache",
  "type": "mediawiki",
  "channel": "MessageCache",
  "level": "INFO",
  "shard": "s3",
  "normalized_message": "MessageCache::load: Loading en... got from local cache",
  "phpversion": "5.6.99-hhvm",
  "url": "\/rpc\/RunSingleJob.php",
  "ip": "172.16.4.124",
  "http_method": "POST",
  "server": "deployment-jobrunner03.deployment-prep.eqiad.wmflabs",
  "referrer": null,
  "wiki": "wikidatawiki",
  "mwversion": "1.34.0-alpha",
  "reqId": "6b7a56ab568d318dbdb55df5",
  "private": false
}

Log message causing the logstash/elasticsearch max_bytes_length_exceeded_exception, triggered by fatal-error.php (as read from kafka, truncated manually at <snip>):

{
  "timestamp": "2019-05-08T19:41:28+00:00",
  "logsource": "deployment-mediawiki-07",
  "host": "deployment-mediawiki-07",
  "program": "mediawiki",
  "severity": "err",
  "facility": "user",
  "msg": " {\"@timestamp\":\"2019-05-08T19:41:28.000000+00:00\",\"@version\":1,\"host\":\"deployment-mediawiki-07\",\"message\":\"[XNMw6KwQBHcAACGLaLoAAAAJ] \/w\/fatal-error.php?password=redacted&action=segfault&who=keith   PHP Fatal Error from line 148 of \/srv\/mediawiki\/w\/fatal-error.php: Stack overflow\",\"type\":\"mediawiki\",\"channel\":\"fatal\",\"level\":\"ERROR\",\"shard\":\"s3\",\"normalized_message\":\"[{exception_id}] {exception_url}   PHP Fatal Error from line 148 of \/srv\/mediawiki\/w\/fatal-error.php: Stack overflow\",\"phpversion\":\"5.6.99-hhvm\",\"url\":\"\/w\/fatal-error.php?password=redacted&action=segfault&who=keith\",\"ip\":\"172.16.4.21\",\"http_method\":\"GET\",\"server\":\"en.wikipedia.beta.wmflabs.org\",\"referrer\":null,\"unique_id\":\"XNMw6KwQBHcAACGLaLoAAAAJ\",\"wiki\":\"enwiki\",\"mwversion\":\"1.34.0-alpha\",\"reqId\":\"XNMw6KwQBHcAACGLaLoAAAAJ\",\"fatal_exception\":{\"class\":\"ErrorException\",\"message\":\"PHP Fatal Error: Stack overflow\",\"code\":16777217,\"file\":\"\/srv\/mediawiki\/w\/fatal-error.php\",\"line\":148,\"trace\":\"#0 \/srv\/mediawiki\/w\/fatal-error.php(148): NO_FUNCTION_GIVEN()\\n#1 [internal function]: CauseFatalError::doSegfault(integer)\\n#2 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#3 [internal function]: CauseFatalError::doSegfault(integer)\\n#4 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#5 [internal function]: CauseFatalError::doSegfault(integer)\\n#6 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#7 [internal function]: CauseFatalError::doSegfault(integer)\\n#8 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#9 [internal function]: CauseFatalError::doSegfault(integer)\\n#10 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#11 [internal function]: CauseFatalError::doSegfault(integer)\\n#12 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#13 [internal function]: CauseFatalError::doSegfault(integer)\\n#14 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#15 [internal function]: CauseFatalError::doSegfault(integer)\\n#16 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#17 [internal function]: CauseFatalError::doSegfault(integer)\\n#18 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#19 [internal function]: CauseFatalError::doSegfault(integer)\\n#20 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#21 [internal function]: CauseFatalError::doSegfault(integer)\\n#22 \/srv\/mediawiki\/w\/fatal-error.php(148): array_map(string, array)\\n#23 [internal function]: CauseFatalError::doSegfault(integer)\\n#24 <snip>"
}

Looks like we have a formatting issue here in addition to the length limit. Likely occurring when the log is emitted by mediawiki, or within the localhost rsyslog. I'm hoping @fgiunchedi can have a look as well.

I think we'll want to sort formatting out first. If we were to truncate this as-is within logstash we'll have broken raw json in a "msg" field while missing several other useful fields.

To figure out if it is MW-related, compare to a message from a different fake fatal error (not action=segfault, but action=oom), which produces a shorter message that is otherwise formatted by the same code.

herron added a comment. May 9 2019, 5:58 PM

After further testing, I'm seeing that these messages are arriving at rsyslog with @cee formatting, but truncated, meaning the msg field does not contain valid JSON, specifically within the json-in-json-in-json field msg.fatal_exception.trace. The msg field comes from rsyslog's extraction of the JSON payload prefixed by the @cee cookie in the syslog message.

Because we are nesting JSON, I don't think enforcing a max field length at rsyslog or Logstash is the best solution. Neither is able to parse the invalid JSON message in order to shorten a specific field, so we would end up with still-broken, but shorter, raw JSON indexed by ELK as a single field.

Could we set a max length on the fatal_exception.trace field within Mediawiki?

@herron Yes, I can do that to help avoid this specific instance of the problem. The problem I'd like to solve in this task, however, is to be able to detect it. That is, if there is a significant influx of errors that happen to be too large, these really should show up under type:mediawiki in some kind of channel (e.g. syslog_truncated) with a severity of "ERROR", so that they still get counted and immediately trigger the necessary alarms during a MediaWiki deployment.

For that it's totally fine if the JSON is not parsed and only stored as raw message text. It would still be picked up at least with a timestamp, type and a bit of context (e.g. which MW server it came from), and the raw text will have to suffice for a MW developer to figure out where it came from and either fix the problem that caused the error to be reported, or make the error message smaller.

But the immediate issue is to be able to at least index them and detect the problem.

Ok, that's fair. And I did have success testing the logstash-filter-truncate plugin in beta, so we should be able to use this plugin in prod as long as it doesn't melt under the load. I'll work on an updated logstash plugin bundle and a config to get that deployed.

herron moved this task from Backlog to Working on on the User-herron board.

Change 509880 had a related patch set uploaded (by Herron; owner: Herron):
[operations/software/logstash/plugins@master] logstash: add logstash-filter-truncate plugin

https://gerrit.wikimedia.org/r/509880

Change 509924 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: enforce max length on "message" and "msg" fields

https://gerrit.wikimedia.org/r/509924

Change 509880 merged by Herron:
[operations/software/logstash/plugins@master] logstash: add logstash-filter-truncate plugin

https://gerrit.wikimedia.org/r/509880

Change 509924 merged by Herron:
[operations/puppet@production] logstash: enforce max length on "message" and "msg" fields

https://gerrit.wikimedia.org/r/509924

Change 486840 restored by Tim Starling:
WIP: port to PHP 7.2

Reason:
Actually Platform is interested

https://gerrit.wikimedia.org/r/486840

Change 511627 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[integration/config@master] wmerrors is moving to PHP 7 exclusively

https://gerrit.wikimedia.org/r/511627

Change 511627 merged by jenkins-bot:
[integration/config@master] wmerrors is moving to PHP 7 exclusively

https://gerrit.wikimedia.org/r/511627

Basic porting work on wmerrors is hopefully complete.

It still writes a text-based log entry to a socket. I suppose JSON is needed?

Also, catchable fatal errors are displayed in an ugly way: the format is now "Uncaught %s\n thrown", where %s is the entire exception object converted to a string, including the multi-line backtrace. So wmerrors log entries for catchable fatal errors now have the backtrace twice. That's fixable.

The whole thing is quite duplicative of code we have in MWExceptionHandler.php. I gather the reasons for using wmerrors are:

  • To provide backtraces for OOMs and timeouts
  • To provide backtraces for catchable fatals. I don't know why this is hard to do. It's probably just some bug in MWExceptionHandler that stops this from happening.
  • For error display. Currently PHP 7 in production gives a 500 error with an empty body, which is converted by Varnish to a generic error page, whereas HHVM tells you what actually happened. wmerrors gives a customisable error page.

For flexibility and to avoid duplication, ideally error handling would just be done in userspace. A comment in the PHP code says that it's unsafe to run userspace code when handling fatal errors. That's not quite as true as it used to be, but I still have some concerns. PHP 7 runs userspace code after zend_bailout(), which wipes the current execution stack. Perhaps if you run userspace code without first wiping the stack, the code would be exposed to stack corruption due to an OOM occurring in the middle of stack modification. We could just try it and see how often it crashes.
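
If the userspace route is tried, the usual mitigation for the OOM case is to give the handler its own memory headroom before it does any work. A sketch of that trick, assuming the shutdown function is still reachable after zend_bailout(); the limit value and log target are illustrative:

<?php
register_shutdown_function( function () {
	$error = error_get_last();
	if ( !$error || $error['type'] !== E_ERROR ) {
		return;
	}
	if ( strpos( $error['message'], 'Allowed memory size' ) === 0 ) {
		// Raise the limit so that building and serialising the log entry does
		// not immediately hit the same OOM again. The exact value is arbitrary.
		ini_set( 'memory_limit', '512M' );
	}
	error_log( '@cee: ' . json_encode( [ 'channel' => 'fatal', 'error' => $error ] ) );
} );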

You can certainly run userspace code on timeout, that's basically what we're doing in Excimer.
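
For the timeout case, a minimal sketch of what that looks like, assuming the ExcimerTimer interface (setInterval/setCallback/start) as described in the Excimer documentation; the interval and exception class are illustrative:

<?php
$timer = new ExcimerTimer();
$timer->setInterval( 0.5 ); // fire once, 0.5 s of wall-clock time from now
$timer->setCallback( function () {
	// Runs in userspace while the original call stack is still live; throwing
	// here propagates out of the interrupted code with a real backtrace.
	throw new RuntimeException( 'Request timed out' );
} );
$timer->start();

try {
	while ( true ) {
		usleep( 1000 ); // simulate slow work until the timer fires
	}
} catch ( RuntimeException $e ) {
	error_log( $e->getTraceAsString() );
}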

  • To provide backtraces for catchable fatals. I don't know why this is hard to do. It's probably just some bug in MWExceptionHandler that stops this from happening.

Your comment inspired me to track down the bug. Turns out it was just some code that was still checking for Exception only rather than Throwable.
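
The fix pattern, roughly (illustrative only; the actual change is in the patch referenced below):

<?php
// Before: Error instances (PHP 7 fatals) fell through checks written for PHP 5:
//     if ( $e instanceof Exception ) { ... produce trace ... }
// After: accept anything throwable, which covers both Exception and Error.
function formatTrace( Throwable $e ): string {
	return $e->getTraceAsString();
}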

Change 512179 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512179

Change 512179 merged by jenkins-bot:
[mediawiki/core@master] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512179

Change 486840 merged by jenkins-bot:
[mediawiki/php/wmerrors@master] Port to PHP 7.2

https://gerrit.wikimedia.org/r/486840

Change 512533 had a related patch set uploaded (by Krinkle; owner: Anomie):
[mediawiki/core@wmf/1.34.0-wmf.6] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512533

Change 512533 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.6] logger: Produce traces for all Throwables

https://gerrit.wikimedia.org/r/512533

Mentioned in SAL (#wikimedia-operations) [2019-05-26T13:37:39Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.6/includes/debug: T187147 / 2be7aa4bc4af36 (duration: 00m 51s)

jijiki added a subscriber: jijiki. Mon, May 27, 6:10 AM
Joe added a comment. Mon, May 27, 3:46 PM

Basic porting work on wmerrors is hopefully complete.

It still writes a text-based log entry to a socket. I suppose JSON is needed?

I would say reproduce the behaviour of HHVM: uncatchable fatal errors are logged to syslog by HHVM directly, with log level error. We generally require services to use the @cee format, which sends a JSON structure and makes the message far more parsable/indexable in Logstash.

See https://wikitech.wikimedia.org/wiki/Logstash/SRE_onboard for a brief introduction on how logging from clients is supposed to work.

The whole thing is quite duplicative of code we have in MWExceptionHandler.php. I gather the reasons for using wmerrors are:

  • To provide backtraces for OOMs and timeouts

Correct. Timeouts (the ones where Excimer can't time out on its own) and OOMs should be handled by wmerrors, as php-fpm is not able to handle them itself beyond logging an error. We're interested not only in handling the message as best we can, but also in adding a backtrace when possible.

  • For error display. Currently PHP 7 in production gives a 500 error with an empty body, which is converted by Varnish to a generic error page, whereas HHVM tells you what actually happened. wmerrors gives a customisable error page.

We'd probably like to be able to use the same PHP page for both? That's currently modules/mediawiki/templates/hhvm-fatal-error.php.erb in ops/puppet.

  • To provide backtraces for catchable fatals. I don't know why this is hard to do. It's probably just some bug in MWExceptionHandler that stops this from happening.

AIUI this was fixed by @Anomie's patch already?

Example from T224491 that lacked the original stack trace.

PHP Fatal Error: Interface 'Wikibase\DataModel\Statement\StatementListHolder' not found

#0 [internal function]: MWExceptionHandler::handleFatalError()
#1 {main}

PHP Fatal Error from line 29 of /srv/mediawiki/php-1.34.0-wmf.6/vendor/wikibase/data-model/src/Entity/Item.php: Interface 'Wikibase\DataModel\Statement\StatementListHolder' not found

Request ID: XO1PdwpAADkAAAL-DwYAAABE
Request   : GET enwiki /wiki/{…} (page view)
herron updated the task description. Tue, Jun 4, 3:19 PM
herron added a comment. Tue, Jun 4, 3:32 PM

Long JSON messages to ELK have been truncated since T187147#5182892, which addresses "Changes to rsyslog/Kafka mean that large errors are now completely lost instead of truncated."

These will no longer be dropped by ELK, but very long messages will still contain truncated raw JSON. To address that, we would need to truncate the fatal_exception.trace field as it is emitted by MW, afaict.
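
One way to do that on the MediaWiki side would be a Monolog processor that clamps the field before the record is serialised. A sketch assuming Monolog's array-based record format (1.x/2.x) and an illustrative 8 KiB cap; the real field layout in MediaWiki's fatal records may differ:

<?php
use Monolog\Logger;

$truncateTrace = function ( array $record ) {
	$max = 8192; // illustrative cap, well under Elasticsearch's 32766-byte term limit
	if ( isset( $record['context']['fatal_exception']['trace'] ) ) {
		$trace =& $record['context']['fatal_exception']['trace'];
		if ( strlen( $trace ) > $max ) {
			$trace = substr( $trace, 0, $max ) . '... [truncated]';
		}
	}
	return $record;
};

$logger = new Logger( 'fatal' );
$logger->pushProcessor( $truncateTrace );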

Change 516763 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/php/wmerrors@master] Add wmerrors.error_script_file

https://gerrit.wikimedia.org/r/516763

Change 516975 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Add a fatal error page to go with the proposed wmerrors feature

https://gerrit.wikimedia.org/r/516975

Change 516975 abandoned by Tim Starling:
Add a fatal error page to go with the proposed wmerrors feature

Reason:
Moving to puppet

https://gerrit.wikimedia.org/r/516975

Change 516988 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/puppet@production] Add a fatal error page to go with the proposed wmerrors feature

https://gerrit.wikimedia.org/r/516988