Page MenuHomePhabricator

Log php fatals with full backtraces again (fatal.log on fluorine)
Closed, ResolvedPublic

Description

Right now most of the fatals that occur are being logged in hhvm.log in a pretty meaningless way, especially for things like out of memory errors. We need to get the traces back by logging them from within MediaWiki PHP where possible:

Patches: https://gerrit.wikimedia.org/r/#/q/message:T89169

This task is about fatals, for PHP errors, see T45086.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 202741 had a related patch set uploaded (by Legoktm):
Add fatal log group

https://gerrit.wikimedia.org/r/202741

Progress! We now have logs that look like:

2015-04-08 15:08:56 mw1254 commonswiki: [277a894f] /wiki/Blahblahblah   ErrorException from line 262 of /srv/mediawiki/php-1.25wmf24/includes/exception/MWExceptionHandler.php: Fatal Error: request has exceeded memory limit
#0 (): MWExceptionHandler::handleFatalError()
#1 {main}

Progress! We now have logs that look like:

2015-04-08 15:08:56 mw1254 commonswiki: [277a894f] /wiki/Blahblahblah   ErrorException from line 262 of /srv/mediawiki/php-1.25wmf24/includes/exception/MWExceptionHandler.php: Fatal Error: request has exceeded memory limit
#0 (): MWExceptionHandler::handleFatalError()
#1 {main}

It looks like HHVM is calling handleFatalError() after the stack has been unwound. This isn't terribly surprising when you notice that handleFatalError() is installed as a shutdown function. I'm afraid that if we want full stack traces like we had in PHP5 we are going to need to make a custom extension (or HHVM patch) that does something like our wmerrors PHP5 extension contrary to the initial analysis done in T56163.

not sure if this is related, but suddenly I'm seeing this on fatalmonitor:

Cannot modify header information - headers already sent in /srv/mediawiki/hhvm-fatal-error.php on line 814

Today I saw a stacktrace in fatal.log, (T103716#1397193), but looking through the log I also found other fatals without stacktraces :(

Change 233588 had a related patch set uploaded (by BryanDavis):
Add support for multiple error handlers

https://gerrit.wikimedia.org/r/233588

Change 233589 had a related patch set uploaded (by BryanDavis):
Enhance stacktrace logging for fatals

https://gerrit.wikimedia.org/r/233589

Change 233590 had a related patch set uploaded (by BryanDavis):
Enhance debug log output for stacktraces

https://gerrit.wikimedia.org/r/233590

Change 233588 abandoned by BryanDavis:
Add support for multiple error handlers

Reason:
Too fragile. There's probably a reason that PHP doesn't ship with the ability to do this. :/

https://gerrit.wikimedia.org/r/233588

I spent some time last weekend digging around in the HHVM source code to see what our options might be for getting better stacktrace output.

My patch at https://gerrit.wikimedia.org/r/#/c/233589/ tries to make use of one un(der)documented feature of HHVM's callbacks to set_error_handler() handlers. HHVM passes a 6th argument containing a stacktrace to the callback. This when combined with the hhvm.error_handling.call_user_handler_on_fatals ini setting will dispatch fatal errors to a callback with all the information we need to make a nice log entry.

We had discovered this before, but since PHP and HHVM only install one user defined error handler at a time the callback in HHVMRequestInit.php is replaced by the callback installed by MWExceptionHandler. This same problem is actually also biting us in Hooks::run which installs its own error handler callback to convert some errors into exceptions. This effectively made us blind to fatal errors that happened inside a hook.

The substantive changes in https://gerrit.wikimedia.org/r/#/c/233589/ are:

  • Use stacktrace provided to callback rather than creating a new one via Exception construction
  • Log using a context that looks like Monolog's internal expansion of an Exception rather than an actual exception
  • Make the error handler in Hooks chain to the default one in MWExceptionHandler for errors that are not converted to Exceptions.
  • Stop double logging fatals via error handler and shutdown hook but remembering when the error handler has seen a fatal.

Other than some possible formatting tweaks I think this is about as far aw we can go with the built-in tools of HHVM. I hope it will be good enough for our needs. If it's not, I also figured out that HHVM's built-in logger system is modular and could be replaced with a custom C++ class that did other logging things. The existing ExtendedLogger doesn't add things that we would want but it shows how it could be done.

A few other things I discovered about HHVM's logger:

  • Syslog mode (which we use) only provides the basic log message and most of the hhvm.log.* settings do not effect it.
  • hhvm.log.native_stack_trace only works with hhvm.log.use_file and outputs a hexencoded trace of the HPHP internal stack.
  • If the value of hhvm.log.file starts with | then HHVM will treat the rest of the config value as a popen() command and send log messages to stdin of the spawned process.

Change 233589 merged by jenkins-bot:
Enhance stacktrace logging for fatals

https://gerrit.wikimedia.org/r/233589

Change 233590 merged by jenkins-bot:
Enhance debug log output for stacktraces

https://gerrit.wikimedia.org/r/233590

Change 236994 had a related patch set uploaded (by BryanDavis):
logging: Configure monolog to output stack traces

https://gerrit.wikimedia.org/r/236994

Change 236994 merged by jenkins-bot:
logging: Configure monolog to output stack traces

https://gerrit.wikimedia.org/r/236994

/a/mw-log/fatal.log on fluorine seems to have useful backtraces for fatal errors that we can catch in HHVM user space now.

/a/mw-log/fatal.log on fluorine seems to have useful backtraces for fatal errors that we can catch in HHVM user space now.

I saw many cases there this is not the case in the past, I don't think that bug should have been closed. fatal.log on fluorine is still lacking a lot of information compared to what we had with Zend.

Example where a backtrace is missing in both hhvm.log and logstash: T132645

bd808 removed bd808 as the assignee of this task.Apr 14 2016, 4:30 PM
In T89169#2207281, @hoo wrote:

I saw many cases there this is not the case in the past, I don't think that bug should have been closed. fatal.log on fluorine is still lacking a lot of information compared to what we had with Zend.

See analysis in T89169#1573731 for the basic outline of what will be needed to get better logging than what we have been able to achieve with the custom PHP-space error handler. The error logging that we had with Zend was due to a custom PHP extension.

One likely cause:
includes/specials/SpecialRunJobs.php: set_error_handler( function ( $errno, $errstr ) {

One likely cause:
includes/specials/SpecialRunJobs.php: set_error_handler( function ( $errno, $errstr ) {

Yeah, that will break fatal logging badly. I see in T115413: " ErrorException from line 136 of /usr/home/saper/public_html/b/includes/WebResponse.php: PHP Warning: Cannot modify header information - headers already sent" when running Special:RunJobs why @aaron added that handler, but it should delegate to MWExceptionHandler::handleError rather than returning false.

Change 283511 had a related patch set uploaded (by BryanDavis):
SpecialRunJobs: delegate error handling to MWExceptionHandler

https://gerrit.wikimedia.org/r/283511

Change 283555 had a related patch set uploaded (by JanZerebecki):
SpecialRunJobs: delegate error handling to MWExceptionHandler

https://gerrit.wikimedia.org/r/283555

Change 283511 merged by jenkins-bot:
SpecialRunJobs: delegate error handling to MWExceptionHandler

https://gerrit.wikimedia.org/r/283511

Change 283555 merged by jenkins-bot:
SpecialRunJobs: delegate error handling to MWExceptionHandler

https://gerrit.wikimedia.org/r/283555

Mentioned in SAL [2016-04-25T15:07:06Z] <jzerebecki@tin> Synchronized php-1.27.0-wmf.21/includes/specials/SpecialRunJobs.php: wmf21 rMWe411ad62516b T89169 : SpecialRunJobs: delegate error handling to MWExceptionHandler (duration: 00m 37s)

It seems there are cases where MWExceptionHandler with the hhvm we use doesn't work, see T132645#2240440 . With zend php however the same worked.

How did multiline log message handling for errors where HHVM was willing to log traces? T157396: HHVM fills logstash with junk (due to not handling multiline errors?) could use something like that.

I just wanted to note that this still is a problem and apparently many fatals are not even making it into fatal.log these days (as far as I can tell, for example T171295).

Krinkle claimed this task.

As of a few weeks ago when I checked, it seems all fatals are making their way to both fatal.log on mwlog1001 and to Logstash under both type:mediawiki channel:fatal (with stack traces) and under type:hvvm (without stacktraces).

In addition to fatal errors, we also log regular exceptions and php errors (per T45086), under type:mediawiki channel:exception and type:mediawiki channel:error (both with traces), and under type:hhvm (without traces).

See T45086 for removing the traceless duplicates from the hhvm log.

Closing this for now. File new tasks under Wikimedia-Logstash or MediaWiki-Debug-Logger.