Page MenuHomePhabricator

Capture PHP warnings with stacktraces in MediaWiki and save to logstash
Closed, ResolvedPublic

Description

This task is about PHP errors, for fatals, see T89169.


Although we get PHP warnings in the apache syslogs, we don't really get much information about it:

Dec 13 18:30:43 10.0.11.48 apache2[9987]: PHP Warning: preg_match() expects parameter 2 to be string, array given in /usr/local/apache/common-local/php-1.21wmf6/extensions/FundraiserLandingPage/FundraiserLandingPage.php on line 96
Dec 13 18:30:43 10.0.2.198 apache2[24462]: PHP Warning: preg_match() expects parameter 2 to be string, array given in /usr/local/apache/common-local/php-1.21wmf6/extensions/FundraiserLandingPage/FundraiserLandingPage.php on line 96

We could do with recording more information (to a different log file on fluorine), like we do for both the fatal log and the exception logs


Patches: https://gerrit.wikimedia.org/r/#/q/message:T45086

Checklist:

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
bd808 lowered the priority of this task from High to Low.Feb 25 2016, 11:00 PM

What's the current state of this task? Is it blocked in the log volume being too high? There are a few things we could do about that:

  • only log warnings (at least initially), not notices
  • sample
  • send suppressed errors to a separate channel which can be blackholed
  • cut down the number of warnings/notices in production (a lot of that has been happening recently, so maybe it's worth doing another experiment and reassess the situation?)
In T45086#3041403, @Tgr wrote:

What's the current state of this task? Is it blocked in the log volume being too high? There are a few things we could do about that:

  • only log warnings (at least initially), not notices
  • sample
  • send suppressed errors to a separate channel which can be blackholed
  • cut down the number of warnings/notices in production (a lot of that has been happening recently, so maybe it's worth doing another experiment and reassess the situation?)

One way to provide granularity without creating N log channels would be to actually use the PSR3 log level in the generated log events:

  • suppressed -> debug
  • E_STRICT -> debug
  • E_NOTICE, E_USERNOTICE, E_DEPRECATED, E_USER_DEPRECATED -> info
  • E_WARNING, E_CORE_WARNING, E_COMPILE_WARNING, E_USER_WARNING -> warning
  • E_RECOVERABLE_ERROR -> error

It would be a pretty simple patch to MWExceptionHandler to allow the caller of MWExceptionHandler::logError() to specify a severity level and then to adjust MWExceptionHandler::handleError() to set an appropriate level. The current implementation in MWExceptionHandler::logError() emits all events as severity error.

We [return false from the error handler] that because it's the correct way for a default set_error_handler() to function. If we swallowed the errors there without returning false then $php_errormsg would not be populated and all errors would be handled as though the @ suppression operator was applied to every line in the codebase.

We don't seem to use $php_errormsg anymore, so as far as I can see all that would do is to switch error logginrg to Monolog instead of the PHP/HHVM log files. (Assuming we die / throw an exception for errors which would interrupt program execution when unhandled.) Isn't that a good thing? It seems counterproductive to have the same error in logstash both as a Monolog event with traces etc. and a mostly useless HHVM record. Or are we worried about Monolog being more fragile / less reliable than logfiles/syslog?

Change 338820 had a related patch set uploaded (by Tim Starling):
Route PHP warnings from the handler into logstash

https://gerrit.wikimedia.org/r/338820

In T45086#3041502, @Tgr wrote:

We [return false from the error handler] that because it's the correct way for a default set_error_handler() to function. If we swallowed the errors there without returning false then $php_errormsg would not be populated and all errors would be handled as though the @ suppression operator was applied to every line in the codebase.

We don't seem to use $php_errormsg anymore, so as far as I can see all that would do is to switch error logginrg to Monolog instead of the PHP/HHVM log files. (Assuming we die / throw an exception for errors which would interrupt program execution when unhandled.) Isn't that a good thing? It seems counterproductive to have the same error in logstash both as a Monolog event with traces etc. and a mostly useless HHVM record. Or are we worried about Monolog being more fragile / less reliable than logfiles/syslog?

I think you are confusing the log data that we collect in WMF production and how PHP/HHVM work. The error handler must either abort the execution of the request or not. By not returning false and instead handling everything internally in the custom handler, the handler itself becomes responsible for deciding when to abort the request. If the handler returns (i.e. doesn't abort the request somehow) then it is functionally the same as applying @ suppression to all lines in all scripts.

Is someone actually saying the log volume is too high? It looks to me like it's blocked on nobody giving a crap since 2015. Suppressed errors were removed from the stream in https://gerrit.wikimedia.org/r/#/c/177096/

One way to provide granularity without creating N log channels would be to actually use the PSR3 log level in the generated log events:

I would still put suppressed logs into their own channel because the way we seem to be using suppression is usually to prevent the program from being stopped and not so much to suppress logging, so having severity information for suppressed events seems useful. I agree about the rest (I actually wrote the patch yesterday but apparently forgot to submit it).

Change 338821 had a related patch set uploaded (by BryanDavis):
MWExceptionHandler::handleError: Set log severity based on error level

https://gerrit.wikimedia.org/r/338821

By not returning false and instead handling everything internally in the custom handler, the handler itself becomes responsible for deciding when to abort the request.

Yes, and it's a trivial decision (assuming we want to replicate the existing behavior): just abort on *_ERROR, unless PHP error reporting is turned off. So would there be any disadvantage to doing that? We could abort via exceptions which is more useful (HHVM can log the trace when the script aborts with an error but normal PHP can't), plus generally have better logging, at the cost of higher log volume / manual setup needed. It should probably be configurable and off by default but for WMF servers (and anyone else who cares enough about error logging to fine-tune it) it seems like a win.

In T45086#3041550, @Tgr wrote:

By not returning false and instead handling everything internally in the custom handler, the handler itself becomes responsible for deciding when to abort the request.

Yes, and it's a trivial decision (assuming we want to replicate the existing behavior): just abort on *_ERROR, unless PHP error reporting is turned off. So would there be any disadvantage to doing that? We could abort via exceptions which is more useful (HHVM can log the trace when the script aborts with an error but normal PHP can't), plus generally have better logging, at the cost of higher log volume / manual setup needed. It should probably be configurable and off by default but for WMF servers (and anyone else who cares enough about error logging to fine-tune it) it seems like a win.

How is raising an unhandled exception functionally different than passing control back to the default error handler? Is this an attempt to get more data into the PHP runtime's error log or less? I'm confused about the goal here.

How is raising an unhandled exception functionally different than passing control back to the default error handler? Is this an attempt to get more data into the PHP runtime's error log or less? I'm confused about the goal here.

The point is that the handler should not return false on errors which PHP would only log but not abort on, so that warnings don't get duplicated. For errors which need the script to abort, IMO just push them to the exception channel (ie. don't log in the error handler but throw an exception) because that behaves nicer with non-log-based debugging (e.g. the developer would see a stack trace instead of just an error message, the user would see an exception hash that they can report). But logging and returning false, or just dieing, are also viable alternatives, the primary issue is log duplication for non-fatals.

In T45086#3041550, @Tgr wrote:

By not returning false and instead handling everything internally in the custom handler, the handler itself becomes responsible for deciding when to abort the request.

Yes, and it's a trivial decision (assuming we want to replicate the existing behavior): just abort on *_ERROR, unless PHP error reporting is turned off. So would there be any disadvantage to doing that? We could abort via exceptions which is more useful (HHVM can log the trace when the script aborts with an error but normal PHP can't), plus generally have better logging, at the cost of higher log volume / manual setup needed. It should probably be configurable and off by default but for WMF servers (and anyone else who cares enough about error logging to fine-tune it) it seems like a win.

If the handler returns false then $php_errormsg is not set (based on code inspection). Grep shows it being used by PEAR Net/Socket, it calls @ini_set('track_errors', 1).

In T45086#3041561, @Tgr wrote:

The point is that the handler should not return false on errors which PHP would only log but not abort on, so that warnings don't get duplicated.

Ok, I'm finally getting what you are working towards. Thanks for sticking with me. :) It would need some config flag as you mentioned in T45086#3041550 since the default behavior of the logging system is discard everything.

Change 338821 merged by jenkins-bot:
MWExceptionHandler::handleError: Set log severity based on error level

https://gerrit.wikimedia.org/r/338821

Change 338901 had a related patch set uploaded (by Gergő Tisza):
Do not mix suppressed errors with non-suppressed ones

https://gerrit.wikimedia.org/r/338901

Change 338911 had a related patch set uploaded (by Gergő Tisza):
Make it possible to not propagate errors to PHP

https://gerrit.wikimedia.org/r/338911

Change 338820 merged by jenkins-bot:
Route PHP warnings from the handler into udp2log

https://gerrit.wikimedia.org/r/338820

Example error. See T136849#3045285 for comments on the format. (Except exception.trace seems different here. Or just has random errors in different ways than for other exceptions.)

The error channel (with severity=ERROR or maybe WARNING) should be added added to fatalmonitor on logstash, I suppose?

It's been several months, let's look at the current state of affairs:

Known issues or remaining tasks:

  • Nothing? Resolved?

Checklist (EDIT: Moved to task description.)

Change 374895 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] exception: Support message normalisation for structured logging

https://gerrit.wikimedia.org/r/374895

Change 374895 merged by jenkins-bot:
[mediawiki/core@master] exception: Support message normalisation for structured logging

https://gerrit.wikimedia.org/r/374895

Change 409171 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Stop PHP errors from going to the hhvm channel

https://gerrit.wikimedia.org/r/409171

Change 338911 merged by jenkins-bot:
[mediawiki/core@master] Make it possible to not propagate errors to PHP

https://gerrit.wikimedia.org/r/338911

Change 409171 merged by jenkins-bot:
[operations/mediawiki-config@master] Stop PHP errors from going to the hhvm channel

https://gerrit.wikimedia.org/r/409171

Mentioned in SAL (#wikimedia-operations) [2018-02-12T19:07:47Z] <niharika29@tin> Synchronized wmf-config/InitialiseSettings-labs.php: Stop PHP errors from going to the hhvm channel T45086 (duration: 00m 56s)

Change 423338 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Set $wgPropagateErrors to false in production

https://gerrit.wikimedia.org/r/423338

Seems like some PHP errors are only logged in the HHVM channel (T193339). Not sure whether they end up there with a mechanism that $wgPropagateError would affect.

In T45086#4166452, @Tgr wrote:

Seems like some PHP errors are only logged in the HHVM channel (T193339). Not sure whether they end up there with a mechanism that $wgPropagateError would affect.

At T193339, the following sample is referenced from type:hhvm:

2018-04-29T09:25:59
t  _type	hhvm
t  host	mw1315
t  level	NOTICE
t  message	Notice: Undefined index: 0 in /srv/mediawiki/php-1.32.0-wmf.1/extensions/AbuseFilter/includes/AbuseFilter.php on line 1498

Looking in type:mediawiki (dashboard "mediawiki-errors" with INFO and DEBUG un-disabled) I do find the same error trending

Trending
Count: 16,672
Message:
[{exception_id}] {exception_url} ErrorException from line 1498 of /srv/mediawiki/php-1.32.0-wmf.1/extensions/AbuseFilter/includes/AbuseFilter.php: PHP Notice: Undefined index: 0

And the exact duplicate (based on timestamp, host and message):
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2018.04.29/mediawiki?id=AWMQuSfhpesmgM3l0MzT&_g=h@b6494f7

Sample
@timestamp	2018-04-29T09:25:59
t  _type	mediawiki
t  host		mw1315
t  level	INFO
t  server	ar.wikipedia.org
t  url		/w/api.php
[..]
t  exception.message	PHP Notice: Undefined index: 0
t  exception.trace	#0 /srv/mediawiki/php-1.32.0-wmf.1/extensions/AbuseFilter/includes/AbuseFilter.php(1498): MWExceptionHandler::handleError(integer, string, string, integer, array, array)
#1 /srv/mediawiki/php-1.32.0-wmf.1/extensions/AbuseFilter/includes/AbuseFilter.php(955): AbuseFilter::takeConsequenceAction(string, array, Title, AbuseFilterVariableHolder, string, integer)
#2 /srv/mediawiki/php-1.32.0-wmf.1/extensions/AbuseFilter/includes/AbuseFilter.php(1105): AbuseFilter::executeFilterActions(array, Title, AbuseFilterVariableHolder)
#3 /srv/mediawiki/php-1.32.0-wmf.1/extensions/AbuseFilter/includes/AbuseFilterHooks.php(119): AbuseFilter::filterAction(AbuseFilterVariableHolder, Title)
#4 /srv/mediawiki/php-1.32.0-wmf.1/extensions/AbuseFilter/includes/AbuseFilterHooks.php(52): AbuseFilterHooks::filterEdit(RequestContext, WikitextContent, string, Status, string, boolean)
#5 /srv/mediawiki/php-1.32.0-wmf.1/includes/Hooks.php(177): AbuseFilterHooks::onEditFilterMergedContent(RequestContext, WikitextContent, Status, string, User, boolean)
#6 /srv/mediawiki/php-1.32.0-wmf.1/includes/Hooks.php(205): Hooks::callHook(string, array, array, NULL)
#7 /srv/mediawiki/php-1.32.0-wmf.1/includes/EditPage.php(1679): Hooks::run(string, array)
#8 /srv/mediawiki/php-1.32.0-wmf.1/includes/EditPage.php(1998): EditPage->runPostMergeFilters(WikitextContent, Status, User)
#9 /srv/mediawiki/php-1.32.0-wmf.1/includes/EditPage.php(1506): EditPage->internalAttemptSave(NULL, boolean)
#10 /srv/mediawiki/php-1.32.0-wmf.1/includes/api/ApiEditPage.php(392): EditPage->attemptSave(NULL)
#11 /srv/mediawiki/php-1.32.0-wmf.1/includes/api/ApiMain.php(1579): ApiEditPage->execute()
#12 /srv/mediawiki/php-1.32.0-wmf.1/includes/api/ApiMain.php(504): ApiMain->executeAction()
#13 /srv/mediawiki/php-1.32.0-wmf.1/extensions/VisualEditor/includes/ApiVisualEditorEdit.php(69): ApiMain->execute()
#14 /srv/mediawiki/php-1.32.0-wmf.1/extensions/VisualEditor/includes/ApiVisualEditorEdit.php(382): ApiVisualEditorEdit->saveWikitext(Title, string, array)
#15 /srv/mediawiki/php-1.32.0-wmf.1/includes/api/ApiMain.php(1579): ApiVisualEditorEdit->execute()
#16 /srv/mediawiki/php-1.32.0-wmf.1/includes/api/ApiMain.php(535): ApiMain->executeAction()
#17 /srv/mediawiki/php-1.32.0-wmf.1/includes/api/ApiMain.php(506): ApiMain->executeActionWithErrorHandling()
#18 /srv/mediawiki/php-1.32.0-wmf.1/api.php(83): ApiMain->execute()
#19 /srv/mediawiki/w/api.php(3): include(string)
#20 {main}

Hm, not sure why I didn't find it. Thanks Krinkle.

What would be the next step here? Get a full list of all type:hhvm errors, filter out the ones which are also found in type:mediawiki channel:error, make sure that the rest are not propagated PHP errors, then enable $wgPropagateError in production?

@Tgr Yeah, take of small chunk of say 2 hours and then correlate as you describe. Keep in mind that T200960 was/is causing serious loss of data all over the place, including between these two channels. It's close to being fixed, so best to look at Grafana and find a comfortable period of time range where there was 0 packet loss (which until last week, was never).

Change 423338 merged by jenkins-bot:
[operations/mediawiki-config@master] Set $wgPropagateErrors to false in production

https://gerrit.wikimedia.org/r/423338

Mentioned in SAL (#wikimedia-operations) [2018-10-13T23:37:21Z] <krinkle@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T45086 - I4857e8ac (duration: 00m 51s)

I wasn't aware of $wgPropagateErrors, indeed we no more need errors to be spurts to HHVM stdout (which is type:hhvm). That is a nice improvement to the logging and tracking errors :]

Krinkle added a project: Performance-Team.

Last remaining check box:

  • Remove unused exception-json conversion filter from filter-mediawiki.conf in operations/puppet.git.

Pushing back. But would be nice to finish up at some point, but not now. To anyone else that happens to run into it: feel free to take it.

Change 494042 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] logstash: Remove filter for unused 'exception-json' channel

https://gerrit.wikimedia.org/r/494042

Change 494042 merged by Herron:
[operations/puppet@production] logstash: Remove filter for unused 'exception-json' channel

https://gerrit.wikimedia.org/r/494042

Krinkle removed a project: Patch-For-Review.
Krinkle moved this task from Blocked (old) to Doing (old) on the Performance-Team board.
Krinkle updated the task description. (Show Details)

Thanks for pushing this through the finish line @Krinkle!

Nice! That concludes a multiple years project :-]