Page MenuHomePhabricator

Errors managed by php-wmerrors (like OOMs) lack normalized_message on logstash
Closed, ResolvedPublic

Description

Looking at https://logstash.wikimedia.org/goto/df21b39c8ace0202a1893be36d885642 it appears that very few log messages (~30/hr) end up with an empty normalized_message, and look like this:

These errors are all produced when php-fpm bails out of execution and log handling is deferred to the wmerrors extension, that executes a php script contained in mediawiki-config. Inspecting the messages, it seems they have correctly the exception.message populated, but have no normalized message to show.

I think we can improve the situation for at least the most common cases.

Event Timeline

jijiki created this task.Sep 25 2019, 5:11 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 25 2019, 5:11 PM
Joe renamed this task from Some syslog messages to Errors managed by wmf-errors (like OOMs) lack normalized_message on logstash.Sep 26 2019, 9:54 AM
Joe updated the task description. (Show Details)
herron added a subscriber: herron.

These logs appear to be nesting the message field inside the exception field, and the message field at the root is not present. Which should explain why normalized_message is containing a literal %{message}.

Here's the JSON from an example log pulled from Kafka.

{
  "timestamp": "2019-09-27T19:57:30.276417+00:00",
  "logsource": "mw1348",
  "host": "mw1348",
  "program": "php7.2-fpm",
  "severity": "err",
  "facility": "daemon",
  "exception": {
    "message": "Allowed memory size of 692060160 bytes exhausted (tried to allocate 20480 bytes)",
    "file": "/srv/mediawiki/php-1.34.0-wmf.24/vendor/wikibase/data-model/src/Snak/PropertyValueSnak.php:67",
    "trace": "#1  unserialize() called at [/srv/mediawiki/php-1.34.0-wmf.24/vendor/wikibase/data-model/src/Snak/PropertyValueSnak.php:67]\n#2  Wikibase\\DataModel\\Snak\\PropertyValueSnak->unserialize()\n#3  unserialize() called at [/srv/mediawiki/php-1.34.0-wmf.24/vendor/wikibase/data-model/src/Snak/SnakList.php:290]\n#4  Wikibase\\DataModel\\Snak\\SnakList->unserialize()\n#5  Memcached->get() called at [/srv/mediawiki/php-1.34.0-wmf.24/includes/libs/objectcache/MemcachedPeclBagOStuff.php:186]\n#6  MemcachedPeclBagOStuff->doGet() called at [/srv/mediawiki/php-1.34.0-wmf.24/includes/libs/objectcache/MediumSpecificBagOStuff.php:112]\n#7  MediumSpecificBagOStuff->get() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/Wikibase/lib/includes/Store/EntityRevisionCache.php:74]\n#8  Wikibase\\Lib\\Store\\EntityRevisionCache->get() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/Wikibase/lib/includes/Store/CacheRetrievingEntityRevisionLookup.php:104]\n#9  Wikibase\\Lib\\Store\\CacheRetrievingEntityRevisionLookup->getEntityRevisionFromCache() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/Wikibase/lib/includes/Store/CacheRetrievingEntityRevisionLookup.php:72]\n#10 Wikibase\\Lib\\Store\\CacheRetrievingEntityRevisionLookup->getEntityRevision() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/Wikibase/lib/includes/Store/RevisionBasedEntityLookup.php:40]\n#11 Wikibase\\Lib\\Store\\RevisionBasedEntityLookup->getEntity() called at [/srv/mediawiki/php-1.34.0-wmf.24/vendor/wikibase/data-model-services/src/Lookup/RedirectResolvingEntityLookup.php:51]\n#12 Wikibase\\DataModel\\Services\\Lookup\\RedirectResolvingEntityLookup->getEntity() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Helper/ExceptionIgnoringEntityLookup.php:37]\n#13 WikibaseQuality\\ConstraintReport\\ConstraintCheck\\Helper\\ExceptionIgnoringEntityLookup->getEntity() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Checker/SymmetricChecker.php:106]\n#14 WikibaseQuality\\ConstraintReport\\ConstraintCheck\\Checker\\SymmetricChecker->checkConstraint() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/WikibaseQualityConstraints/src/ConstraintCheck/DelegatingConstraintChecker.php:566]\n#15 WikibaseQuality\\ConstraintReport\\ConstraintCheck\\DelegatingConstraintChecker->getCheckResultFor() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/WikibaseQualityConstraints/src/ConstraintCheck/DelegatingConstraintChecker.php:460]\n#16 WikibaseQuality\\ConstraintReport\\ConstraintCheck\\DelegatingConstraintChecker->checkConstraintsForMainSnak() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/WikibaseQualityConstraints/src/ConstraintCheck/DelegatingConstraintChecker.php:374]\n#17 WikibaseQuality\\ConstraintReport\\ConstraintCheck\\DelegatingConstraintChecker->checkStatement() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/WikibaseQualityConstraints/src/ConstraintCheck/DelegatingConstraintChecker.php:346]\n#18 WikibaseQuality\\ConstraintReport\\ConstraintCheck\\DelegatingConstraintChecker->checkEveryStatement() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/WikibaseQualityConstraints/src/ConstraintCheck/DelegatingConstraintChecker.php:155]\n#19 WikibaseQuality\\ConstraintReport\\ConstraintCheck\\DelegatingConstraintChecker->checkAgainstConstraintsOnEntityId() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/WikibaseQualityConstraints/src/Api/CheckingResultsSource.php:53]\n#20 WikibaseQuality\\ConstraintReport\\Api\\CheckingResultsSource->getResults() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/WikibaseQualityConstraints/src/Api/CachingResultsSource.php:247]\n#21 WikibaseQuality\\ConstraintReport\\Api\\CachingResultsSource->getAndStoreResults() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/WikibaseQualityConstraints/src/Api/CachingResultsSource.php:179]\n#22 WikibaseQuality\\ConstraintReport\\Api\\CachingResultsSource->getResults() called at [/srv/mediawiki/php-1.34.0-wmf.24/extensions/WikibaseQualityConstraints/src/Api/CheckConstraints.php:179]\n#23 WikibaseQuality\\ConstraintReport\\Api\\CheckConstraints->execute() called at [/srv/mediawiki/php-1.34.0-wmf.24/includes/api/ApiMain.php:1599]\n#24 ApiMain->executeAction() called at [/srv/mediawiki/php-1.34.0-wmf.24/includes/api/ApiMain.php:537]\n#25 ApiMain->executeActionWithErrorHandling() called at [/srv/mediawiki/php-1.34.0-wmf.24/includes/api/ApiMain.php:508]\n#26 ApiMain->execute() called at [/srv/mediawiki/php-1.34.0-wmf.24/api.php:87]\n#27 require(/srv/mediawiki/php-1.34.0-wmf.24/api.php) called at [/srv/mediawiki/w/api.php:3]\n"
  },
  "phpversion": "7.2.22-1+0~20190902.26+debian9~1.gbpd64eb7+wmf1",
  "reqId": "XY5pEwpAIDwAAInohCsAAABC",
  "http_method": "GET",
  "server": "www.wikidata.org",
  "url": "/w/api.php?action=wbcheckconstraints&format=json&formatversion=2&uselang=en&id=Q865&status=violation%7Cwarning%7Csuggestion%7Cbad-parameters",
  "ip": "redacted"
}

Change 539621 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: parse nested JSON in php7.2-fpm exception field

https://gerrit.wikimedia.org/r/539621

Change 539623 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: if php7.2-fpm message field is empty, use exception.message

https://gerrit.wikimedia.org/r/539623

A few ideas to address this:

Parse the nested exception field into the root of the logstash event: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/539621/

Copy exception.message to message: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/539623/

Or of course could address this at the source, so that a message field is included within the @cee formatted message.

Krinkle renamed this task from Errors managed by wmf-errors (like OOMs) lack normalized_message on logstash to Errors managed by php-wmerrors (like OOMs) lack normalized_message on logstash.Sep 27 2019, 9:21 PM
Krinkle updated the task description. (Show Details)

I think this should be fixed at the source in puppet: php7-fatal-error.php.

The json fields here were modelled after MediaWiki's MWExceptionHandler format, and the plugins we have to it in production (wmf-config/logging.php).

But, the omission of the message field was unintentional.

Joe added a subscriber: Joe.Sep 30 2019, 8:08 AM

I think this should be fixed at the source in puppet: php7-fatal-error.php.
The json fields here were modelled after MediaWiki's MWExceptionHandler format, and the plugins we have to it in production (wmf-config/logging.php).
But, the omission of the message field was unintentional.

I agree, and I intended to do that today.

Joe added a comment.Sep 30 2019, 8:32 AM

The json fields here were modelled after MediaWiki's MWExceptionHandler format, and the plugins we have to it in production (wmf-config/logging.php).
But, the omission of the message field was unintentional.

I agree, and I intended to do that today.

On second thoughts, we already have a problem with traces that are too long for the normal 32k of the syslog message. Duplicating "message" at emission time would only exhacerbate the problem.

I prefer @herron's approach for this reason.

OTOH, After looking further into this, I think we're actually counting these errors multiple times - see for instance https://logstash.wikimedia.org/goto/c92404ce11520d59feb9211471d5bf09

Krinkle triaged this task as High priority.Sep 30 2019, 1:27 PM

[…] After looking further into this, I think we're actually counting these errors multiple times - see for instance https://logstash.wikimedia.org/goto/c92404ce11520d59feb9211471d5bf09

One is fpm stderr, which is not useful (no stack traces, no rich MW context fields). This is similar to the hhvm syslog we've had in the past, and our custom syslog rewriter for that.

The one from fatal-error.php should appear under type:mediawiki channel:fatal instead of type:syslog. Right now these are not being found by developers and most monitoring queries. Classifying as high priority because this means every Scap deploy and train promotion may be acting on false confidence – given 100% PHP 7 traffic.

@Joe wrote:

I prefer @herron's approach for this reason.

Yeah, that seems neat indeed. Perhaps we can duplicate the exception.message field into message. The creation of normalized_message I assume will be done by another Logstash filter (we don't have any added abstraction information to create a useful custom one).

Change 539623 merged by Herron:
[operations/puppet@production] logstash: if php7.2-fpm message field is empty, use exception.message

https://gerrit.wikimedia.org/r/539623

Change 539875 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: if php7.2-fpm message field is empty, use exception.message

https://gerrit.wikimedia.org/r/539875

Change 539875 merged by Herron:
[operations/puppet@production] logstash: if php7.2-fpm message field is empty, use exception.message

https://gerrit.wikimedia.org/r/539875

Change 539881 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] scap: mediawiki logstash_checker

https://gerrit.wikimedia.org/r/539881

Copying excepetion.message to message looks to have made an improvement here. I see no results in the past 15 minutes when querying for normalized_message:%{messsage}

for example https://logstash.wikimedia.org/goto/1f5fb74aaad53e1fe98c50aa5a245881

fgiunchedi moved this task from Inbox to In progress on the observability board.Sep 30 2019, 3:19 PM

Change 539621 abandoned by Herron:
logstash: parse nested JSON in php7.2-fpm exception field

Reason:
in favor of I6793e9b8bf6d5332fb966c64fb2c066a6419bc2a

https://gerrit.wikimedia.org/r/539621

[…] After looking further into this, I think we're actually counting these errors multiple times - see for instance https://logstash.wikimedia.org/goto/c92404ce11520d59feb9211471d5bf09

One is fpm stderr, which is not useful (no stack traces, no rich MW context fields). This is similar to the hhvm syslog we've had in the past, and our custom syslog rewriter for that.
The one from fatal-error.php should appear under type:mediawiki channel:fatal instead of type:syslog. Right now these are not being found by developers and most monitoring queries. Classifying as high priority because this means every Scap deploy and train promotion may be acting on false confidence – given 100% PHP 7 traffic.

Splitt off into T234283.