Page MenuHomePhabricator

Standardise on Logstash field for exceptions with back traces
Open, MediumPublic

Description

We currently have at least four different ways of logging the exception trace when logging messages to Logstash.

  1. For fatal errors (channel:fatal), we use fatal_exception => Exception $e which results in fields like fatal_exception.message, fatal_exception.trace, fatal_exception.class, etc.
  1. For uncaught exceptions (channel:exception), we use exception => Exception $e which results in fields like exception.message, exception.trace, etc.
  1. For failed DeferredUpdates (channel:DeferredUpdates), we use [ 'message' => …, 'trace' => ], which results in fields like c_message, trace, etc. (c_message because "message" is a reserved key).
  1. For various other ad-hoc logging we seem to use [ 'trace' => … ] mostly.

Problem: This currently means that Logstash dashboards for individual teams either have to repeat their searches several times (which is quite messy to do in Kibana last time I tried), or end up missing some variants.

Keeping ad-hoc logging separate might be okay as they're not usually errors but rather traces we tacked onto debug messages that otherwise wouldn't have a trace. We're already consistent with those so they would only require 1 duplicate search to include.

Desired outcome: Decide which way we want to log exceptions, for the first three cases above.

Proposal 1: Use exception, exception.trace etc.

Sub tasks:

Event Timeline

Krinkle created this task.Sep 19 2019, 6:08 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 19 2019, 6:08 PM
Krinkle updated the task description. (Show Details)
Krinkle added subscribers: aaron, tstarling.
Anomie added a subscriber: Anomie.
  1. For various other ad-hoc logging we seem to use [ 'trace' => … ] mostly.

For various ad-hoc logging I typically use [ 'exception' => Exception $e ]. Overall that seems to be the more common, too.

  1. For various other ad-hoc logging we seem to use [ 'trace' => … ] mostly.

For various ad-hoc logging I typically use [ 'exception' => Exception $e ]. Overall that seems to be the more common, too.

I didn't think to look for that, thanks. Hadn't seen that one before. Core has 31 uses of trace, and 40 uses of exception (39 if not including the one for the exception channel).

Change 549645 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] exception: Log fatal exception data as 'exception' instead of 'fatal_exception'

https://gerrit.wikimedia.org/r/549645

mmodell added a subscriber: mmodell.Nov 8 2019, 2:36 AM

Change 549645 merged by jenkins-bot:
[mediawiki/core@master] exception: Log fatal exception data as 'exception' instead of 'fatal_exception'

https://gerrit.wikimedia.org/r/549645

Change 553182 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] deferred: Log trace as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/553182

Change 553372 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.35.0-wmf.5] exception: Log fatal exception data as 'exception' instead of 'fatal_exception'

https://gerrit.wikimedia.org/r/553372

Change 553372 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.5] exception: Log fatal exception data as 'exception' instead of 'fatal_exception'

https://gerrit.wikimedia.org/r/553372

Confirmed via mwdebug1001. Messages in channel:fatal having caught_by: mwe_handler (not php-wmerrors), now use exception.* instead of fatal_exception. Thus matching what we do for other uncaught exceptions and what php-wmerrors uses for simpler dashboarding.

Krinkle updated the task description. (Show Details)Dec 2 2019, 10:21 PM
Krinkle updated the task description. (Show Details)
Krinkle updated the task description. (Show Details)

Change 554617 had a related patch set uploaded (by SBassett; owner: SBassett):
[mediawiki/core@master] Log fatal exception data as 'exception' instead of 'trace'

https://gerrit.wikimedia.org/r/554617

Change 554637 had a related patch set uploaded (by Krinkle; owner: SBassett):
[mediawiki/core@wmf/1.35.0-wmf.8] rdbms: Log debug message traces as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/554637

Change 554617 merged by jenkins-bot:
[mediawiki/core@master] rdbms: Log debug message traces as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/554617

Krinkle triaged this task as Medium priority.Dec 5 2019, 12:00 AM

Change 554637 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.8] rdbms: Log debug message traces as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/554637

Mentioned in SAL (#wikimedia-operations) [2019-12-05T22:00:47Z] <krinkle@deploy1001> Synchronized php-1.35.0-wmf.8/includes/libs/rdbms/database/: T233342 (duration: 01m 02s)

Change 556289 had a related patch set uploaded (by Reedy; owner: SBassett):
[mediawiki/core@REL1_34] rdbms: Log debug message traces as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/556289

Change 556291 had a related patch set uploaded (by Reedy; owner: SBassett):
[mediawiki/core@REL1_33] rdbms: Log debug message traces as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/556291

Change 556292 had a related patch set uploaded (by Reedy; owner: SBassett):
[mediawiki/core@REL1_32] rdbms: Log debug message traces as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/556292

Change 556293 had a related patch set uploaded (by Reedy; owner: SBassett):
[mediawiki/core@REL1_31] rdbms: Log debug message traces as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/556293

Change 556289 merged by jenkins-bot:
[mediawiki/core@REL1_34] rdbms: Log debug message traces as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/556289

Change 556293 merged by jenkins-bot:
[mediawiki/core@REL1_31] rdbms: Log debug message traces as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/556293

Change 556291 merged by jenkins-bot:
[mediawiki/core@REL1_33] rdbms: Log debug message traces as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/556291

Krinkle reassigned this task from Krinkle to Reedy.Dec 11 2019, 6:31 PM

Change 556509 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] exception: Remove ErrorException workaround in handleFatalError()

https://gerrit.wikimedia.org/r/556509

Change 556292 merged by jenkins-bot:
[mediawiki/core@REL1_32] rdbms: Log debug message traces as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/556292

Change 556509 merged by jenkins-bot:
[mediawiki/core@master] exception: Remove ErrorException workaround in handleFatalError()

https://gerrit.wikimedia.org/r/556509

Change 553182 merged by jenkins-bot:
[mediawiki/core@master] deferred: Log trace as 'exception.trace' instead of 'trace'

https://gerrit.wikimedia.org/r/553182

daniel added a subscriber: daniel.Apr 16 2020, 11:51 AM

I'm not sure I understand the proposal for the "ad-hoc" case, as in

For various other ad-hoc logging we seem to use [ 'trace' => … ] mostly.

What should replace this with? This is not reporting an exception. The idea is to log a warning and include a stack trace as context. Reporting this as exception.trace would be misleading. I think the 'trace' key is fine for this use case.

Krinkle added a comment.EditedApr 16 2020, 9:36 PM

I'm not sure I understand the proposal for the "ad-hoc" case, as in

For various other ad-hoc logging we seem to use [ 'trace' => … ] mostly.

What should replace this with?

With exception. And various ad-hoc code paths have already been updated this way from the above patches (e.g. rdbms).

Task description

Proposal: Use exception, exception.trace etc.

This is not reporting an exception. The idea is to log a warning and include a stack trace as context. Reporting this as exception.trace would be misleading.

It won't make it reported as an uncaught exception. It's just the name of UI field inside Kibana. Creating new RuntimeException to call getTraceAsString on directly is no more real an exception than creating it and passing it to Logger to handle deal format accordingly.

I think the 'trace' key is fine for this use case.

The problem is that stack traces need to be sanitized as they tend to be copy-pasted to public tasks (T234014). Doing that ad-hoc is a waste of our time, and also will inevitably lead to incidents. The other thing is that by having a consistent key, Kibana almost becomes usable and might make it so that teams can actually be effective in maintaining a dashboard to look our for events relevant to code they maintain. Which channels they monitor is their own choice, but the filter for matching stack traces will be universal (in this case, exception.trace).