Page MenuHomePhabricator

Standardise on Logstash channel for fatal exceptions (fatal, exception, DeferredUpdates)
Closed, ResolvedPublic

Description

We have:

  • fatal,
  • exception,
  • DeferredUpdates

It's become quite difficult to explain the difference between these, and the difference is blurring more and more over time as details change over time within PHP, MediaWiki, php-fpm, and our way of configuring these.

For example, one could say that fatal is native PHP fatals (OOM, timeout, segfault). And that exception is those that are explicitly thrown and (eventually) caught by MediaWiki.

… except timeouts from the native php-excimer extension are thrown as normal exceptions and so end up in exception. This is a detail developers shouldn't need to worry about.

… and even "regular" exceptions can end up in fatal because when they are (also) reported by php-wmerrors via /etc/php/php7-fatal-error.php, then we can't distinguish between the two and use the fatal channel always.

… and then there are fatal errors which happen during DeferredUpdates, which in addition to having a custom log message in their own channel, are prevented from going to fatal or exception, (contrary to JobRunner and ResourceLoader, which send them also as regular exception for logging convenience).

Due to the last item, I've been trying to figure out all the different places where we monitor exception and fatal channels to include DeferredUpdates, but even a year in I still find new places where we've not yet done so. For example, the MediaWiki Alerts dashboard doesn't include it yet, the Incinga alerts for MediaWiki exceptions doesn't either, and I think Scap canary's query also doesn't yet look for it.

Proposal

  • Retire the fatal channel within MediaWiki and move its remainig messages to exception instead.
  • Change DeferredUpdates.php to call MWExceptionHandler::logException instead of doing its own thing. This means it will be part of exception like other uncaught exceptions are.
    • Its custom message can remain in its own channel, but we'd no longer need to include this channel in our monitoring stuff etc.
  • Update /etc/php/php7-fatal-error.php to use exception instead.

See also:

Event Timeline

Krinkle created this task.Mar 6 2020, 6:23 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 6 2020, 6:23 PM

Change 577645 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] mediawiki: Change php-wmerrors channel from "fatal" to as "exception"

https://gerrit.wikimedia.org/r/577645

Krinkle updated the task description. (Show Details)Mar 6 2020, 6:49 PM

Change 577665 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] exception,deferred: Standardise on 'exception' for uncaughts and fatals

https://gerrit.wikimedia.org/r/577665

Change 577665 merged by jenkins-bot:
[mediawiki/core@master] exception,deferred: Standardise on 'exception' for uncaughts and fatals

https://gerrit.wikimedia.org/r/577665

Krinkle claimed this task.Mar 10 2020, 5:47 PM
Krinkle triaged this task as Medium priority.

Change 577645 merged by Alexandros Kosiaris:
[operations/puppet@production] mediawiki: Change php-wmerrors channel from "fatal" to as "exception"

https://gerrit.wikimedia.org/r/577645

Krinkle closed this task as Resolved.Mar 16 2020, 8:45 PM

Change 582153 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] scap: Sync logstash_checker.py canary query with current dashboard

https://gerrit.wikimedia.org/r/582153

Change 582153 merged by Giuseppe Lavagetto:
[operations/puppet@production] scap: Sync logstash_checker.py canary query with current dashboard

https://gerrit.wikimedia.org/r/582153