Example event:
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2019.12.18/mediawiki/?id=AW8bbGJAarkxubcmBO13
type:mediawiki channel:fatal caught_by: /etc/php/php7-fatal-error.php (via wmerrors) exception.file: /srv/mediawiki/wmf-config/set-time-limit.php:39 exception.message: Uncaught WMFTimeoutException: the execution time limit of 60 seconds was exceeded in /srv/mediawiki/wmf-config/set-time-limit.php:39 Stack trace: #0 /srv/mediawiki/php-1.35.0-wmf.10/includes/parser/Parser.php(414): {closure}(1) #1 [internal function]: Parser->__destruct() #2 {main} thrown exception.trace: message: Uncaught WMFTimeoutException: the execution time limit of 60 seconds was exceeded in /srv/mediawiki/wmf-config/set-time-limit.php:39 Stack trace: #0 /srv/mediawiki/php-1.35.0-wmf.10/includes/parser/Parser.php(414): {closure}(1) #1 [internal function]: Parser->__destruct() #2 {main} thrown
Notice how exception.trace is empty. And, instead it is included (but cut off) from the exception.message.
Looking at the php-fpm stderr I find for the same host/timestamp this event:
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2019.12.18/syslog/?id=AW8bbGJAarkxubcmBO15
program :php-fpm message: PHP Fatal error: Uncaught WMFTimeoutException: the execution time limit of 60 seconds was exceeded in /srv/mediawiki/wmf-config/set-time-limit.php:39 Stack trace: #0 /srv/mediawiki/php-1.35.0-wmf.10/includes/parser/Parser.php(414): {closure}(1) #1 [internal function]: Parser->__destruct() #2 {main} thrown in /srv/mediawiki/wmf-config/set-time-limit.php on line 39
Looking at Logstash more generally it seems that for shutdown/destruct stacks, this is always the case. And for others, it is correctly reported through exception.trace. This is an issue for two reasons:
- As part of the message it's repeated three times (approximating overall limit, risking getting trimmed). And even within its own field, risks getting trimmed, as already seen above even for the relatively small stack. This might be caused by other reasons as well, not sure yet.
- Due to not being in exception.trace it means product and/or team specific Logstash dashboards never see these failures (given we filter by trace).