Page MenuHomePhabricator

Stack for shutdown/destruct fatals missing from php7-fatal-error.php logs
Closed, ResolvedPublic

Description

Example event:
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2019.12.18/mediawiki/?id=AW8bbGJAarkxubcmBO13

type:mediawiki channel:fatal
caught_by: /etc/php/php7-fatal-error.php (via wmerrors)
exception.file: /srv/mediawiki/wmf-config/set-time-limit.php:39

exception.message:
Uncaught WMFTimeoutException: the execution time limit of 60 seconds was exceeded in /srv/mediawiki/wmf-config/set-time-limit.php:39
Stack trace:
#0 /srv/mediawiki/php-1.35.0-wmf.10/includes/parser/Parser.php(414): {closure}(1)
#1 [internal function]: Parser->__destruct()
#2 {main}
  thrown

exception.trace:

message:
Uncaught WMFTimeoutException: the execution time limit of 60 seconds was exceeded in /srv/mediawiki/wmf-config/set-time-limit.php:39
Stack trace:
#0 /srv/mediawiki/php-1.35.0-wmf.10/includes/parser/Parser.php(414): {closure}(1)
#1 [internal function]: Parser->__destruct()
#2 {main}
  thrown

Notice how exception.trace is empty. And, instead it is included (but cut off) from the exception.message.

Looking at the php-fpm stderr I find for the same host/timestamp this event:
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2019.12.18/syslog/?id=AW8bbGJAarkxubcmBO15

program :php-fpm
message:
PHP Fatal error:  Uncaught WMFTimeoutException: the execution time limit of 60 seconds was exceeded in /srv/mediawiki/wmf-config/set-time-limit.php:39
Stack trace:
#0 /srv/mediawiki/php-1.35.0-wmf.10/includes/parser/Parser.php(414): {closure}(1)
#1 [internal function]: Parser->__destruct()
#2 {main}
  thrown in /srv/mediawiki/wmf-config/set-time-limit.php on line 39

Looking at Logstash more generally it seems that for shutdown/destruct stacks, this is always the case. And for others, it is correctly reported through exception.trace. This is an issue for two reasons:

  1. As part of the message it's repeated three times (approximating overall limit, risking getting trimmed). And even within its own field, risks getting trimmed, as already seen above even for the relatively small stack. This might be caused by other reasons as well, not sure yet.
  2. Due to not being in exception.trace it means product and/or team specific Logstash dashboards never see these failures (given we filter by trace).

Event Timeline

Change 559262 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] mediawiki: Capture shutdown/destruct backtrace in php7-fatal-error.php

https://gerrit.wikimedia.org/r/559262

@Joe @jijiki Would like to deploy this soon. Ping me any time this week if you can squeeze it in so I can stand by for confirmation.

EDIT: Wanted to test in Beta first but can't due to T242658. Once that resolves, I'll test there and then ping back here.

@Krinkle, I am fine to merge any time, as soon as you have tested it:)

Krinkle changed the task status from Open to Stalled.Jan 14 2020, 5:25 PM
Krinkle triaged this task as Medium priority.
Krinkle moved this task from Doing (old) to Blocked (old) on the Performance-Team board.

Change 559262 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki: Capture shutdown/destruct backtrace in php7-fatal-error.php

https://gerrit.wikimedia.org/r/559262