Page MenuHomePhabricator

wfShellExec errors end up in HHVM log
Closed, ResolvedPublicPRODUCTION ERROR

Description

It seems some (all?) wfShellExec errors end up in the HHVM fatal log, which is quite useless as it contains neither a stack trace nor wiki/URL information. These errors should be reported by wfShellExec via Monolog.

One example is T157644: Shell error from "pdftotext" and "pdfinfo" commands (e.g. "Syntax Error: Invalid XRef entry"); IIRC this was discussed in the past in the context of T111441: SiteConfiguration::getConfig() does not work in Wikimedia production but I can't remember the details.

Event Timeline

thcipriani triaged this task as Medium priority.Feb 15 2017, 4:30 PM
thcipriani added subscribers: hashar, thcipriani.

@hashar believes (and I have no reason to doubt) that this is the ultimate cause of all of the Syntax Error/Warning type messages in the logs, e.g.:

Syntax Warning: Could not parse ligature component \"127\" of \"_127\" in parseCharName
Syntax Error: Invalid XRef entry
Syntax Error (3714139): Bad 'Length' attribute in stream

There are considerable spikes of these errors frequently: https://logstash.wikimedia.org/goto/77b024392153d591645d6e0d48a7da61

I tend to look at all hhvm logstash messages while deploying so the spikes caused by this family of messages make it very difficult to see whether a deployment was successful or not. We actually hide a good deal of these errors in fatalmonitor (see https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor).

Individually blocking each message on the logstash dashboard is clearly a sisyphean task. If anyone can help to fix/mitigate the root cause it would be very much appreciated.

EasyTimeline is an example, it does something like: wfShellExec(timeline.pl), and anything sent to stderr ends up being replayed by hhvm and thus to fatal.log.

The fix would be to pass to wfShellExec [ 'duplicateStderr' => true ] then use structured logging so we get the full context in logstash.

I have some WIP toward that goal for EasyTimeline in https://gerrit.wikimedia.org/r/#/c/335238/ (poke T138036 )

The way wfShellExec handles this sucks. If we don't care about stderr, it should just be sent to devnull - there is no point in spamming the hhvm log with it as without any accompanying information it's almost certainly not going to be actionable. If we do care about it then we should not mix it into the normal output - we are using proc_open which supports multiple output streams.

The syntax is already a horrible mess so it would be best to get https://gerrit.wikimedia.org/r/#/c/319505/ merged first and build on that.

If we don't care about stderr, it should just be sent to devnull - there is no point in spamming the hhvm log with it as without any accompanying information it's almost certainly not going to be actionable. If we do care about it then we should not mix it into the normal output - we are using proc_open which supports multiple output streams.

@MaxSem already has a patch for this (https://gerrit.wikimedia.org/r/#/c/325063/), cool!

T138036 is a similar behavior coming from EasyTimeline. I have some low priority work in progress patch that would capture all output of wfShellExec AND adds structured logging so it shows up nicely in Logstash: https://gerrit.wikimedia.org/r/#/c/335238/

https://gerrit.wikimedia.org/r/#/c/325063/ is being deployed this week so error messages should be discarded from now, unless explicitly mixed into stdout with duplicateStderr. MediaWiki can still access the error output, although by default it doesn't (maybe we should change that). If all goes well, errors like these should disappear tomorrow.

And so they did. I think this and all the subtasks can be closed now.

Closing per above.
Note that https://gerrit.wikimedia.org/r/#/c/385946/ will re-add logging of stderr but it will use proper structured logging (instead of one log event per output line), a dedicated channel, include wiki/server/URL/caller method, and it can be manually suppressed where that makes sense.

Tgr claimed this task.
Tgr reassigned this task from Tgr to MaxSem.
Tgr moved this task from Pending to Done on the User-Tgr board.
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:10 PM