Page MenuHomePhabricator

Consider enabling all MW log channels by default for WMF
Open, Needs TriagePublic

Description

In T228462, we had some errors in a DeferredUpdate which is caught by DeferredUpdates::run() and logged to the DeferredUpdates channel. But the errors/exception was not showing up until @tstarling configured the logging channel in mediawiki-config:

$wmgMonologChannels['default']['DeferredUpdates'] = 'error';

If I understand it properly, that implies our log configuration must explicitly list each of the log buckets. I would have assumed we have a catch all for anything that is at error or above. Instead the channels not referenced fall back to $wmgDefaultMonologHandler which is set to blackhole logs:

wmf-config/InitialiseSettings.php
// Statically configured Monolog handler to clone for log channels that are
// not specifically configured in $wmgMonologChannels.
// See $wmgMonologConfig['handlers'] in logging.php for valid values.
'wmgDefaultMonologHandler' => [
    'default' => 'blackhole',
    'testwiki' => 'wgDebugLogFile',
    'test2wiki' => 'wgDebugLogFile',
],
wmf-config/logging.php
$wmgMonologConfig = [
    'loggers' => [
        // Template for all undefined log channels
        '@default' => [
            'handlers' => (array)$wmgDefaultMonologHandler, 
            'processors' => array_keys( $wmgMonologProcessors ),
            'calls' => $wmgMonologLoggerCalls,
        ],
    ],

So unless a log channel is explicitly referenced, we are missing errors :-\ I guess that instead of using a black hole, we should at least log at error level, if not at warning level. Maybe with an extra context field to differentiate those fallback logs.

Event Timeline

+1, I always found this very annoying. I think it was originally done because Logstash was formally unsupported and not well resourced, and we had to be cautious not to overload it, but that shouldn't be the case anymore.

I always thought the convention of dropping unknown channels was just carried over from the pre-logstash system. Before Monolog, we only had wfDebugLog(), and so the only way to distinguish between debug channels and error channels was by channel name. Dropping unknown channels avoided accidentally DoSing the logging system when a new debug channel was added.

Krinkle renamed this task from Unreferenced log channels are not logged at all causing "logged" errors to be missed to Consider enabling all MW log channels by default for WMF.Aug 27 2019, 2:14 PM
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:05 PM
thcipriani added a subscriber: thcipriani.

Does the new ECS stuff mean this is no longer valid? Or is that work orthogonal. Adding observability to answer that question/determine if this proposal will make logstash fall over.

@thcipriani would it be helpful to set a time to chat about this further? I don't know if there is an immediate plan to move MW to ECS, but lets discuss options available and see if there is a suitable path forward.

I don't think ECS formatting is related to this task.

The issue is that we configure MediaWiki logging without a catchall channel. In MediaWiki that can be done by defining a @default channel which would catch anything that is not explicitly defined. The code is in includes/debug/logger/MonologSpi.php.

I don't quite remember the discussion that happened a couple years ago but I think the fear is that we might end up overflowing the logging infrastructure by catching them all (FUD I have actually no idea). I found out that @default while revisiting this task.

I don't quite know how we can test it. Maybe add the magic @default solely for the testwikis and progressively apply it to more wiki groups?

The issue is that we configure MediaWiki logging without a catchall channel. In MediaWiki that can be done by defining a @default channel which would catch anything that is not explicitly defined. The code is in includes/debug/logger/MonologSpi.php.

We'd only need to set $wmgDefaultMonologHandler = 'logstash-info'. It's already done for testwiki and when using X-Wikimedia-Debug.

Krinkle changed the subtype of this task from "Production Error" to "Task".Apr 14 2021, 8:06 PM
Krinkle awarded a token.