Page MenuHomePhabricator

Consider enabling all MW log channels by default for WMF
Closed, ResolvedPublic

Description

In T228462, we had some errors in a DeferredUpdate which is caught by DeferredUpdates::run() and logged to the DeferredUpdates channel. But the errors/exception was not showing up until @tstarling configured the logging channel in mediawiki-config:

$wmgMonologChannels['default']['DeferredUpdates'] = 'error';

If I understand it properly, that implies our log configuration must explicitly list each of the log buckets. I would have assumed we have a catch all for anything that is at error or above. Instead the channels not referenced fall back to $wmgDefaultMonologHandler which is set to blackhole logs:

wmf-config/InitialiseSettings.php
// Statically configured Monolog handler to clone for log channels that are
// not specifically configured in $wmgMonologChannels.
// See $wmgMonologConfig['handlers'] in logging.php for valid values.
'wmgDefaultMonologHandler' => [
    'default' => 'blackhole',
    'testwiki' => 'wgDebugLogFile',
    'test2wiki' => 'wgDebugLogFile',
],
wmf-config/logging.php
$wmgMonologConfig = [
    'loggers' => [
        // Template for all undefined log channels
        '@default' => [
            'handlers' => (array)$wmgDefaultMonologHandler, 
            'processors' => array_keys( $wmgMonologProcessors ),
            'calls' => $wmgMonologLoggerCalls,
        ],
    ],

So unless a log channel is explicitly referenced, we are missing errors :-\ I guess that instead of using a black hole, we should at least log at error level, if not at warning level. Maybe with an extra context field to differentiate those fallback logs.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Krinkle renamed this task from Unreferenced log channels are not logged at all causing "logged" errors to be missed to Consider enabling all MW log channels by default for WMF.Aug 27 2019, 2:14 PM
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:05 PM
thcipriani subscribed.

Does the new ECS stuff mean this is no longer valid? Or is that work orthogonal. Adding observability to answer that question/determine if this proposal will make logstash fall over.

@thcipriani would it be helpful to set a time to chat about this further? I don't know if there is an immediate plan to move MW to ECS, but lets discuss options available and see if there is a suitable path forward.

I don't think ECS formatting is related to this task.

The issue is that we configure MediaWiki logging without a catchall channel. In MediaWiki that can be done by defining a @default channel which would catch anything that is not explicitly defined. The code is in includes/debug/logger/MonologSpi.php.

I don't quite remember the discussion that happened a couple years ago but I think the fear is that we might end up overflowing the logging infrastructure by catching them all (FUD I have actually no idea). I found out that @default while revisiting this task.

I don't quite know how we can test it. Maybe add the magic @default solely for the testwikis and progressively apply it to more wiki groups?

The issue is that we configure MediaWiki logging without a catchall channel. In MediaWiki that can be done by defining a @default channel which would catch anything that is not explicitly defined. The code is in includes/debug/logger/MonologSpi.php.

We'd only need to set $wmgDefaultMonologHandler = 'logstash-info'. It's already done for testwiki and when using X-Wikimedia-Debug.

Krinkle changed the subtype of this task from "Production Error" to "Task".Apr 14 2021, 8:06 PM
Krinkle awarded a token.

Another example of time wasted discovering and addressing a lacking log channel: T300712#7918757.

Also, if we had a unified or default configuration that enabled, e.g. info or warning level and above, it would likely also encourage the (imho) best practice of temporary raising telemetry happening by changing the source of the message in question instead of "temporarily" raising the level with an override for that channel, as that would be the easier and less divergent change to make; also with the easier to understand and expect outcome that we all know what level is enabled in production generally, which right now is rather unpredictible, leading to its own mistakes/confusions.

Change #1018637 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/mediawiki-config@master] logging: default to log any error

https://gerrit.wikimedia.org/r/1018637

After five years, it is never too late. This has hit us again this week while I was deploying the MediaWiki train and I guess it is finally time to address it and log all errors.

Change #1019253 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/mediawiki-config@master] logging: always register udp2log handlers

https://gerrit.wikimedia.org/r/1019253

Change #1019253 merged by jenkins-bot:

[operations/mediawiki-config@master] logging: always register udp2log handlers

https://gerrit.wikimedia.org/r/1019253

Mentioned in SAL (#wikimedia-operations) [2024-04-23T09:29:27Z] <hashar@deploy1002> Started scap: Backport for [[gerrit:1019253|logging: always register udp2log handlers (T228838)]]

Mentioned in SAL (#wikimedia-operations) [2024-04-23T09:31:51Z] <hashar@deploy1002> hashar: Backport for [[gerrit:1019253|logging: always register udp2log handlers (T228838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-04-23T09:44:39Z] <hashar@deploy1002> Finished scap: Backport for [[gerrit:1019253|logging: always register udp2log handlers (T228838)]] (duration: 15m 11s)

Change #1023441 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/mediawiki-config@master] logging: do not explicitly set blackhole handler

https://gerrit.wikimedia.org/r/1023441

Change #1023441 merged by jenkins-bot:

[operations/mediawiki-config@master] logging: do not explicitly set blackhole handler

https://gerrit.wikimedia.org/r/1023441

Mentioned in SAL (#wikimedia-operations) [2024-04-25T07:45:56Z] <hashar@deploy1002> Started scap: Backport for [[gerrit:1023441|logging: do not explicitly set blackhole handler (T228838)]]

Mentioned in SAL (#wikimedia-operations) [2024-04-25T07:48:43Z] <hashar@deploy1002> hashar: Backport for [[gerrit:1023441|logging: do not explicitly set blackhole handler (T228838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-04-25T08:02:13Z] <hashar@deploy1002> Finished scap: Backport for [[gerrit:1023441|logging: do not explicitly set blackhole handler (T228838)]] (duration: 16m 17s)

Change #1018637 abandoned by Hashar:

[operations/mediawiki-config@master] logging: default to log any error

Reason:

Abandoning for now, but I might revisit one day.

https://gerrit.wikimedia.org/r/1018637

Change #1018637 restored by Hashar:

[operations/mediawiki-config@master] logging: default to log any error

https://gerrit.wikimedia.org/r/1018637

Change #1072207 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[operations/mediawiki-config@master] logging: Default to log any error (on beta and group0)

https://gerrit.wikimedia.org/r/1072207

Change #1072226 had a related patch set uploaded (by Hashar; author: Bartosz Dziewoński):

[operations/mediawiki-config@master] logging: Default to log any error (on beta)

https://gerrit.wikimedia.org/r/1072226

I am assigning this to @matmarex since he has done the hard work when refactoring the logging configuration to a point my original patch became a one liner.

Change #1072226 merged by jenkins-bot:

[operations/mediawiki-config@master] logging: Default to log any error (on beta)

https://gerrit.wikimedia.org/r/1072226

The beta cluster should now catch catch errors from any channels that are not configured. That can be watched at https://beta-logs.wmcloud.org/

It looks like the only new errors on the beta cluster appeared on the QuickSurvey channel. That's better than I expected. Filed as T374654: Log messages at ERROR level on QuickSurvey channel: "Bad survey configuration: The XXX external survey must have a secure url.".

Change #1072207 merged by jenkins-bot:

[operations/mediawiki-config@master] logging: Default to log any error (on group0)

https://gerrit.wikimedia.org/r/1072207

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:56:07Z] <hashar@deploy1003> Started scap sync-world: Backport for [[gerrit:1072207|logging: Default to log any error (on group0) (T228838)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:58:05Z] <hashar@deploy1003> matmarex, hashar: Backport for [[gerrit:1072207|logging: Default to log any error (on group0) (T228838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-16T14:04:06Z] <hashar@deploy1003> Finished scap sync-world: Backport for [[gerrit:1072207|logging: Default to log any error (on group0) (T228838)]] (duration: 07m 59s)

Change #1073232 had a related patch set uploaded (by Hashar; author: Bartosz Dziewoński):

[operations/mediawiki-config@master] logging: Default to log any error (on group1)

https://gerrit.wikimedia.org/r/1073232

The group0 deployment revealed some more interesting errors in production, I filed a task for each (see above).

Change #1073232 merged by jenkins-bot:

[operations/mediawiki-config@master] logging: Default to log any error (on group1)

https://gerrit.wikimedia.org/r/1073232

Mentioned in SAL (#wikimedia-operations) [2024-09-17T07:20:08Z] <hashar@deploy1003> Started scap sync-world: Backport for [[gerrit:1073232|logging: Default to log any error (on group1) (T228838)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-17T07:28:23Z] <hashar@deploy1003> hashar: Backport for [[gerrit:1073232|logging: Default to log any error (on group1) (T228838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-17T07:35:49Z] <hashar@deploy1003> Finished scap sync-world: Backport for [[gerrit:1073232|logging: Default to log any error (on group1) (T228838)]] (duration: 15m 36s)

Yesterday I have modified the mediawiki-production-logging Grafana dashboard to be able to filter the log metrics by level and channel. After having enabled default logging on group1, I can see a small bump which apparently comes primarily from the http logging bucket. The objectcache and memcached have bumps while we do the deployment, which was already the case previously and is unrelated.

error_by_channels.png (576×1 px, 85 KB)

Change #1073408 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/mediawiki-config@master] logging: rm per channel 'error' logging

https://gerrit.wikimedia.org/r/1073408

Change #1018637 merged by jenkins-bot:

[operations/mediawiki-config@master] logging: Default to log any error (all wikis)

https://gerrit.wikimedia.org/r/1018637

Mentioned in SAL (#wikimedia-operations) [2024-09-17T13:02:16Z] <hashar@deploy1003> Started scap sync-world: Backport for [[gerrit:1018637|logging: Default to log any error (all wikis) (T228838)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-17T13:04:24Z] <hashar@deploy1003> hashar: Backport for [[gerrit:1018637|logging: Default to log any error (all wikis) (T228838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-17T13:09:23Z] <hashar@deploy1003> Finished scap sync-world: Backport for [[gerrit:1018637|logging: Default to log any error (all wikis) (T228838)]] (duration: 07m 06s)

Thanks for deploying all these patches @hashar!

I had another look at the new logs and filed a few more tasks, as you can see above. I'll stop reviewing them now. We've discovered some new problems, but the volume of the new logs is surprisingly non-catastrophic!


By the way, the http channel is very noisy: https://logstash.wikimedia.org/goto/15985f9cb47379a9ade6f5bf18add706

image.png (530×2 px, 91 KB)

There's a dozen different things making HTTP requests that fail, either sometimes or always. I haven't bothered filing bugs for all of them. I suppose occasional failures are expected, but it's hard to tell which ones matter and which don't, and the format of the logs makes it difficult to even tell how often each error occurs. I submitted a small patch that should make them a bit more usable: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1073507

Change #1073408 merged by jenkins-bot:

[operations/mediawiki-config@master] logging: rm per channel 'error' logging

https://gerrit.wikimedia.org/r/1073408

Mentioned in SAL (#wikimedia-operations) [2024-09-17T20:48:23Z] <hashar@deploy1003> Started scap sync-world: Backport for [[gerrit:1073408|logging: rm per channel 'error' logging (T228838)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-17T20:50:28Z] <hashar@deploy1003> hashar: Backport for [[gerrit:1073408|logging: rm per channel 'error' logging (T228838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-17T20:56:46Z] <hashar@deploy1003> Finished scap sync-world: Backport for [[gerrit:1073408|logging: rm per channel 'error' logging (T228838)]] (duration: 08m 22s)

I think that concludes the five years adventure to have error level logs logged by default \o/

A public view is accessible via https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging which I have polished a bit as part of this task.

Huge thanks @matmarex for your refactoring of wmf-config/logging.php which eventually made it trivial to log all errors by default.